Scalable Proximity Estimation and Link Prediction in Online Social Networks


Han Hee Song, Tae Won Cho, Vacha Dave, Yin Zhang, Lili Qiu
The University of Texas at Austin
{hhsong, khatz, vacha, yzhang,

ABSTRACT
Proximity measures quantify the closeness or similarity between nodes in a social network and form the basis of a range of applications in social sciences, business, information technology, computer networks, and cyber security. It is challenging to estimate proximity measures in online social networks due to their massive scale (with millions of users) and dynamic nature (with hundreds of thousands of new nodes and millions of edges added daily). To address this challenge, we develop two novel methods to efficiently and accurately approximate a large family of proximity measures. We also propose a novel incremental update algorithm to enable near real-time proximity estimation in highly dynamic social networks. Evaluation based on a large amount of real data collected in five popular online social networks shows that our methods are accurate and can easily scale to networks with millions of nodes. To demonstrate the practical value of our techniques, we consider a significant application of proximity estimation: link prediction, i.e., predicting which new edges will be added in the near future based on past snapshots of a social network. Our results reveal that (i) the effectiveness of different proximity measures for link prediction varies significantly across different online social networks and depends heavily on the fraction of edges contributed by the highest degree nodes, and (ii) combining multiple proximity measures consistently yields the best link prediction accuracy.
Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services - Web-based services; J.4 [Computer Applications]: Social and Behavioral Sciences - Sociology

General Terms
Algorithms, Human Factors, Measurement

Keywords
Social Network, Proximity Measure, Link Prediction, Embedding, Matrix Factorization, Sketch

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IMC '09, November 4-6, 2009, Chicago, Illinois, USA. Copyright 2009 ACM ...$...

1. INTRODUCTION
A social network [53] is a social structure modeled as a graph, where nodes represent people or other entities embedded in a social context, and edges represent specific types of interdependency among entities, e.g., values, visions, ideas, financial exchange, friendship, kinship, dislike, conflict or trade. Understanding the nature and evolution of social networks has important applications in a number of fields such as sociology, anthropology, biology, economics, information science, and computer science. Traditionally, studies on social networks often focus on relatively small social networks (e.g., [3, 3] examine co-authorship networks with about 5 nodes). Recently, however, social networks have gained tremendous popularity in the cyber space. Online social networks such as MySpace [4], Facebook [8] and YouTube [55] have each attracted tens of millions of visitors every month [44] and are now among the most popular sites on the Web [4].
The wide variety of online social networks and the vast amount of rich information available in these networks represent an unprecedented research opportunity for understanding the nature and evolution of social networks at massive scale. A central concept in the computational analysis of social networks is the proximity measure, which quantifies the closeness or similarity between nodes in a social network. Proximity measures form the basis for a wide range of important applications in social and natural sciences (e.g., modeling complex networks [6, 3, 25, 42]), business (e.g., viral marketing [23], fraud detection []), information technology (e.g., improving Internet search [35], collaborative filtering [7]), computer networks (e.g., constructing overlay networks [45]), and cyber security (e.g., mitigating spams [22], defending against Sybil attacks [56]). Unfortunately, the explosive growth of online social networks imposes significant challenges on proximity estimation. First, online social networks are typically massive in scale. For example, MySpace has over 400 million user accounts [4], and Facebook reportedly has over 200 million active users worldwide [9]. As a result, many proximity measures that are highly effective in relatively small social networks (e.g., the classic Katz measure [26]) become computationally prohibitive in large online social networks with millions of nodes [48]. Second, online social networks are often highly dynamic, with hundreds of thousands of new nodes and millions of edges added daily. In such fast-evolving social networks, it is challenging to compute up-to-date proximity measures in a timely fashion.

Approach and contributions. To address the above challenges, we develop two novel techniques, proximity sketch and proximity embedding, for efficient and accurate proximity estimation in large social networks with millions of nodes.
We then augment these techniques with a novel incremental proximity update algorithm to enable near real-time proximity estimation in highly dynamic

social networks. Our techniques are applicable to a large family of commonly used proximity measures, which includes the aforementioned Katz measure [26], as well as rooted PageRank [3, 3] and escape probability [5]. These proximity measures are known to be highly effective for many applications [3, 3, 5], but were previously considered computationally prohibitive for large social networks [48, 5].

To demonstrate the practical value of our techniques, we consider a significant application of proximity estimation: link prediction, which refers to the task of predicting the edges that will be added to a social network in the future based on past snapshots of the network. As shown in [3, 3], proximity measures lie right at the heart of link prediction. Understanding which proximity measures lead to the most accurate link predictions provides valuable insights into the nature of social networks and can serve as the basis for comparing various network evolution models (e.g., [6, 3, 25, 42]). Accurate link prediction also allows online social networks to automatically make high-quality recommendations on potential new friends, making it much easier for individual users to expand their social neighborhood.

We evaluate the effectiveness of our proximity estimation methods using a large amount of real data collected in five popular online social networks: Digg [4], Flickr [2], LiveJournal [33], MySpace [4], and YouTube [55]. Our results show that our methods are accurate and can easily scale to handle large social networks with millions of nodes and hundreds of millions of edges. We also conduct extensive experiments to compare the effectiveness of a variety of proximity measures for link prediction in these online social networks.
Our results uncover two interesting new findings: (i) the effectiveness of different proximity measures varies significantly across different networks and depends heavily on the fraction of edges contributed by the highest degree nodes, and (ii) combining multiple proximity measures using an off-the-shelf machine learning software package consistently yields the best link prediction accuracy.

Paper organization. The rest of the paper is organized as follows. In Section 2, we develop techniques to efficiently and accurately approximate a large family of proximity measures in massive, dynamic online social networks. In Section 3, we describe link prediction techniques. In Section 4, we evaluate both proximity estimation and link prediction in five popular online social networks. In Section 5, we review related work. We conclude in Section 6.

2. SCALABLE PROXIMITY ESTIMATION
Proximity measures are the basis for many applications of social networks. As a result, a variety of proximity measures have been proposed. The simplest proximity measures are based on either the shortest graph distance or the maximum information flow between two nodes. One can also define proximity measures based on node neighborhoods (e.g., the number of common neighbors). Finally, several more sophisticated proximity measures involve infinite sums over the ensemble of all paths between two nodes (e.g., the Katz measure [26], rooted PageRank [3, 3], and escape probability [5]). Compared with more direct proximity measures such as shortest graph distances and numbers of shared neighbors, path-ensemble based proximity measures capture more information about the underlying social structure and have been shown to be more effective in social networks with thousands of nodes [3, 3, 5]. Despite the effectiveness of path-ensemble based proximity measures, it is computationally expensive to summarize the ensemble of all paths between two nodes.
The state of the art in estimating path-ensemble based proximity measures (e.g., [5]) typically can only handle social networks with tens of thousands of nodes. As a result, recent works on proximity estimation in large social networks (e.g., [48]) either dismiss path-ensemble based proximity measures due to their prohibitive computational cost or leave it as future work to compare with these proximity measures.

In this section, we address the above challenge by developing efficient and accurate techniques to approximate a large family of path-ensemble based proximity measures. Our techniques can handle social networks with millions of nodes, which are several orders of magnitude larger than what the state of the art can support. In addition, our techniques can support near real-time proximity estimation in highly dynamic social networks.

2.1 Problem Formulation
Below we first formally define three commonly used path-ensemble based proximity measures: (i) the Katz measure, (ii) rooted PageRank, and (iii) escape probability. We then show that all three proximity measures can be efficiently estimated by solving a common subproblem, which we term the proximity inversion problem. In all our discussions below, we model a social network as a graph G = (V, E), where V is the set of nodes, and E is the set of edges. G can be either undirected or directed, depending on whether the social relationship is symmetric.

Katz measure. The Katz measure [26] is a classic path-ensemble based proximity measure. It is designed to capture the following simple intuition: the more paths there are between two nodes, and the shorter these paths are, the stronger the relationship is (because there are more opportunities for the two nodes to discover and interact with each other in the social network). Given two nodes x, y ∈ V, the Katz measure Katz[x, y] is a weighted sum of the number of paths from x to y, exponentially damped by length to count short paths more heavily.
Formally, we have

    Katz[x, y] = Σ_{l=1}^{∞} β^l · |paths^{(l)}_{x,y}|    (1)

where paths^{(l)}_{x,y} is the set of length-l paths from x to y, and β is a damping factor. Let A be the adjacency matrix of graph G, where

    A[x, y] = 1 if ⟨x, y⟩ ∈ E, and 0 otherwise.    (2)

As shown in [3], the Katz measures between all pairs of nodes (represented as a matrix Katz) can be derived as a function of the adjacency matrix A and the damping factor β as follows:

    Katz = Σ_{l=1}^{∞} β^l A^l = (I − βA)^{−1} − I    (3)

where I is the identity matrix. Thus, in order to compute Katz, we just need to compute the matrix inverse (I − βA)^{−1}.

Rooted PageRank. The rooted PageRank [3, 3] is a special instance of personalized PageRank [8, 2]. It defines a random walk on the underlying graph G = (V, E) to capture the probability for two nodes to run into each other, and uses this probability as an indicator of node-to-node proximity. Specifically, given two nodes x, y ∈ V, the rooted PageRank RPR[x, y] is defined as the stationary probability of y under the following random walk: (i) with probability 1 − β_RPR, jump to node x, and (ii) with probability β_RPR, move to a random neighbor of the current node. The rooted PageRank between all node pairs (represented as a matrix RPR) can be derived as follows. Let D be a diagonal matrix with D[i, i] = Σ_j A[i, j]. Let T = D^{−1}A be the adjacency matrix with row sums normalized to 1. We then have:

    RPR = (1 − β_RPR)(I − β_RPR·T)^{−1}    (4)
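As a concrete illustration, the closed forms in Eqs. (3) and (4) can be checked on a small graph. The following sketch is purely illustrative (a hypothetical 4-node example, not from the paper):

```python
import numpy as np

# Toy undirected 4-node graph with edges (0,1), (0,2), (1,2), (2,3)
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
I = np.eye(4)
beta = 0.1  # damping factor; must be below 1/spectral_radius(A)

# Katz = (I - beta*A)^-1 - I                                    (Eq. 3)
Katz = np.linalg.inv(I - beta * A) - I

# Rooted PageRank: T = D^-1 A is the row-stochastic walk matrix
T = A / A.sum(axis=1, keepdims=True)
beta_rpr = 0.85
RPR = (1 - beta_rpr) * np.linalg.inv(I - beta_rpr * T)          # (Eq. 4)
```

Nodes joined by many short paths (0 and 1) score higher under Katz than distant pairs (0 and 3), and each row of RPR is a probability distribution over landing nodes.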

Therefore, to compute RPR, we just need to compute the matrix inverse (I − β_RPR·T)^{−1}. Also note that the standard PageRank can be computed simply as the average of all the columns of RPR.

Escape probability. The escape probability [5] is another path-ensemble based proximity measure. Given two nodes x, y ∈ V, the escape probability EP[x, y] from x to y is defined as the probability that a random walk which starts from node x will visit node y before it returns to node x [6]. The escape probability EP[x, y] can be directly derived from the rooted PageRank as follows:

    EP[x, y] = Q[x, y] / (Q[x, x]·Q[y, y] − Q[x, y]·Q[y, x])    (5)

where matrix Q = RPR/(1 − β_RPR) = (I − β_RPR·T)^{−1}. As shown in [6], when the underlying graph G = (V, E) is undirected, the escape probability EP is also closely related to several other random-walk induced proximity or distance measures: effective conductance EC, effective resistance ER, and commute time CT. Specifically, we have:

    EC[x, y] = |N(x)| · EP[x, y]    (6)
    ER[x, y] = 1 / EC[x, y]    (7)
    CT[x, y] = 2·|E| · ER[x, y]    (8)

The common subproblem: proximity inversion. From the above discussions, it is evident that the key to estimating all three path-ensemble based proximity measures is to efficiently compute elements of the following matrix inverse:

    P = (I − βM)^{−1} = Σ_{l=0}^{∞} β^l M^l    (9)

where M is a sparse nonnegative matrix with millions of rows and columns, I is an identity matrix of the same size, and β is a damping factor. We term this common subproblem the proximity inversion problem.

2.2 Scalable Proximity Inversion
The key challenge in solving the proximity inversion problem (i.e., computing elements of matrix P = (I − βM)^{−1}) is that while M is a sparse matrix, P is a dense matrix with millions of rows and columns. It is thus computationally prohibitive to compute and/or store the entire P matrix.
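To make the proximity inversion problem concrete, a small numerical sketch (hypothetical sizes, not the paper's code) verifies Eq. (9): the exact inverse of a sparse M agrees with a truncated power series, which converges geometrically when β is small:

```python
import numpy as np

rng = np.random.default_rng(3)
m, beta, l_max = 40, 0.05, 60
# Sparse nonnegative M (a stand-in for A or T); ~8% of entries non-zero
M = (rng.random((m, m)) < 0.08).astype(float)

I = np.eye(m)
P = np.linalg.inv(I - beta * M)          # exact proximity inversion (Eq. 9)

# Truncated expansion sum_{l=0}^{l_max} beta^l M^l
P_series, term = np.zeros((m, m)), np.eye(m)
for _ in range(l_max + 1):
    P_series += term
    term = beta * term @ M               # next term beta^(l+1) M^(l+1)

max_err = np.abs(P - P_series).max()     # tiny for small beta
```

Even though M is sparse, P picks up a non-zero entry for every reachable pair, which is why the paper never materializes P for real networks.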
To address the challenge, we first develop two novel dimensionality reduction techniques to approximate elements of P = (I − βM)^{−1} based on a static snapshot of M: proximity sketch and proximity embedding. We then develop an incremental proximity update algorithm to approximate elements of P in an online setting when M continuously evolves.

2.2.1 Preparation
We first present an algorithm to approximate the sum of a subset of rows or columns of P = (I − βM)^{−1} efficiently and accurately. We use this algorithm as a basic building block in both proximity sketch and proximity embedding.

Algorithm. Suppose we want to compute the sum of a subset of columns Σ_{i∈S} P[∗, i], where S is a set of column indices. We first construct an indicator column vector v such that v[i] = 1 for i ∈ S and v[j] = 0 for j ∉ S. The sum of columns Σ_{i∈S} P[∗, i] is simply Pv and can be approximated as:

    P·v = (I − βM)^{−1}·v = Σ_{l=0}^{∞} β^l M^l v ≈ Σ_{l=0}^{l_max} β^l M^l v    (10)

where l_max bounds the maximum length of the paths over which the summation is performed.

[Figure 1: Proximity sketch. P[x, y] is hashed into entry S_k[x, g_k(y)] in each hash table S_k (k = 1, ..., H), so S_k[x, g_k(y)] gives an upper bound on P[x, y]; P[x, y] is estimated by taking the minimum upper bound over all H hash tables.]

Similarly, to compute the sum of a subset of rows Σ_{i∈S} P[i, ∗], we first construct an indicator row vector u such that u[i] = 1 for i ∈ S and u[j] = 0 for j ∉ S. We then approximate the sum of rows Σ_{i∈S} P[i, ∗] = uP as:

    u·P = u·(I − βM)^{−1} = Σ_{l=0}^{∞} β^l u M^l ≈ Σ_{l=0}^{l_max} β^l u M^l    (11)

In one extreme where S contains all the column indices, we can compute the sum of all columns in P. This is useful for computing the PageRank (which is the average of all columns in the RPR matrix). In the other extreme where S contains only one element, we can efficiently approximate a single row or column of P.

Complexity. Suppose M is an m-by-m matrix with n non-zeros.
Computing the product of the sparse matrix M and a dense vector v takes O(n) time by exploiting the sparseness of M. So it takes O(n·l_max) time to compute {M^l v | l = 1, ..., l_max} and approximate Pv. Note that the time complexity is independent of the size of the subset S. The complexity for computing uP is identical.

Note however that the above approximation algorithm is not efficient for estimating individual elements of P. In particular, even if we only want a single element P[x, y], we have to compute either a complete row P[x, ∗] or a complete column P[∗, y] in order to obtain an estimate of P[x, y]. So we only apply the above technique for preprocessing. We will develop several techniques in the rest of this section to estimate individual elements of P efficiently.

Benefits of truncation. We achieve two key benefits by truncating the infinite expansion Σ_{l=0}^{∞} β^l M^l to form a finite expansion Σ_{l=0}^{l_max} β^l M^l. First, we completely eliminate the influence of paths with length above l_max on the resulting sums. This is desirable because, as pointed out in [3, 3], proximity measures that are unable to limit the influence of overly lengthy paths tend to perform poorly for link prediction. Second, we ensure that Σ_{l=0}^{l_max} β^l M^l is always finite, whereas elements of Σ_{l=0}^{∞} β^l M^l may reach infinity when the damping factor β is not small enough.

2.2.2 Proximity Sketch
Our first dimensionality reduction technique, proximity sketch, exploits the mice-elephant phenomenon that frequently arises in matrix P in practice, i.e., most elements in P are tiny (mice) but a few elements are huge (elephants).

Algorithm. Figure 1 shows the data structure for our proximity sketch, which consists of H hash tables: S_1, ..., S_H. Each S_k is a 2-dimensional array with m rows and c ≪ m columns. A column hash function g_k : {1, ..., m} → {1, ..., c} is used to hash each element of the m × m matrix P into an element of the m × c array S_k, namely S_k[x, g_k(y)].
We ensure that the different hash functions g_k(·) (k = 1, ..., H) are two-wise independent. In each S_k, each element P[x, y] is added to entry S_k[x, g_k(y)]. Thus,

    S_k[a, b] = Σ_{y: g_k(y)=b} P[a, y]    (12)
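A minimal sketch of this data structure (plain Python; the tuple-based hash below is a toy stand-in for the two-wise independent family assumed above, and the toy P is generated directly rather than via the preparation algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
m, c, H = 50, 16, 4
# Nonnegative toy P with a "mice and elephants" profile: mostly tiny entries
P = rng.exponential(0.01, size=(m, m))
for _ in range(20):
    P[rng.integers(m), rng.integers(m)] = 5.0   # a few elephants

def g(k, y):                  # column hash g_k: {0..m-1} -> {0..c-1}
    return hash((k, y)) % c   # illustrative, not two-wise independent

# Build the H hash tables: S_k[x, g_k(y)] accumulates P[x, y]
S = np.zeros((H, m, c))
for k in range(H):
    for y in range(m):
        S[k, :, g(k, y)] += P[:, y]

def estimate(x, y):           # minimum of the H upper bounds
    return min(S[k, x, g(k, y)] for k in range(H))
```

Each S_k[x, g_k(y)] over-counts P[x, y] by whatever other columns hash into the same bucket, so every estimate is an upper bound; taking the minimum over the H tables tightens it.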

Note that each column of S_k, i.e., S_k[∗, b] = Σ_{y: g_k(y)=b} P[∗, y], can be computed efficiently as described in Section 2.2.1. Since P is a nonnegative matrix, for any x, y ∈ V and any k ∈ [1, H], S_k[x, g_k(y)] is an upper bound on P[x, y] according to Eq. (12). We can therefore estimate P[x, y] by taking the minimum upper bound over all H hash tables in O(H) time. That is:

    P̂[x, y] = min_k S_k[x, g_k(y)]    (13)

Probabilistic accuracy guarantee. Our proximity sketch effectively summarizes each row P[x, ∗] of P using a count-min sketch []: {S_k[x, ∗] | k = 1, ..., H}. As a result, we provide the same probabilistic accuracy guarantee as the count-min sketch, which is summarized in the following theorem (see [] for a detailed proof).

THEOREM 1. With H = ln(1/δ) hash tables, each with c = e/ǫ columns, the estimate P̂[x, y] guarantees: (i) P[x, y] ≤ P̂[x, y]; and (ii) with probability at least 1 − δ, P̂[x, y] ≤ P[x, y] + ǫ · Σ_z P[x, z].

Therefore, as long as P[x, y] is much larger than ǫ · Σ_z P[x, z], the relative error of P̂[x, y] is small with high probability.

Extension. If desired, we can further reduce the space requirement of the proximity sketch by aggregating the rows of S_k (at the cost of lower accuracy). Specifically, we associate each S_k with a row hash function f_k(·). We then compute

    R_k[a, b] = Σ_{x: f_k(x)=a} S_k[x, b]    (14)

and store {R_k} (instead of {S_k}) as the final proximity sketch. Clearly, we have R_k[a, b] = Σ_{x: f_k(x)=a} Σ_{y: g_k(y)=b} P[x, y].
For any x, y ∈ V, we can then estimate P[x, y] as

    P̂[x, y] = min_k R_k[f_k(x), g_k(y)]    (15)

2.2.3 Proximity Embedding
Our second dimensionality reduction technique, proximity embedding, applies matrix factorization to approximate P as the product of two rank-r factor matrices U and V:

    P ≈ U · V    (16)

where P is m × m, U is m × r, and V is r × m. In this way, with O(2mr) total state for the factor matrices U and V, we can approximate any P[x, y] in O(r) time as:

    P̂[x, y] = Σ_{k=1}^{r} U[x, k] · V[k, y]    (17)

Our technique is motivated by recent research on embedding network distance (e.g., end-to-end round-trip time) into low-dimensional space (e.g., [32, 34, 43, 49]). Note however that proximity is the opposite of distance: the lower the distance, the higher the proximity. As a result, techniques effective for distance embedding do not necessarily work well for proximity embedding.

Algorithm. As shown in Figure 2(a), our goal is to derive the two rank-r factor matrices U and V based on only a subset of rows P[L, ∗] and columns P[∗, L], where L is a set of indices (which we term the landmark set). We achieve this goal by taking the following five steps:

1. Randomly select a subset of l nodes as the landmark set L. The probability for a node i to be included in L is proportional to the PageRank of node i in the underlying graph. (We also consider uniform landmark selection, but it yields worse accuracy than PageRank based landmark selection; see Section 4.) Note that
[Figure 2: Proximity embedding. (a) Goal: approximate P as the product of two rank-r matrices U and V by computing only a subset of rows P[L, ∗] and columns P[∗, L]. (b) Factorize P[L, L] to get U[L, ∗] and V[∗, L]. (c) Obtain U from P[∗, L] and V[∗, L]. (d) Obtain V from P[L, ∗] and U[L, ∗].]

the PageRank for all the nodes can be precomputed efficiently using the finite expansion method in Section 2.2.1.

2. Compute the sub-matrices P[L, ∗] and P[∗, L] efficiently by computing each row P[i, ∗] and each column P[∗, i] (i ∈ L) separately, as described in Section 2.2.1.

3. As shown in Figure 2(b), apply singular value decomposition (SVD) to obtain the best rank-r approximation of P[L, L]:

    P[L, L] ≈ U[L, ∗] · V[∗, L]    (18)

4. Our goal is to find U and V such that U·V is a good approximation of P. As a result, U·V[∗, L] should be a good approximation of P[∗, L]. We therefore find the U such that U·V[∗, L] best approximates sub-matrix P[∗, L] in the least-squares sense (shown in Figure 2(c)). Given the use of SVD in step 3, the best U is simply

    U = P[∗, L] · V[∗, L]^T    (19)

5. Similarly, find the V such that U[L, ∗]·V best approximates sub-matrix P[L, ∗] in the least-squares sense (shown in Figure 2(d)), which is simply

    V = U[L, ∗]^+ · P[L, ∗]    (20)

Accuracy. Unlike proximity sketch, proximity embedding does not provide any provable data-independent accuracy guarantee. However, as a data-adaptive dimensionality reduction technique, when matrix P is in fact low-rank (i.e., it has good low-rank approximations), proximity embedding has the potential to achieve even better accuracy than proximity sketch. Our empirical results in Section 4 suggest that this is indeed the case for the Katz measure.

2.2.4 Incremental Proximity Update
To enable online proximity estimation, we periodically checkpoint M and use the above dimensionality reduction techniques to approximate P for the last checkpoint of M.
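The five steps above can be sketched as follows. This is illustrative code on a synthetic exactly-low-rank P, with uniform rather than PageRank-biased landmark selection, and with NumPy's pseudo-inverse standing in for the least-squares fits:

```python
import numpy as np

rng = np.random.default_rng(1)
m, r, l = 60, 5, 12          # nodes, target rank, number of landmarks

# Synthetic rank-r nonnegative P (a stand-in for (I - beta*M)^-1)
P = rng.random((m, r)) @ rng.random((r, m))

# Step 1: landmark set L (uniform here; the paper prefers PageRank-biased)
L = rng.choice(m, size=l, replace=False)

# Step 2: only landmark rows/columns of P are ever needed
P_rows, P_cols = P[L, :], P[:, L]

# Step 3: rank-r SVD of the landmark-landmark block P[L, L]
u, s, vt = np.linalg.svd(P[np.ix_(L, L)])
U_L = u[:, :r] * s[:r]       # U[L,*] absorbs the singular values
V_L = vt[:r, :]              # V[*,L] has orthonormal rows

# Steps 4-5: least-squares fits for the full factors
U = P_cols @ np.linalg.pinv(V_L)   # U such that U @ V[*,L] ~ P[*,L]
V = np.linalg.pinv(U_L) @ P_rows   # V such that U[L,*] @ V ~ P[L,*]

P_hat = U @ V                # any entry now available in O(r) time
```

Because the synthetic P is exactly rank r, the landmark factorization recovers it essentially exactly; on real proximity matrices the approximation quality depends on how close P is to low-rank.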
Between two checkpoints, we apply an incremental update algorithm to approximate P′ = (I − βM′)^{−1}, where M′ = M + Δ is the current matrix. Our algorithm is based on the second-order approximation of P′:

    P′ = [I − β(M + Δ)]^{−1} ≈ (I − βM)^{−1} + βΔ + β²(ΔM + MΔ + Δ²)    (21)
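A quick numerical check of this second-order expansion (toy sizes, with Δ written as `Delta`; illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
m, beta = 30, 0.05
M = (rng.random((m, m)) < 0.1).astype(float)   # checkpointed sparse matrix
Delta = np.zeros((m, m))
Delta[3, 7] = Delta[12, 4] = 1.0               # a few newly added edges

I = np.eye(m)
P = np.linalg.inv(I - beta * M)                # proximity at last checkpoint
P_exact = np.linalg.inv(I - beta * (M + Delta))

# Second-order approximation of the updated inverse
P_approx = P + beta * Delta + beta**2 * (Delta @ M + M @ Delta + Delta @ Delta)

stale_err = np.abs(P_exact - P).max()          # error of ignoring Delta
incr_err = np.abs(P_exact - P_approx).max()    # error of the cheap update
```

The update only touches terms involving Δ, so when Δ is sparse each corrected entry costs a handful of sparse dot products instead of a fresh inversion.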

The second-order approximation works well as long as Δ has only a few non-zero elements and β is small, making higher-order terms negligible. To estimate an individual element P′[x, y], we simply use:

    P′[x, y] ≈ P[x, y] + β·Δ[x, y] + β² Σ_{k: Δ[x,k]≠0} Δ[x, k]·M[k, y] + β² Σ_{k: Δ[k,y]≠0} M[x, k]·Δ[k, y] + β² Σ_{k: Δ[x,k]≠0} Δ[x, k]·Δ[k, y]    (22)

If we checkpoint M frequently enough, the difference between the last checkpoint M and the current matrix M′ will be quite small. In other words, the difference matrix Δ is likely to be sparse. As a result, we expect row Δ[x, ∗] and column Δ[∗, y] to have few non-zero elements. By leveraging such sparseness, we can efficiently compute Eq. (22) in an online fashion. We demonstrate the efficiency and accuracy of our incremental proximity update algorithm in Section 4.

3. LINK PREDICTION TECHNIQUES
We use link prediction as a significant application of our proximity estimation methods. Our goal is to understand (i) the effectiveness of various proximity measures in the context of link prediction, and (ii) the benefit of combining multiple proximity measures. In this section, we summarize the link predictors and the proximity measures we use.

3.1 Link Predictors
We consider two types of link predictors: (i) basic link predictors that use a single proximity measure, and (ii) composite link predictors that use multiple proximity measures.

Basic link predictors. A basic link predictor consists of a proximity measure prox[∗, ∗] and a threshold T. Given an input graph G = (V, E) (which models a past snapshot of a given social network), a node pair ⟨x, y⟩ ∉ E is predicted to form an edge in the future if and only if the proximity between x and y is sufficiently large, i.e., prox[x, y] ≥ T.

Composite link predictors. A composite link predictor uses machine learning techniques to make link predictions based on multiple proximity measures.
We use the WEKA machine learning package [2] to automatically generate composite link predictors using a number of machine learning algorithms, including the REPtree decision tree learner, the J48 decision tree learner, the JRip rule learner, a support vector machine (SVM) learner, and an Adaboost learner. The results are consistent across the different learners we use, so we only report the results of the REPtree decision tree learner. REPtree is a variant of the commonly used C4.5 decision tree learning algorithm [46]. It builds a decision tree using information gain and prunes it using reduced-error pruning. It allows direct control over the depth of the learned decision tree, making it easy to visualize and interpret the resulting composite link predictor.

3.2 Proximity Measures
We consider three classes of proximity measures, summarized in Table 1, which are based on (i) graph distance, (ii) node neighborhood, and (iii) ensemble of paths, respectively.

Notations. We model a social network as a graph G = (V, E), where V is the set of nodes, and E is the set of edges. G can be either directed or undirected. For a node x, let N(x) = {y | ⟨x, y⟩ ∈ E} be the set of neighbors x has in G. Similarly, let N^{−1}(x) = {y | ⟨y, x⟩ ∈ E} be the set of inverse neighbors x has in G (i.e., nodes that have x as their neighbor). Let A be the adjacency matrix for G (defined in Eq. 2). Let T = D^{−1}A be the adjacency matrix with row sums normalized to 1, where D is a diagonal matrix with D[i, i] = Σ_j A[i, j].

Table 1: Summary of proximity measures
- graph distance: negated length of the shortest path from x to y
- common neighbors: |N(x) ∩ N(y)|
- Adamic/Adar: AA[x, y] = Σ_{z ∈ N(x) ∩ N(y)} 1/log|N(z)|
- preferential attachment: PA[x, y] = |N(x)| · |N(y)|
- PageRank product: PR(x) · PR(y), where PR(x) = (1 − d)/|V| + d · Σ_{z ∈ N^{−1}(x)} PR(z)/|N(z)|
- Katz: Katz[x, y] = Σ_{l=1}^{∞} β^l · |paths^{(l)}_{x,y}|; in matrix form, Katz = (I − βA)^{−1} − I

Graph distance based proximity measure. Perhaps the most direct metric for quantifying how close two nodes are is the graph distance.
We thus define a proximity measure equal to the negative of the shortest-path distance from x to y. Note that the use of the negated (instead of the original) shortest-path distance ensures that the proximity measure increases as x and y get closer. Note also that it is inefficient to apply Dijkstra's algorithm to compute the shortest-path distance from x to y when G has millions of nodes. Instead, we exploit the small-world property [27] of the social network and apply expanded ring search to compute the shortest-path distance from x to y. Specifically, we initialize S = {x} and D = {y}. In each step we either expand set S to include its members' neighbors (i.e., S = S ∪ {v | ⟨u, v⟩ ∈ E, u ∈ S}) or expand set D to include its members' inverse neighbors (i.e., D = D ∪ {u | ⟨u, v⟩ ∈ E, v ∈ D}). We stop whenever S ∩ D ≠ ∅; the number of steps taken so far gives the shortest-path distance. For efficiency, we always expand the smaller of S and D in each step. We also stop when a maximum number of steps is reached (set to 6 in our evaluation).

Node neighborhood based proximity measures. We define four proximity measures based on node neighborhood.

Common neighbors. Two nodes x and y are more likely to become friends when the overlap of their neighborhoods is large. The simplest form of this approach is to count the size of the intersection: |N(x) ∩ N(y)|.

Adamic/Adar. Like common neighbors, Adamic/Adar [] also measures the size of the intersection of two neighborhoods. However, Adamic/Adar takes rareness into account, giving more weight to common neighbors with smaller numbers of friends: AA[x, y] = Σ_{z ∈ N(x) ∩ N(y)} 1/log|N(z)|.

Preferential attachment. Preferential attachment is based on the idea that the probability of gaining a new neighbor is proportional to the size of the current neighborhood, so the probability of two users becoming friends is taken to be proportional to the product of their current numbers of friends. We therefore define the proximity measure PA[x, y] = |N(x)| · |N(y)|.
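The common-neighbors, Adamic/Adar, and preferential-attachment measures can be computed directly from adjacency sets. A toy sketch (hypothetical graph; a guard is added for shared neighbors of degree 1, whose logarithm would be zero):

```python
from math import log

# Toy graph as neighbor sets N(x)
N = {
    'a': {'b', 'c', 'd'},
    'b': {'c', 'd'},
    'c': {'a', 'd'},
    'd': {'a', 'b'},
}

def common_neighbors(x, y):
    # size of the neighborhood intersection
    return len(N[x] & N[y])

def adamic_adar(x, y):
    # rarer shared neighbors weigh more (1/log|N(z)|); skip degree-1 nodes
    return sum(1.0 / log(len(N[z])) for z in N[x] & N[y] if len(N[z]) > 1)

def preferential_attachment(x, y):
    # product of the current neighborhood sizes
    return len(N[x]) * len(N[y])
```

For the pair (a, b): two common neighbors (c and d), an Adamic/Adar score of 2/log 2, and a preferential-attachment score of 3 × 2 = 6.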
PageRank product. PageRank was developed to analyze the hyperlink structure of Web pages by treating each hyperlink as a vote. The PageRank of a node depends on the number of inbound links and the PageRank of the neighbors that point to it. Formally, the PageRank of a node x, denoted PR(x), is defined recursively on G = (V, E) as

    PR(x) = (1 − d)/|V| + d · Σ_{z ∈ N^{−1}(x)} PR(z)/|N(z)|    (23)

where d is a damping factor. We define the PageRank product of two nodes x and y as the product of their two PageRank values: PR(x) · PR(y).
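Eq. (23) can be iterated to a fixed point. A small sketch (hypothetical graph, plain power iteration; dangling nodes are excluded for simplicity):

```python
# Toy directed graph as out-neighbor lists; every node has out-degree >= 1
out = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, d = len(out), 0.85

pr = {x: 1.0 / n for x in out}          # uniform starting distribution
for _ in range(100):
    # Eq. (23): restart term plus votes from inverse neighbors
    pr = {x: (1 - d) / n
             + d * sum(pr[z] / len(out[z]) for z in out if x in out[z])
          for x in out}

pagerank_product = pr[0] * pr[1]        # PageRank-product proximity of (0, 1)
```

With no dangling nodes the scores stay normalized, and node 3 (which nobody links to) bottoms out at the restart mass (1 − d)/|V|.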

Path-ensemble based proximity measures. We use the Katz measure (Katz[x, y]) as our path-ensemble based proximity measure (described in Section 2.1). We use the Katz measure as the representative of path-ensemble based proximity measures for two main reasons. First, as shown in [3, 3], the Katz measure is more effective than other path-ensemble based proximity measures such as the rooted PageRank. Second, our results in Section 4 show that the accuracy of our proximity estimation methods is the highest for the Katz measure.

4. EVALUATION

4.1 Dataset Description

Network | Snapshot Date | # of Connected Nodes | # of Links | # of Added Links | Asymmetric Link Fraction
Digg | 9/5/28 | 535,7 | 4,432,726 | |
Digg | /25/28 | 567,77 | 4,83, | , | %
Digg | //28 | 567,77 | 4,94,4 | 75,958 |
Flickr | 3//27 | ,932,735 | 26,72,29 | |
Flickr | 4/5/27 | 2,72,692 | 3,393,94 | 3,69, | %
Flickr | 5/8/27 | 2,72,692 | 32,399,243 | 2,5,33 |
LiveJournal | /3/28 | ,769,493 | 6,488,262 | |
LiveJournal | 2/5/28 | ,769,543 | 6,92,736 | ,566, | %
LiveJournal | /3/29 | ,769,543 | 62,843,995 | 3,93,64 |
MySpace | 2//28 | 2,28,945 | 89,38,628 | |
MySpace | //29 | 2,37,773 | 9,629,452 | ,845,898 | %
MySpace | 2/4/29 | 2,37,773 | 89,34,78 | 696,6 |
YouTube | 4/3/27 | 2,2,28 | 9,762,825 | |
YouTube | 6/5/27 | 2,532,5 | 3,7,64 | 3,254,239 | %
YouTube | 7/23/27 | 2,532,5 | 5,337,226 | 2,32,62 |
Wikipedia | 9/3/26 | ,636,96 | 28,95,37 | |
Wikipedia | 2/3/26 | ,758,323 | 33,974,78 | 5,24,57 | 83.%
Wikipedia | 4/6/27 | ,758,323 | 38,349,329 | 4,374,62 |

Table 2: Dataset summary

We carry out our evaluation on five popular online social networks: Digg [4], Flickr [2], LiveJournal [33], MySpace [4], and YouTube [55]. For comparison, we also examine the hyperlink structure of Wikipedia [54]. For each network, we conduct three crawls and make three snapshots of the network. Table 2 summarizes the characteristics of the three snapshots for each of the networks. Note that, for the purpose of link prediction, we only use connected nodes (i.e., nodes with at least one incoming or outgoing friendship link), rather than considering all the crawled nodes.
Another point to note is that since link prediction means predicting, based on one snapshot of the network, the new links that are formed by the next snapshot, the same set of users should appear in two consecutive snapshots. Hence, for a growing network, the number of users appearing in the last snapshot that we create may be less than the total number of users (to match the previous snapshot). Lastly, although there can be both link additions and deletions between two snapshots, since the goal of link prediction is to predict those that get added, we explicitly show the number of added links between two consecutive snapshots in Table 2.

Digg [4] is a website for users to share interesting Web content by posting a link to it. The posted link can be voted as either positive ("digg") or negative ("bury") by other users. Digg allows a user to become a fan of other users, which we consider as a friendship relation. All the friendship links together form a directed social graph. Overall, 58.3% of directly connected user pairs in Digg have asymmetric friendship (i.e., a friendship link exists in only one direction between the two users). We obtained the entire list of 1.9 million users in September 2008. We crawled friendship links among these users using the Digg API [5] in September 2008, October 2008, and November 2008. The resulting snapshots contain more than 500,000 connected users (i.e., users with at least one incoming or outgoing friendship link) out of the 1.9 million crawled users.

Flickr [2] is a popular photo-sharing website. Flickr allows users to add other people as contacts to form a directed social link. We use the Flickr dataset collected by [36], which represents a breadth first search on the graph from a set of seed users. The dataset gives the growth of Flickr for 104 days and contains 33 million links among 2.3 million users. We treat the first 25 days as the bootstrap period to ensure that the crawl has sufficiently stabilized.
We then partition the remaining dataset into three snapshots separated by approximately 4 days each. Note that the third snapshot of Flickr contains links for the same 2.7 million users that appear in the second snapshot (and not the entire 2.3 million users).

LiveJournal [33] is a Web community that allows its users to post entries to personal journals. LiveJournal also acts as a social networking site, where a user can become a fan of another LiveJournal user. We consider this fan relationship a directed friendship link in the social graph. Since LiveJournal does not provide a complete list of users, we obtained a list of active users who have published posts by analyzing periodic RSS announcements of recently updated journals, starting from July 28. We then used the LiveJournal API to gather friendship information for 2.2 million active users in November 28, December 28, and January 29. The resulting snapshots have about .8 million connected users with non-zero friendship links.

MySpace [4] is a social networking site where users can interact with each other by personalizing pages, commenting on others' photos and videos, and making friends. For two MySpace users to become friends, both parties have to agree. Therefore, the social links in MySpace are undirected and thus symmetric. We crawled million MySpace users out of over 4 million by taking the first million user IDs, in December 28, January 29, and February 29. After discarding all inactive, deleted, private, and solitary MySpace IDs, we obtain information for approximately 2. million users in each resulting snapshot.

YouTube [55] is a popular video-sharing website. Registered users can connect with others by creating friendship links. We use the undirected version of the social graph collected by [36], which covers the growth of YouTube over 65 days, with 8 million added links among 3.2 million users. We divide the dataset into three snapshots separated by 45 days each.
Note that the third snapshot of YouTube in Table 2 contains links for the 2.5 million users that also appear in the second snapshot (and not the entire 3.2 million users).

Wikipedia [54] is an online encyclopedia whose content is built through user collaboration. Different wiki pages are connected through hyperlinks. We compare Wikipedia's hyperlink structure against the social graphs of users from the previous five online social networks. As with general Web pages, most links in Wikipedia are asymmetric. We use the data collected by [36] over a six-year period from 2 to 27, which contains 38 million links connecting .8 million pages. We extract three snapshots separated by approximately 9 days each.

4.2 Proximity Estimation Algorithms

In this section, we evaluate the accuracy and scalability of our proximity estimation methods using the above six datasets. We present results for Katz and RPR (defined in Section 2). The accuracy for escape probability (EP) is similar to that for RPR (due to their close relationship in Eq. 5) and is omitted in the interest of brevity.

Accuracy metrics. We quantify the estimation error using three different metrics: (i) Normalized Absolute Error (NAE), defined as |est_i - actual_i| / mean_i(actual_i); (ii) Normalized Mean Absolute Error (NMAE), defined as sum_i |est_i - actual_i| / sum_i actual_i; and (iii) Relative Error, defined as |est_i - actual_i| / actual_i.

Network | PageRank based selection | Uniform selection
Digg | .5 | .23
Flickr | . | .238
LiveJournal | |
MySpace | .6 | .32
YouTube | |
Wikipedia | |
Table 3: NMAE of different landmark selection schemes.

Here, est_i and actual_i denote the estimated and actual values of the proximity measure for node pair i, respectively. Since it is expensive to compute the actual proximity measures over all data points, we randomly sample , data points by first randomly selecting 2 rows from the proximity matrix and then selecting 5 elements from each of these rows. We then compute errors over these , data points.

4.2.1 Proximity Embedding

We first evaluate the accuracy of proximity embedding. We aim to answer the following questions: (i) How accurate is proximity embedding in estimating Katz and RPR? (ii) How many dimensions and landmarks are required to achieve high accuracy? (iii) How does the landmark selection algorithm affect accuracy?

Parameter settings. Throughout our evaluation, we use a damping factor of β = .5, l_max = 6, and 6 landmarks unless otherwise specified. We also vary these parameters to understand their impact. By default, we select landmarks based on the PageRank of each node. Specifically, we first compute the PageRank of each node and normalize the PageRank values of all nodes to sum to 1. We then use the normalized PageRank as the probability of selecting a node as a landmark. In this way, nodes with high PageRank values are more likely to become landmarks. For comparison, we also examine the performance of uniform landmark selection, which selects landmarks uniformly at random.

Varying the number of dimensions. Figure 3 plots the CDF of normalized absolute errors in approximating the Katz measure as we vary the number of dimensions from 5 to 6. We make two key observations. First, for all six datasets the normalized absolute error is small: in more than 95% of cases the normalized absolute error is within .5, and NMAE is within .5 for all datasets except YouTube.
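The three error metrics defined above can be computed directly from sampled (estimated, actual) proximity pairs. The following is a minimal illustrative sketch, not the code used in our evaluation:

```python
# Illustrative implementations of the three accuracy metrics from Section 4.2.
# est and actual are parallel lists of estimated and true proximity values.

def nae(est, actual):
    """Normalized Absolute Error per pair: |est_i - actual_i| / mean(actual)."""
    m = sum(actual) / len(actual)
    return [abs(e - a) / m for e, a in zip(est, actual)]

def nmae(est, actual):
    """Normalized Mean Absolute Error: sum |est_i - actual_i| / sum actual_i."""
    return sum(abs(e - a) for e, a in zip(est, actual)) / sum(actual)

def relative_error(est, actual):
    """Relative error per pair: |est_i - actual_i| / actual_i."""
    return [abs(e - a) / a for e, a in zip(est, actual)]

est = [0.9, 2.2, 3.0]
actual = [1.0, 2.0, 3.0]
print(nmae(est, actual))  # ≈ 0.05  (|0.1| + |0.2| + 0 over 6.0)
```

NMAE aggregates over all sampled pairs into one number, whereas NAE and relative error produce one value per pair, from which the CDFs in Figures 3 and 4 can be drawn.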
The error in YouTube is higher because its intrinsic dimensionality is higher, as analyzed in Figure 6 (see below). Second, as we would expect, the error decreases with the number of dimensions. The reduction is more significant in the YouTube dataset: the other datasets have very low intrinsic dimensionality, so using only 5 dimensions already gives low approximation error, whereas YouTube has higher intrinsic dimensionality, and increasing the number of dimensions helps more.

Relative errors. Figure 4 further plots the CDF of relative errors using 6 dimensions. We take the top %, 5%, and % of the randomly selected data points and generate the CDF for each selection. In all datasets, we observe that the relative errors are smaller for elements with larger values. This is desirable because larger elements play a more important role in many applications and are thus more important to estimate accurately.

Uniform landmark selection. Table 3 compares the NMAE of PageRank based landmark selection and uniform selection. PageRank based selection yields higher accuracy than uniform selection: it reduces NMAE by 35% for Digg, 95.8% for Flickr, 83.3% for LiveJournal, 5% for MySpace, 6% for YouTube, and 9% for Wikipedia. The reason is that high-PageRank nodes are well connected, so nodes are less likely to be far away from all such landmarks, which improves the estimation accuracy.

Table 4: Comparing proximity embedding and proximity sketch in estimating large Katz values (NMAE for values larger than %, .%, and % of row sum): (a) NMAE of proximity embedding; (b) NMAE of proximity sketch.
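The PageRank-weighted landmark selection described above can be sketched as follows. This is an illustrative implementation on a toy adjacency list; the function names (`pagerank`, `pick_landmarks`) and parameter defaults are our own, not the paper's code:

```python
import random

def pagerank(adj, beta=0.85, iters=50):
    """Plain power-iteration PageRank on an adjacency list {node: [out-neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - beta) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = beta * pr[v] / len(out)
                for w in out:
                    nxt[w] += share
            else:  # dangling node: spread its mass uniformly
                for w in nodes:
                    nxt[w] += beta * pr[v] / n
        pr = nxt
    return pr

def pick_landmarks(adj, k, seed=0):
    """Sample k distinct landmarks with probability proportional to PageRank."""
    pr = pagerank(adj)
    rng = random.Random(seed)
    nodes, weights = zip(*pr.items())
    landmarks = set()
    while len(landmarks) < k:
        landmarks.add(rng.choices(nodes, weights=weights)[0])
    return landmarks

adj = {1: [2, 3], 2: [3], 3: [1], 4: [3]}
lm = pick_landmarks(adj, 2)
print(sorted(lm))
```

Because the sampling weights are the normalized PageRank values, well-connected nodes are selected more often, matching the PageRank based scheme evaluated in Table 3.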
Table 5: Comparing proximity embedding and proximity sketch in estimating large RPR values (NMAE for values larger than %, .%, and % of row sum): (a) NMAE of proximity embedding; (b) NMAE of proximity sketch.

Varying the number of landmarks. Figure 5 shows the NMAE as we vary the number of landmarks. As before, we use PageRank based landmark selection. For all the datasets, the NMAE decreases with the number of landmarks. The decrease is sharp when the number of landmarks is small and then tapers off as the number of landmarks reaches -2. In all cases, 6 landmarks are enough, and further increasing the number yields only marginal benefit, if any.

Estimating large Katz values. Table 4(a) shows the accuracy of proximity embedding in estimating Katz values larger than %, .%, and % of their corresponding row sums in the Katz matrix. As we can see, for elements larger than , the NMAE is low (the largest is .22 for YouTube). In comparison, for elements larger than % and .% of the row sums, the NMAE is often larger (e.g., the corresponding values for LiveJournal are and .25). Manual inspection suggests that many values larger than % of the row sum involve a direct link between two nodes in an isolated island of the network that cannot reach any landmarks. For such node pairs, proximity embedding yields an estimate of 0, thus seriously inflating the NMAE. Fortunately, such large values are quite rare in each row. As a result, the NMAE is low when we consider all elements in the matrix.

Estimating large Rooted PageRank values. Table 5(a) shows the accuracy of proximity embedding in estimating Rooted PageRank

values larger than %, .%, and % of their corresponding row sums in the RPR matrix. We observe that the NMAE for RPR is much larger than the NMAE for Katz. To understand why proximity embedding performs well on Katz but not on RPR, we plot the fraction of total variance captured by the best rank-k approximation to the inter-landmark proximity matrices Katz[L, L] and RPR[L, L] in Figure 6, where L is the set of landmarks. Note that the best rank-k approximation to a matrix can easily be computed through singular value decomposition (SVD). The fewer dimensions (i.e., the smaller the k) it takes to capture most of the variance of a matrix, the lower the intrinsic dimensionality of that matrix. As we can see, for LiveJournal, even 3 dimensions capture over 99% of the variance for Katz, whereas it takes 59 dimensions to achieve similar approximation accuracy for rooted PageRank.

Figure 3: Normalized absolute errors with a varying number of dimensions (Katz measure, β = .5, and 6 landmarks); panels: (a) Digg, (b) Flickr, (c) LiveJournal, (d) MySpace, (e) YouTube, (f) Wikipedia.

Figure 4: Relative errors for the top %, 5%, and % largest Katz values (Katz measure, β = .5, 6 landmarks, and 6 dimensions); panels: (a) Digg, (b) Flickr, (c) LiveJournal, (d) MySpace, (e) YouTube, (f) Wikipedia.
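The fraction-of-variance analysis behind Figure 6 reduces to a computation on singular values: by the Eckart-Young theorem, the best rank-k approximation captures exactly the top-k squared singular values. A minimal sketch, assuming the singular values of the inter-landmark matrix are already available (e.g., from an SVD routine):

```python
# Fraction of total variance captured by the best rank-k approximation:
# sum of the top-k squared singular values over the sum of all squared
# singular values. A matrix is "low-rank" when this ratio approaches 1
# for small k.

def variance_captured(singular_values, k):
    s2 = sorted((s * s for s in singular_values), reverse=True)
    return sum(s2[:k]) / sum(s2)

# A spectrum dominated by one singular value: rank-1 captures ~91%.
print(variance_captured([10, 3, 1], 1))
```

For a matrix like Katz[L, L], this ratio is near 1 for very small k, which is precisely what makes a low-dimensional embedding accurate; for RPR[L, L] the ratio grows much more slowly with k.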
This indicates that the RPR matrix is not low-rank, whereas the Katz matrix exhibits a low-rank structure, which is what makes proximity embedding work well.

4.2.2 Proximity Sketch

We now evaluate the accuracy of proximity sketch. We use H = 3 hash tables and c = 6 columns in each table.

Estimating large Katz values. Table 4(b) shows the NMAE of proximity sketch in estimating Katz values larger than %, .%, and % of row sum. Comparing Table 4(a) and (b), we observe

that proximity sketch generally performs better than proximity embedding for large values (i.e., greater than % of row sums) and worse for small values (i.e., greater than .% and of row sums). This is consistent with our expectation, since proximity sketch is designed to approximate large elements. It also suggests that the two algorithms are complementary, and we can potentially build a hybrid algorithm that chooses between the two based on the magnitude of the estimated values.

Estimating large Rooted PageRank values. Table 5(b) shows the NMAE of proximity sketch in estimating RPR. As with Katz, proximity sketch yields lower error than proximity embedding for large elements (i.e., larger than % and .% of row sums) and higher error for small elements (i.e., larger than ). This confirms that proximity sketch is effective in estimating large elements.

Figure 5: NMAE for different numbers of landmarks under PageRank based landmark selection; panels: (a) Digg and MySpace, (b) Flickr, LiveJournal, YouTube, and Wikipedia.

Figure 6: Fraction of total variance captured by the best rank-k approximation to the inter-landmark proximity matrices Katz[L, L] and RPR[L, L]; panels: (a) Digg, (b) Flickr, (c) LiveJournal, (d) MySpace, (e) YouTube, (f) Wikipedia.

4.2.3 Incremental Proximity Update

Next, we evaluate the accuracy of our incremental proximity update algorithm.
We use two checkpoints of crawl data that differ by one day: May 7 8, 27 for Flickr, July 22 23, 27 for YouTube, and April 5 6, 27 for Wikipedia. The one-day gap between checkpoints yields a .3%, .5%, and .5% difference between M and M′ for Flickr, YouTube, and Wikipedia, respectively. We do not use LiveJournal, MySpace, and Digg, since we have no checkpoints that are one day apart for these sites.

Figure 7 plots the CDFs of normalized absolute errors and relative errors for the Katz measure using the incremental update algorithm in conjunction with proximity embedding (denoted "Embed. + inc. updates"). For comparison, we also plot the errors of using proximity embedding alone (denoted "Prox. embedding only"). As we can see, the curves for incremental update are very close to those of proximity embedding. This indicates that we can use the incremental algorithm to efficiently and accurately update the proximity matrix as M changes dynamically, and re-compute the proximity matrix only periodically, over multiple days.

4.2.4 Scalability

Table 6 shows the computation time for proximity embedding, proximity sketch, incremental proximity embedding, common neighbor, and shortest path computation. The measurements are taken on an Intel Core 2 Duo 2.33GHz machine with 4GB memory running Ubuntu Linux kernel v. We explicitly distinguish the query time for positive and negative samples: a node pair (A, B) is considered a positive sample if there is a friendship link from A to B; otherwise it is considered a negative sample.

Figure 7: Normalized absolute errors and relative errors of the incremental update algorithm, Katz measure; panels: (a) Flickr, normalized absolute errors, (b) Flickr, relative errors (top %), (c) YouTube, normalized absolute errors, (d) YouTube, relative errors (top %), (e) Wikipedia, normalized absolute errors, (f) Wikipedia, relative errors (top %).

We make several observations. First, the query time for both proximity embedding and proximity sketch is small, even smaller than that of common neighbor and shortest path computation, which are traditionally considered much cheaper operations than computing the Katz measure. Second, the preprocessing time of proximity sketch and proximity embedding increases linearly with the number of links in the dataset. As preprocessing can be done in parallel, we use the Condor system [7] for datasets with a large number of links. Running the preprocessing step simultaneously on 5 machines, we observe that the total preprocessing time drops to less than 3 minutes, even for large networks such as MySpace and LiveJournal. Furthermore, the pre-processing only needs to be done periodically (e.g., once every few days). For symmetric networks, such as MySpace and YouTube, proximity embedding only needs to compute either P[L, ·] or P[·, L], reducing the preparation time to half that of proximity sketch. Finally, the incremental proximity update algorithm eliminates pre-computation time at the cost of increased query time. However, even the increased query time is much smaller than that of shortest path computation.
These results demonstrate the scalability of our approaches.

4.2.5 Summary

To summarize, our evaluation shows that proximity embedding and proximity sketch are complementary: the former is more accurate in estimating random samples, whereas the latter is more effective in estimating large elements. Comparing Katz and RPR, proximity embedding yields much more accurate estimates for Katz than for RPR, due to the low-rank structure of the Katz matrix. In particular, proximity embedding yields highly accurate estimates for random samples (achieving NMAE within .2). Note that for the purpose of link prediction it is insufficient to estimate only a few large values, because (i) more than 5% of node pairs with a Katz measure greater than .% of row sums already have a friendship (i.e., links), and (ii) for the remaining 85% of node pairs, which are not already friends, the probability of node pairs with large Katz values becoming friends is only .2%, whereas the average probability of random node pairs becoming friends is .57%. We therefore use proximity embedding to estimate Katz for link prediction in Section 4.3.

4.3 Link Prediction Evaluation

4.3.1 Evaluation Methodology

In the friendship link prediction problem, we construct the positive set as the set of user pairs that were not friends in the first snapshot but become friends in the second snapshot. The negative set consists of user pairs that are friends in neither snapshot.

Metrics. We measure link prediction accuracy in terms of false negative rate (FN) and false positive rate (FP):

FN = (# of missed friend links) / (# of new-friend links)
FP = (# of incorrectly predicted friend links) / (# of non-friend links)

We also observe that not all proximity measures are applicable to all node pairs. For example, common neighbor is applicable only when nodes are two hops away. We therefore introduce another metric, the applicable ratio (AR), which quantifies the fraction of node pairs for which a given proximity measure is non-zero.

Training and testing datasets.
As shown earlier, we create three snapshots of friendship networks (S1, S2, and S3) for each network in Table 2. We train link predictors by analyzing the differences in link relations between S1 and S2, and predict who will likely form friendship relations in S3. Therefore, the training set is based on the period from S1 to S2, and the testing set is based on the period from S2 to S3. Note that we only consider the common users of the two snapshots when we count the new friendship links. Since friendship relations in our datasets are very sparse, we sample about 5, positives and 2, negatives for both training and testing purposes.

Link predictors. We evaluate basic link predictors based on the following three classes of proximity measures.

Distance based predictor: graph distance (GD).

Node neighborhood based predictors: common neighbors (CN), Adamic/Adar (AA), preferential attachment (PA), and PageRank product (PRP).

Path-ensemble based predictor: Katz.
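For illustration, several of the basic predictors above can be sketched on a toy undirected graph represented as adjacency sets. This is an assumed representation for exposition, not the evaluation code:

```python
import math
from collections import deque

# Toy implementations of basic link predictors on an undirected graph
# given as {node: set(neighbors)}.

def common_neighbors(g, x, y):
    return len(g[x] & g[y])

def adamic_adar(g, x, y):
    # Shared neighbors weighted inversely by the log of their degree.
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y] if len(g[z]) > 1)

def preferential_attachment(g, x, y):
    return len(g[x]) * len(g[y])

def graph_distance(g, x, y):
    """Shortest-path hop count via BFS; None if unreachable."""
    seen, frontier = {x: 0}, deque([x])
    while frontier:
        v = frontier.popleft()
        if v == y:
            return seen[v]
        for w in g[v]:
            if w not in seen:
                seen[w] = seen[v] + 1
                frontier.append(w)
    return None

g = {
    'a': {'b', 'c'},
    'b': {'a', 'c', 'd'},
    'c': {'a', 'b'},
    'd': {'b'},
}
print(common_neighbors(g, 'a', 'd'))  # 1  (b is the only shared neighbor)
print(graph_distance(g, 'a', 'd'))    # 2  (a-b-d)
```

CN and AA are non-zero only for pairs within two hops, which is exactly why their applicable ratio (AR) is the worst in Table 7, while GD is non-zero for any connected pair.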

Table 6: Computation time of proximity embedding, proximity sketch, incremental update, common neighbor, and shortest path distance (query times listed as positive / negative samples).

Dataset | Embedding preprocess | Embedding query | Sketch preprocess | Sketch query | Incremental query | Common neighbor query | Shortest path query
Digg | .8hrs | .µs / .µs | .6hrs | .3µs / .2µs | 7.3µs / 2.µs | 5.µs / 2.µs | 49.2µs / 32.5µs
Flickr | 4.26hrs | 47.9µs / .8µs | 4.88hrs | 45.2µs / 32.5µs | 6.4µs / 457.7µs | 478.9µs / 8.µs | 3.µs / 72.4µs
LiveJournal | 8.29hrs | 4.6µs / 3.7µs | 8.7hrs | 5.µs / 3.4µs | µs / 734.µs | 546.2µs / 37.5µs | µs / 246.7µs
MySpace | 4.92hrs | 58.8µs / 84.µs | 9.85hrs | 62.µs / 88.5µs | µs / 65.5µs | 588.µs / 84.8µs | µs / µs
YouTube | 2.27hrs | 25.µs / 2.6µs | 4.67hrs | 32.4µs / 35.6µs | 68.5µs / 596.8µs | 25.6µs / 26.4µs | µs / 29.9µs
Wikipedia | 3.3hrs | .5µs / .2µs | 3.75hrs | .6µs / .2µs | 4.µs / 46.5µs | 54.µs / 24.µs | 377.9µs / 374.6µs

PA AA
Digg | positive | 9.8% | % | 32.4% | 32.% | 56.6% | 46.5%
Digg | negative | 2.7% | % | .2% | .% | 29.9% | 35.6%
Digg | all | 4.8% | % | .2% | .% | 37.8% | 39.2%
Flickr | positive | % | 67.5% | 63.3% | 63.% | 97.8% | 97.3%
Flickr | negative | % | 8.% | .% | .% | 96.% | 69.9%
Flickr | all | % | 77.5% | 2.8% | 2.8% | 96.4% | 75.4%
LiveJournal | positive | 99.4% | % | 27.5% | 27.5% | 74.7% | 98.6%
LiveJournal | negative | 93.% | % | .2% | .3% | 9.8% | 93.%
LiveJournal | all | 94.5% | % | 5.7% | 5.6% | 88.3% | 94.5%
MySpace | positive | % | % | 72.8% | 72.8% | % | %
MySpace | negative | % | % | .5% | .5% | % | 95.%
MySpace | all | % | % | 5.3% | 5.3% | % | 95.9%
YouTube | positive | % | % | 3.8% | 3.9% | 99.% | %
YouTube | negative | % | % | .3% | .2% | 9.2% | %
YouTube | all | % | % | 7.3% | 7.% | 93.6% | %
Wikipedia | positive | 73.2% | % | 44.7% | 44.6% | 74.5% | 74.4%
Wikipedia | negative | 88.9% | % | 4.3% | 4.3% | 82.9% | 8.7%
Wikipedia | all | 85.% | % | 4.3% | 4.3% | 8.9% | 79.%
Table 7: Applicable ratio of basic link predictors (fraction of positive/negative/all samples with non-zero proximity values).

For the composite link predictor, we present results using the REPtree decision tree learner in the WEKA machine learning package [2]. As noted in Section 3., REPtree is easy to control, visualize, and interpret.
It also achieves accuracy similar to other machine learning algorithms.

4.3.2 Evaluation of Basic Link Predictors

We calculate the trade-off curves between false positives and false negatives for each basic link predictor, as shown in Figure 8. In addition to accuracy, we compute the fraction of node pairs with non-zero proximity values for each proximity measure in Table 7.

Variation across predictors and datasets. In Figure 8, we observe significant variation in link prediction accuracy, both across different datasets and across different link predictors:

Neighborhood based proximity measures. Common neighbors (denoted CN) and Adamic/Adar (denoted AA) perform well in LiveJournal, MySpace, and Flickr. But in terms of AR, both CN and AA cover only two-hop relations, resulting in the worst AR; only about 3% of samples are non-zero in the LiveJournal, Digg, and YouTube datasets (shown in Table 7). In comparison, both preferential attachment (PA) and PageRank product (PRP) yield much higher AR. PRP performs best in the YouTube and Wikipedia datasets (Figure 8(e,f)) but worst in the LiveJournal and Flickr datasets (Figure 8(b,c)). These results suggest that each individual measure has its own merits and limitations; none performs universally well over all the datasets.

Path-ensemble based proximity measure. We evaluate Katz using two different damping factors, β = .5 and .5. The results for the two damping factors are similar, and we only report the results for β = .5 in the interest of brevity. Katz achieves both low FN and low FP. Katz is the best in the Digg and MySpace datasets, and the second best in LiveJournal and Flickr. In the YouTube dataset, Katz does not give a good
trade-off curve. In the Wikipedia dataset, Katz performs slightly worse than PRP, which is optimized for predicting hyperlink structures. We will further investigate the reason behind Katz's performance later in this section.

Graph distance based proximity measure. The graph distance (denoted GD) achieves high accuracy in Digg, LiveJournal, MySpace, Flickr, and Wikipedia. But it has several important limitations. First, shortest path distance is determined solely by the shortest path and takes only integer values, so it has coarse granularity and introduces many ties. Second, computing shortest path distance is much more expensive in large networks [5] (also shown in Table 6).

Effect of link symmetry. To understand the effect of link asymmetry, we make the friendship relation reciprocal by adding the reverse link for all existing node pairs. Table 2 shows the fraction of asymmetric links in our datasets. Figure 9 shows how the symmetric predictors perform on the asymmetric datasets. In the Wikipedia and Digg datasets, adding reverse edges improves accuracy. But adding reverse edges does not always help: accuracy stays almost the same in Flickr and becomes slightly worse in LiveJournal.

A good empirical performance indicator. Despite the significant variation, we find that the ranking between sophisticated predictors such as Katz and simple predictors such as common neighbors and Adamic/Adar can be qualitatively predicted based on the fraction of edges contributed by the highest degree nodes, as follows.

Network | # of high deg. links | # of total links | % of high deg. links
Digg | ,86,73 | 4,83, | %
Flickr | 23,535,63 | 3,393, | %
LiveJournal | 6,58,439 | 6,92, | %
MySpace | 43,877,84 | 9,629, | %
YouTube | 9,392,367 | 3,7,64 | 72.%
Wikipedia | 8,936,865 | 33,974, | %
Table 8: Proportion of links connected to the top .% highest degree nodes in each network.
Table 8 shows the number of links incoming to and outgoing from the highest degree nodes, the total number of links in the entire network, and their proportion, where a node is considered a high degree node if its degree (combining incoming and outgoing degrees) is among the top .% of node degrees. Ranking the networks by the percentage of links connected to such highest degree nodes (in decreasing order), we obtain: LiveJournal > Flickr > YouTube ≈ Wikipedia > MySpace > Digg, which is consistent with the ranking between direct proximity measures and sophisticated measures shown in Figure 8 and Figure 9. The ranking shows that as the coverage of the high degree nodes increases, direct proximity measures (such as the number of common neighbors and shortest path distance) perform better relative to Katz, and vice versa.
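The indicator in Table 8 can be sketched as follows. This is a toy illustration: the paper uses the top .% highest degree nodes as the cutoff, while a much larger fraction `q` is used here so the toy graph contains at least one hub:

```python
# Fraction of links that touch the top-q fraction of highest-degree nodes.

def high_degree_link_fraction(edges, q):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    k = max(1, int(q * len(deg)))                       # number of hub nodes
    hubs = set(sorted(deg, key=deg.get, reverse=True)[:k])
    touched = sum(1 for u, v in edges if u in hubs or v in hubs)
    return touched / len(edges)

# Star plus one outside edge: the hub (node 0) touches 4 of the 5 links.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5)]
print(high_degree_link_fraction(edges, 0.2))  # 0.8
```

Networks where this fraction is high (e.g., LiveJournal) are those where the direct measures outperform Katz, so the statistic serves as a cheap predictor of which family of measures to prefer.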

Figure 8: Link prediction accuracy for different online social networks; panels: (a) Digg, (b) Flickr, (c) LiveJournal, (d) MySpace, (e) YouTube, (f) Wikipedia.

4.3.3 Evaluation of Composite Link Predictor

Next we combine multiple proximity measures to improve link prediction accuracy. When building a decision tree, at each node REPtree chooses the attribute (e.g., proximity measure) that splits the samples into different classes (e.g., positive and negative) most effectively, using information gain as the criterion. To draw the entire accuracy trade-off curve, we vary the weights of positive and negative samples in the training set. When the weights of positive samples are large, the learner focuses more on positive samples and tries to minimize false negatives; when the weights of negative samples are large, the learner focuses more on negative samples and tries to minimize false positives.

Figure 10 depicts example decision trees for Digg, Flickr, and Wikipedia (where the positive:negative sample weight ratios are :, :, and :, respectively). The decision process starts from the root node and drills down, examining the corresponding metric value at each internal node, until it reaches a leaf.

Figure 8 shows the performance of the decision tree on the different online social networks. We observe that REPtree consistently achieves the best accuracy. Specifically, REPtree outperforms the best basic link predictors in the Flickr, YouTube, and Wikipedia datasets. In Figure 8(b) and (c), while the overall accuracy trade-off curve for REPtree is very close to the curve for shortest path distance, the composite link predictor provides much better resolution and fills in the intermediate points missed by the shortest path predictor. This is useful when we want to further rank friendship candidates that share the same shortest path distance.
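The effect of reweighting positive and negative samples can be illustrated with a much simpler learner than REPtree: a single-feature decision stump whose threshold minimizes weighted misclassification error. This is a hypothetical sketch, not the WEKA setup used in our experiments:

```python
# Decision stump: predict "link" (1) when the proximity score exceeds a
# threshold chosen to minimize *weighted* error. Raising the weight of
# positive samples pushes the threshold down (fewer false negatives),
# tracing out one end of the FN/FP trade-off curve.

def best_threshold(scores, labels, w_pos, w_neg):
    """Return the threshold minimizing weighted misclassification error."""
    candidates = sorted(set(scores)) + [max(scores) + 1]  # +1: "predict none"
    best_t, best_err = None, float("inf")
    for t in candidates:
        err = 0.0
        for s, y in zip(scores, labels):
            pred = 1 if s >= t else 0
            if pred != y:
                err += w_pos if y == 1 else w_neg
        if err < best_err:
            best_t, best_err = t, err
    return best_t

scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0,   1,   0,   0,   1,   1  ]
print(best_threshold(scores, labels, w_pos=1,  w_neg=1))  # 0.7
print(best_threshold(scores, labels, w_pos=10, w_neg=1))  # 0.3 (catches all positives)
```

Sweeping the weight ratio from heavily negative to heavily positive yields a family of classifiers, one point each on the FP/FN curve, which is the same mechanism used with REPtree above.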
Since the shortest path distance is integer-valued and typically very small in social networks (due to the small-world effect [27]), it yields only coarse-grained control (e.g., depending on whether the threshold is 2 or 3 hops, a large number of node pairs will be classified as positive or negative at the same time).

5. RELATED WORK

Social networks. Traditional social networks have been widely studied by sociologists; see [52] for a detailed review. Social networks have also found a wide range of applications in business (e.g., viral marketing [23], fraud detection []), information technology (e.g., improving Internet search [35]), computer networks (e.g., overlay network construction [45]), and cyber security (e.g., spam mitigation [22], identity verification [56]). The explosive growth of online social networks has attracted significant attention [2, 3, 5, 37]. In particular, [3, 37] have conducted in-depth measurement-based studies of several online social networks and report the power-law, small-world, and scale-free properties of online social networks. Unlike these studies, which analyze the general topological structure of the networks, we focus on scalable proximity estimation and link prediction in online social networks.

Proximity measures. There is a rich body of literature on proximity measures (e.g., [, 26, 28, 3, 3, 38, 47, 48, 5]). For example, [5] proposes escape probability as a useful measure of direction-aware proximity, which is closely related to the rooted PageRank we consider, but their technique for computing escape probability can only scale to networks with tens of thousands of nodes. Recent works on proximity measures (e.g., [28, 48]) either dismiss path-ensemble based proximity measures due to their high computational cost or leave comparison with them as future work.

Link prediction. [3, 3] first define the link prediction problem for social networks.
To predict links, they calculate ten proximity measures between node pairs of the graph. Node pairs are then ranked by score, with higher-scoring pairs considered more likely to form a link. They measure the effectiveness of the different proximity measures in predicting links in five co-authorship networks. However, they are limited to relatively small networks with only about 5 nodes. Moreover, they do not combine different measures to achieve better link prediction accuracy.

Dimensionality reduction. A variety of dimensionality reduction techniques have been developed in the area of data stream computation (see [39] for a detailed survey). One powerful technique is the sketch [9, , 24, 29], a probabilistic summary technique proposed for analyzing large streaming datasets. Sketches achieve dimensionality reduction using projections along random vectors. Our proximity sketch is closely related to the count-min sketch [].
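For readers unfamiliar with it, a count-min sketch maintains H hash tables of c counters each: an addition increments one counter per table, and a point query returns the minimum of the corresponding counters, an overestimate with bounded error. A minimal illustration (the hash choice and table sizes here are ours, not those of our proximity sketch):

```python
import hashlib

# Minimal count-min sketch: H rows of c counters; point queries return
# the minimum counter over the rows, never underestimating the true count.

class CountMin:
    def __init__(self, tables=3, cols=16):
        self.counts = [[0] * cols for _ in range(tables)]

    def _cell(self, row, key):
        # Derive a per-row hash by salting the key with the row index.
        h = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(h[:4], "big") % len(self.counts[row])

    def add(self, key, amount=1):
        for r in range(len(self.counts)):
            self.counts[r][self._cell(r, key)] += amount

    def query(self, key):
        return min(self.counts[r][self._cell(r, key)]
                   for r in range(len(self.counts)))

cm = CountMin()
for _ in range(5):
    cm.add("x")
cm.add("y", 2)
print(cm.query("x"))  # >= 5 (exactly 5 unless "x" collides in every row)
```

Because collisions only inflate counters, large elements are preserved accurately while small elements may be overestimated, mirroring why our proximity sketch excels at estimating large proximity values.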

Figure 9: Link prediction accuracy with undirected edges; panels: (a) Digg, (b) LiveJournal, (c) Flickr, (d) Wikipedia. Note that edges in MySpace and YouTube are already undirected.

Figure 10: Examples of decision trees for link prediction; panels: (a) Digg, (b) Flickr, (c) Wikipedia.

In computer networking, Ng and Zhang [43] propose the first network embedding technique, called Global Network Positioning (GNP), to infer end-to-end delay by embedding network delay in a low-dimensional Euclidean space. Several enhancements have since been proposed [32, 34, 49]. In particular, [34] applies matrix factorization and can handle path asymmetry and violations of the triangle inequality. To our knowledge, all these techniques have only been applied to networks with a few thousand nodes. Moreover, proximity is the opposite of distance: the lower the distance, the higher the proximity. Thus, techniques for network distance embedding may not work well for proximity embedding.

6. CONCLUSIONS

In this paper, we develop several novel techniques to approximate a large family of proximity measures in massive, highly dynamic online social networks. Our techniques are accurate and can easily handle networks with millions of nodes, several orders of magnitude larger than what existing methods can support. We then conduct extensive experiments on the effectiveness of a variety of proximity measures for predicting links in five popular online social networks.
Our key new findings include: (i) the effectiveness of different proximity measures varies significantly across different networks and depends heavily on the fraction of edges contributed by the highest degree nodes, and (ii) combining multiple proximity measures consistently yields the best accuracy. In the future, we plan to leverage the insights gained in this paper to design better proximity measures and more accurate link prediction algorithms.

Acknowledgments

This work was supported in part by NSF grants. We thank Inderjit Dhillon and the anonymous reviewers for their valuable feedback. We also thank Alan Mislove and Krishna Gummadi for sharing their online social network datasets.

7. REFERENCES

[1] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 2003.
[2] L. A. Adamic, O. Buyukkokten, and E. Adar. A social network caught in the web. First Monday, 2003.
[3] Y.-Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social networking services. In Proc. of WWW, 2007.
[4] Alexa global top 500 sites.
[5] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. In Proc. of KDD, 2006.
[6] A. L. Barabasi, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaboration. Physica A: Statistical Mechanics and its Applications, 2002.
[7] R. M. Bell, Y. Koren, and C. Volinsky. Chasing $1,000,000: How we won the Netflix Progress Prize. Statistical Computing and Statistical Graphics Newsletter, 18(2):4-12, 2007.
[8] S. Chakrabarti. Dynamic personalized PageRank in entity-relation graphs. In Proc. of WWW, 2007.
[9] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. of International Colloquium on Automata, Languages and Programming (ICALP), 2002.
[10] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 2005.
[11] C. Cortes, D. Pregibon, and C. T. Volinsky. Communities of interest. Intelligent Data Analysis, 2002.
[12] K. Csalogány, D. Fogaras, B. Rácz, and T. Sarlós. Towards scaling fully personalized PageRank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2005.
[13] J. Davidsen, H. Ebel, and S. Bornholdt. Emergence of a small world from local interactions: Modeling acquaintance networks. Physical Review Letters, 2002.
[14] Digg.
[15] Digg API.
[16] P. G. Doyle and J. L. Snell. Random walks and electric networks, Jan. 2000. [Online].
[17] D. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne. A worldwide flock of Condors: Load sharing among workstation clusters. Future Generation Computer Systems, 1996.
[18] Facebook.
[19] Facebook statistics.
[20] Flickr.
[21] S. Garner. Weka: The Waikato environment for knowledge analysis. In Proc. of the New Zealand Computer Science Research Students Conference, pages 57-64, 1995.
[22] S. Garriss, M. Kaminsky, M. J. Freedman, B. Karp, D. Mazieres, and H. Yu. RE: Reliable email. In Proc. of Networked Systems Design and Implementation (NSDI), 2006.
[23] S. Hill, F. Provost, and C. Volinsky. Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, 2006.
[24] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. of the 41st Symposium on Foundations of Computer Science (FOCS), 2000.
[25] E. M. Jin, M. Girvan, and M. E. J. Newman. The structure of growing social networks. Physical Review Letters, 2001.
[26] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 1953.
[27] M. Kochen, editor. The Small World. Ablex, Norwood, NJ, 1989.
[28] Y. Koren, S. C. North, and C. Volinsky. Measuring and extracting proximity in networks. In Proc. of KDD, 2006.
[29] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Proc. of IMC, 2003.
[30] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Proc. of Conference on Information and Knowledge Management (CIKM), 2003.
[31] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol., 2007.
[32] H. Lim, J. Hou, and C. H. Choi. Constructing Internet coordinate system based on delay measurement. In Proc. of IMC, 2003.
[33] LiveJournal.
[34] Y. Mao and L. K. Saul. Modeling distances in large-scale networks by matrix factorization. In Proc. of IMC, 2004.
[35] A. Mislove, K. P. Gummadi, and P. Druschel. Exploiting social networks for Internet search. In Proc. of HotNets-V, 2006.
[36] A. Mislove, H. S. Koppula, K. Gummadi, P. Druschel, and B. Bhattacharjee. Growth of the Flickr social network. In Proc. of SIGCOMM Workshop on Online Social Networks (WOSN), 2008.
[37] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proc. of IMC, 2007.
[38] T. Murata and S. Moriyasu. Link prediction of social networks based on weighted proximity measures. In Proc. of International Conference on Web Intelligence, 2007.
[39] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 2005.
[40] MySpace.
[41] MySpace 4 millionth user. http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=4.
[42] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 2003.
[43] T. E. Ng and H. Zhang. Predicting Internet network distance with coordinate-based approaches. In Proc. of INFOCOM, 2002.
[44] Nielsen Online. Fastest growing social networks for September 2008. http://blog.nielsen.com/nielsenwire/wp-content/uploads/28//press_release24.pdf.
[45] J. A. Patel, I. Gupta, and N. Contractor. JetStream: Achieving predictable gossip dissemination by leveraging social network principles. In Proc. of IEEE Network Computing and Applications (NCA), 2006.
[46] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[47] M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Proc. of NIPS, 2002.
[48] P. Sarkar, A. W. Moore, and A. Prakash. Fast incremental proximity search in large graphs. In Proc. of ICML, 2008.
[49] L. Tang and M. Crovella. Virtual landmarks for the Internet. In Proc. of IMC, 2003.
[50] H. Tong, C. Faloutsos, and Y. Koren. Fast direction-aware proximity for graph mining. In Proc. of KDD, 2007.
[51] M. V. Vieira, B. M. Fonseca, R. Damazio, P. B. Golgher, D. de Castro Reis, and B. Ribeiro-Neto. Efficient search ranking in social networks. In Proc. of CIKM, 2007.
[52] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, UK, 1994.
[53] Wikipedia. Social network.
[54] Wikipedia.
[55] YouTube.
[56] H. Yu, M. Kaminsky, P. B. Gibbons, and A. Flaxman. SybilGuard: Defending against Sybil attacks via social networks. In Proc. of SIGCOMM, 2006.


More information

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm. Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace

More information

Load balancing Static Load Balancing

Load balancing Static Load Balancing Chapter 7 Load Balancing and Termination Detection Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Random graphs with a given degree sequence

Random graphs with a given degree sequence Sourav Chatterjee (NYU) Persi Diaconis (Stanford) Allan Sly (Microsoft) Let G be an undirected simple graph on n vertices. Let d 1,..., d n be the degrees of the vertices of G arranged in descending order.

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

Joint models for classification and comparison of mortality in different countries.

Joint models for classification and comparison of mortality in different countries. Joint models for classification and comparison of mortality in different countries. Viani D. Biatat 1 and Iain D. Currie 1 1 Department of Actuarial Mathematics and Statistics, and the Maxwell Institute

More information

Social Network Mining

Social Network Mining Social Network Mining Data Mining November 11, 2013 Frank Takes (ftakes@liacs.nl) LIACS, Universiteit Leiden Overview Social Network Analysis Graph Mining Online Social Networks Friendship Graph Semantics

More information

Master s Thesis. A Study on Active Queue Management Mechanisms for. Internet Routers: Design, Performance Analysis, and.

Master s Thesis. A Study on Active Queue Management Mechanisms for. Internet Routers: Design, Performance Analysis, and. Master s Thesis Title A Study on Active Queue Management Mechanisms for Internet Routers: Design, Performance Analysis, and Parameter Tuning Supervisor Prof. Masayuki Murata Author Tomoya Eguchi February

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University

More information

Common Core Unit Summary Grades 6 to 8

Common Core Unit Summary Grades 6 to 8 Common Core Unit Summary Grades 6 to 8 Grade 8: Unit 1: Congruence and Similarity- 8G1-8G5 rotations reflections and translations,( RRT=congruence) understand congruence of 2 d figures after RRT Dilations

More information

Mining Social Network Graphs

Mining Social Network Graphs Mining Social Network Graphs Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata November 13, 17, 2014 Social Network No introduc+on required Really? We s7ll need to understand

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Stability of QOS. Avinash Varadarajan, Subhransu Maji {avinash,smaji}@cs.berkeley.edu

Stability of QOS. Avinash Varadarajan, Subhransu Maji {avinash,smaji}@cs.berkeley.edu Stability of QOS Avinash Varadarajan, Subhransu Maji {avinash,smaji}@cs.berkeley.edu Abstract Given a choice between two services, rest of the things being equal, it is natural to prefer the one with more

More information

Semantic Search in Portals using Ontologies

Semantic Search in Portals using Ontologies Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

COUNTING INDEPENDENT SETS IN SOME CLASSES OF (ALMOST) REGULAR GRAPHS

COUNTING INDEPENDENT SETS IN SOME CLASSES OF (ALMOST) REGULAR GRAPHS COUNTING INDEPENDENT SETS IN SOME CLASSES OF (ALMOST) REGULAR GRAPHS Alexander Burstein Department of Mathematics Howard University Washington, DC 259, USA aburstein@howard.edu Sergey Kitaev Mathematics

More information

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

More information

OPRE 6201 : 2. Simplex Method

OPRE 6201 : 2. Simplex Method OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2

More information