K A I S T Department of Computer Science

Impact of snowball sampling ratios on networ characteristics estimation: A case study of Cyworld Haewoon Kwa, Seungyeop Han, Yong-Yeol Ahn, Sue Moon, and Hawoong Jeong KAIST CS/TR-2006-262 November 2006 K A I S T Department of Computer Science

Impact of snowball sampling ratios on networ characteristics estimation: A case study of Cyworld Haewoon Kwa, Seungyeop Han, Yong-Yeol Ahn, Sue Moon, and Hawoong Jeong Korea Advanced Institute of Science and Technology {haewoon, syhan}@an.aist.ac.r, yongyeol@gmail.com, sbmoon@cs.aist.ac.r, hjeong@aist.ac.r Abstract Today s social networing services have tens of millions of users, and are growing fast. Their sheer size poses a significant challenge in capturing and analyzing their topological characteristics. Snowball sampling is a popular method to crawl and sample networ topologies, but requires a high sampling ratio for accurate estimation of certain metrics. In this wor, we evaluate how close topo-logical characteristics of snowball sampled networs are to the complete networ. Instead of using a synthetically generated topology, we use the complete topology of Cyworld ilchon networ. The goal of this wor is to determine sampling ratios for accurate estimation of ey topological charac-teristics, such as the degree distribution, the degree correlation, the assortativity, and the clustering coefficient. I. INTRODUCTION Social networing service (SNS) has been proliferating on the Internet and gaining popularity [13]. The number of users of MySpace 1 exceeds more than 100 million users, and other popular SNSs including orut 2 have also more than 10 million users. Much research on SNS has been done [2], [14]. However, the size of SNS is too large to handle, and most of the research has depended on networ sampling methods. Snowball sampling is one of the popular networing sampling methods, such as node sampling, lin sampling, etc. All sampling methods tae a part of the complete networ by some rules. As the sampling ratio increases, the networ more accurately describes the complete networ. In other words, the higher sampling ratio, the more accurate estimation results one. However, for the high sampling ratio, more space and time are required. The trade-off lies between accuracy and resource. Lee et al. has studied the impact of sampling ratios on parameter estimation for some networ sampling methods, such as snowball sampling [7]. However, the size of the networs they have analyzed, including the Internet at the autonomous systems (AS) level, is two to three orders of magnitude smaller than todays commercially available SNSs. Characteristics of a networ topology are often not straightforward to capture by a metric or two, and the difficulty of accurate estimation lies in the diversity of metrics and their sensitivity to topology. Palmer et al. have showed that their ANF algorithm outperforms the other sampling methods including snowball sampling in estimating neighbourhood function, N(h), which is the number of pairs of nodes within distance h [11]. However, they focused only on the neighbourhood function. Recently, Lesovec et al. have suggested using densification laws to reject bad sampled subgraphs [8]. In this paper, we investigate the appropriate snowball sampling ratio for accurate estimation of topological characteristics, such as the degree distribution, the degree correlation, the assortativity, and the clustering coefficient. We compare the complete topology of the Cyworld ilchon networ to sampled networs with varying sampling ratios. Our contributions lie in the methodological approach in the evaluation of the snowball sampling methods with Cyworld. A. Cyworld II. BACKGROUND Cyworld is the largest online social networing service in South Korea [12], launched in September 2001. By the end of November 2005, the number of Cyworld users sur-passed 12 million, 27% of the total population of South Korea. Lie other SNSs, Cyworld provides the means to build and manage human relationships. In Cyworld, 1 http://myspace.com 2 http://orut.com

a member may invite other members to establish a relationship called ilchon between them simply by clicing the add a friend button. Then, if the invited member accepts it, an ilchon relationship is established. Online ilchon relationships reflect various inds of real-world relationships: close friends and relatives, entertainment business worers and their fans, hob-byists of an interest, etc. From a topological view, ilchon relationships form an undirected networ. We get a graph structure by mapping a member into a node and an ilchon relationship into an edge. As an ilchon relationship is established with mutual acceptances, edges in an ilchon networ are undirected. The number of nodes in an ilchon networ thus represents the total number of Cyworld users, and the number of edges the number of established ilchon relationships. B. Snowball Sampling Method The process of snowball sampling is very similar to the breadth-first search. First, we randomly choose a seed node. Next, we visit all the nodes directly connected to the seed node. Here, we define a layer as the nodes which have the same hop count to reach a seed node. In other words, all nodes directly connected to the seed node are said to be on the first layer. Then, we get the second layer by visiting all nodes directly connected to the nodes on the first layer. This process continues until the number of visited nodes reaches a limit determined by the sampling ratio. In order to control the number of visited nodes, we stop sampling even if not all the nodes on the last layer are visited. By definition, hub nodes are connected to many nodes, and liely to be visited in few layers in snowball sampling. Snowball sampled networs have a bias towards hub nodes. While this bias leads overestimation of some metrics lie degree distribution, whether the initial node is a hub or not does not mae a noticeable difference in characterizing the sampled networ [7]. Nevertheless, the snowball sampling method is the one of only a few available solutions when we crawl the web. When we crawl an SNS site for topology information gathering, we can only follow the lins from a node each time. We do not have random access to the overall topology of the networ, and cannot directly access nodes or lins which are not adjacent to each other. This is the reason why the lin and node sampling methods are infeasible in crawling. III. CHARACTERISTICS OF SAMPLED NETWORKS ON CYWORLD In this section, we compare the degree distribution, degree correlation, and assortativity of the complete Cyworld networ with those of sampled networs with various sampling ratios. We use an anonymized snapshot of the entire ilchon networ topology in November 2005. From this complete networ topology, we construct sampled networs with varying sampling ratios. The sampling ratios vary from 0.1% to 1.0% in steps of 0.1%. Below the sampling ratio of 0.1% and beyond the sampling ratio of 1.0%, we use the following sampling ratios only: 0.01%, 0.05%, 0.25%, 0.75%, and 2.0 A. Degree Distribution The degree of a node is defined as the number of edges incident to the node. Many real networs, including online SNSs, are nown to have a power-law degree distribution p() r, where p() is the empirical probability of a node having degree [1], [4], [10]. The degree distribution is often plotted as a complementary cumulative probability function (CCDF), () P ( )d α (γ 1). Figure 1 plots the complementary cumulative degree distribution of the complete Cyworld ilchon networ. The plot exhibits a multi-scaling behavior: the first part below the degree of 500 has an estimated exponent of -5 and the second part has -2. The well-nown Hill estimator is used to estimate the tail behavior of a distribution [6]. Crovella et al. have proposed an empirically sound method to estimate the scaling exponent of a heavy tail distribution [3]. However, both approaches are not directly applicable to distributions of multi-scaling behavior. We resort to manual identification of two different scaling regions and line-fitting in our exponent estimation. We plot the degree distributions of sample networs with sampling ratios of 0.1%, 0.2%, 0.25%, and 0.5% in Figure 2. The multi-scaling behavior observed in the complete networ appears only in 1 out of 5 sampled networs with the sampling ratio of 0.1%, 3 out of 5 with the sampling ratio of 0.2%, and in all 5 with the sampling ratio of 0.25% and 0.5%. Although one sampled networ in Figure 2(c) has a different scaling exponent from four other sampled networs, it basically comes from the sampling methodology which taes a different part of the complete networ.

() Fig. 1. -4-1 10-7 10-8 10 6 Degree distribution of the complete Cyworld Networ () () 10 6 10 6 (a) ratio = 0.1% (b) ratio = 0.2% -2.5 () () -0.8 10 6 (c) ratio = 0.25% 10 6 (d) ratio = 0.5% Fig. 2. Degree distributions from sampled networs of Cyworld From Figure 2, we observe a multi-scaling behavior from all five sampled networs when the sampling ratio is larger than 0.25%. In contrast, we note that the multi-scaling behavior is present in only one or two sampled networs with ratio below 0.25%. When the sampling ratio is 0.5% or higher, we observe the multi-scaling behavior with exponents of -3.5 and -1.8, different from -5 and -2 of the complete networ. It has been reported that the scaling exponent of snowball sampled networs is smaller than that of the complete networ [7]; our results conforms to it. In addition, the scaling exponent of high-degree nodes is closer to that of the complete networ than that of low-degree nodes. In snowball sampling, high-degree nodes are more liely to be selected than others, and thus the scaling exponent of high-degree nodes approaches that of the complete networ more quicly than in the case of low-degree nodes. To detect multi-scaling behavior from sampled networs from any sampled networ, at least 0.25% or higher sam-pling ratio is required. The estimated scaling exponents approach those of the complete networ only when the sampling ratio goes above 0.5%. For scaling exponent values within the range of 1 the those of the complete

networ, a sampling ratio of 0.5% or higher is needed. The multi-scaling behavior has been observed in Internet bacbone traffic [5], [15], but not in any online networs, such as bloggers networs and other SNSs lie MySpace. To our best nowledge, the Cyworld ilchon networ is the first to exhibit this ind of behavior among SNSs. We suspect a heterogeneous maeup of Cyworld users to be attributable to this multi-scaling behavior, as one-time commercial events and famous entertainers attract anomalously large numbers of users than most friends and family relationships. However, further wor is needed to confirm our conjecture and explicate the causes. B. Degree Correlation nn () 10 6 Fig. 3. Degree correlation from complete networ The degree correlation is a mapping between a node degree and the mean degree of neighbors of nodes having degree. We plot the degree correlation of the complete networ in Figure 3. Although a complex pattern emerges, it is clear that the overall correlation is negative. It is nown that if the trend of a degree correlation of a networ is negative, a hub node tends to be connected non-hub nodes [9]. Therefore, the plot in Figure 3 tells that the hub nodes of Cyworld is more liely to be con-nected to non-hub nodes than to hub nodes. We plot the degree correlation of sampled networs with sampling ratio of 0.05%, 0.1%, 0.2%, and 0.3% in Figure 4. All plots show complex patterns, and it is very hard to tell which is the 4(b) closest to Figure 3. In close examination of Figures 4(a) and, we observe that some graphs do not have a negative trend for 1 < < 100. Thus, from these sampling experiments, we recommend a sampling ratio of 0.2% or higher for degree correlation estimation. C. Assortativity Assortativity is a measure to summarize the distribution of degree correlation. Lie the overall trend of degree correlation in Section III-B, a negative value of assorativity represents a tendency of a hub node to connect to non-hub nodes. The assortativity, r, can be written as below, M 1 i r = j i i [M 1 i 1 2 (j i + i )] 2 M 1 i 1 2 (j2 i + 2 i ) [M 1 i 1 2 (j i + i )], (1) 2 where j i and i are the degrees of the nodes at the ends of the ith edge, with i = 1, 2,, and M, and M is the number of all edges [9]. Figure 5 shows the changes in assorativity over sampling ratios of 0.01% to 2% sampled networs. The horizontal line is the assortativity of the complete Cyworld networ. At the sampling ratio larger than 0.9%, the assorativity values of sampled networs fall within the range of complete networs. In [7], sampled networs from snowball sampling were shown to be more disassortative than the original networ whether the original networ is assortative or disassortative, but our results are very different. We cannot find some evidences which support that sampled networs are more disassortative than the original networ in our results. It

10 6 10 6 nn () nn () (a) ratio = 0.05% (b) ratio = 0.1% 10 6 10 6 nn () nn () (c) ratio = 0.2% (d) ratio = 0.3% Fig. 4. Degree correlation from sampled networs of Cyworld 0.6 0.5 sampled networ full networ 0.4 0.3 Assortativity 0.2 0.1 0-0.1-0.2-0.3 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 sampling ratio Fig. 5. Assortativity of sampled networs of Cyworld. The red line is from complete networ of Cyworld. seems that the assortativity of sampled networs has nothing to do with that of the complete networ. We are not sure why disassortative and assortative sampled networs are mixed up in our results. D. Clustering Coefficient The clustering coefficient C i of node i is the ratio of the total number of the existing lins between its neighbors, y, to the total number of all possible lins between its neighbors [4], and can be written as below, C i = 2y i ( i 1), (2) where i is the degree of node i. Figure 6 depicts average clustering coefficients of the sampled Cyworld networs with various sampling ratio. The minimum sampling ratio is 0.15%, and the maximum sampling ratio is 2.0%. We observe a gap between the sampled networ and complete networ closes as the sampling ratio increases.

clustering coefficient 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 sampling ratio sampled networ full networ Fig. 6. Average clustering coefficients of sampled networs of Cyworld. The red line is from complete networ of Cyworld In addition, all clustering coefficient values of sampled networs are larger than that of the complete networ. It means that the sampled networs from snowball sampling are more liely to be clustered than the complete networ. IV. DISCUSSION We have described the characteristics of sampled networs derived from the Cyworld ilchon networ. From the results, we could get some insights on snowball sampling. However, the results in this paper come from only one real networ, Cyworld, and may not apply to other networs. To generalize our results, we need to investigate if Cyworld is a representative of other social networs. In future, we plan to compare other SNSs with Cyworld. A simple qualitative analysis gave us some intuition about this problem. In Section III-A, Cyworld has two inds of ilchon networs corresponding to different scaling behavior. From the relationship between members perspective, there are two inds of ilchon in Cyworld. We conjecture that one reflects offline friendship, the other does online community. Except the difference of cultures of different countries, basically other SNSs support the similar feature with ilchon called friend, etc. Therefore, we expect that the friend networ would be similar to ilchon networ of Cyworld, because the members of other SNSs also mae their offline friends of friends in online, and the other SNSs support many features to meet attractive member online, for example, conditional search and form a new online community. Therefore, members in other SNSs also have on-line and off-line friends. This is the reason why we conjecture our results can be generalized. V. SUMMARY In this wor, we have studied the appropriate snowball sampling ratio of real social networing service, Cyworld. For good representation of multi-scaling behavior the degree distribution, we need a sampling ratio of 0.25% or larger. For accurate estimation of a negative trend of the degree correlation from all sampled networs, the sampling ratio of 0.2% or larger is required. For precise estimation of assortativity, we need a sampling ratio of 0.9% or larger. Finally, we can observe a gap between the sampled networ and complete networ closes as the sampling ratio increases for estimation of clustering coefficient. VI. ACKNOWLEDGEMENT We than Jeongsu Hong and Jaehyun Lim of SK Communications, Inc. for providing us with the Cyworld data. REFERENCES [1] R. Albert and A.-L. Barabasi. Statistical mechanics of complex networs. Reviews of Modern Physics, 74:47, 2002. [2] W. Bachni, S. Szymczy, P. Leszczynsi, R. Podsiadlo, E. Rymszewicz, L. Kurylo, D. Maowiec, and B. Byowsa. Quantitive and sociological analysis of blog networs. ACTA PHYSICA POLONICA B, 36:2435, 2005. [3] M. E. Crovella and M. S. Taqqu. Estimating the heavy tail index from scaling properties. Methodology and Computing in Applied Probability, 1(1):55 79, July 1999. [4] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networs. Advances in Physics, 51:1079, 2002.

[5] A. Feldmann, A. C. Gilbert, and W. Willinger. Data networs as cascades: Investigating the multifractal nature of internet WAN traffic. In SIGCOMM, pages 42 55, 1998. [6] B. M. Hill. A simple general approach to inference about the tail of a distribution. The Annals of Statistics, 3(5):1163 1174, 1975. [7] S. H. Lee, P.-J. Kim, and H. Jeong. Statistical properties of sampled networs. Phys. Rev. E, 73:016102, 2006. [8] J. Lesovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrining diameters and possible explanations. In SIGKDD, pages 177 187. ACM Press, 2005. [9] M. E. J. Newman. Assortative mixing in networs. Phys. Rev. Lett., 89:208701, 2002. [10] M. E. J. Newman. The structure and function of complex networs. SIAM Review, 45:167, 2003. [11] C. Palmer, P. Gibbons, and C. Faloutsos. ANF: A fast and scalable tool for data mining in massive graphs. In SIGKDD, 2002. [12] Raney.com. http://www.raney.com. [13] Techweb. http://www.techweb.com. [14] S. Wasserman and K. Faust. Social networ analysis. Cambridge University Press, Cambridge, 1994. [15] Z. Zhang, V. Ribeiro, S. Moon, and C. Diot. Small-Time scaling behaviors of internet bacbone traffic: an empirical study. In INFOCOM, 2003.