K A I S T Department of Computer Science



Similar documents
Analysis of Topological Characteristics of Huge Online Social Networking Services

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

The ebay Graph: How Do Online Auction Users Interact?


Research Article A Comparison of Online Social Networks and Real-Life Social Networks: A Study of Sina Microblogging

Measurement and Analysis of Online Social Networks

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Understanding the evolution dynamics of internet topology

Evolution of a large online social network

The Joint Degree Distribution as a Definitive Metric of the Internet AS-level Topologies

Greedy Routing on Hidden Metric Spaces as a Foundation of Scalable Routing Architectures

Temporal Dynamics of Scale-Free Networks

How To Understand The Network Of A Network

Distance Degree Sequences for Network Analysis

Understanding Graph Sampling Algorithms for Social Network Analysis

Graph Mining Techniques for Social Media Analysis

Structural constraints in complex networks

The architecture of complex weighted networks

Some questions... Graphs

Inet-3.0: Internet Topology Generator

Computer Network Topologies: Models and Generation Tools

Evolution of the Internet AS-Level Ecosystem

A Graph-Based Friend Recommendation System Using Genetic Algorithm

Chapter 29 Scale-Free Network Topologies with Clustering Similar to Online Social Networks

Online Estimating the k Central Nodes of a Network

An Analysis of Verifications in Microblogging Social Networks - Sina Weibo

Towards Modelling The Internet Topology The Interactive Growth Model

Analysis of Internet Topologies: A Historical View

Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Eight Friends Are Enough: Social Graph Approximation via Public Listings

QUANTITIVE AND SOCIOLOGICAL ANALYSIS OF BLOG NETWORKS

Expansion Properties of Large Social Graphs

Evaluation of a New Method for Measuring the Internet Degree Distribution: Simulation Results

GENERATING AN ASSORTATIVE NETWORK WITH A GIVEN DEGREE DISTRIBUTION

Graph models for the Web and the Internet. Elias Koutsoupias University of Athens and UCLA. Crete, July 2003

A discussion of Statistical Mechanics of Complex Networks P. Part I

The Internet Is Like A Jellyfish

Structure of a large social network

PUBLIC TRANSPORT SYSTEMS IN POLAND: FROM BIAŁYSTOK TO ZIELONA GÓRA BY BUS AND TRAM USING UNIVERSAL STATISTICS OF COMPLEX NETWORKS

Graph theoretic approach to analyze amino acid network

The mathematics of networks

Time-Dependent Complex Networks:

S3G2: a Scalable Structure-correlated Social Graph Generator

Small-World Characteristics of Internet Topologies and Implications on Multicast Scaling

Crawling Facebook for Social Network Analysis Purposes

Comparison of Online Social Relations in Terms of Volume vs. Interaction: A Case Study of Cyworld

Information Network or Social Network? The Structure of the Twitter Follow Graph

Social Networs - A Review of alex hypothesis

Link Prediction in Social Networks

An Alternative Web Search Strategy? Abstract

Homophily in Online Social Networks

Comparing Trade-off Based Models of the Internet

Social Network Mining

Characterizing and Modelling Clustering Features in AS-Level Internet Topology

Analysis of Internet Topologies

Link Prediction Analysis in the Wikipedia Collaboration Graph

General Network Analysis: Graph-theoretic. COMP572 Fall 2009

Network/Graph Theory. What is a Network? What is network theory? Graph-based representations. Friendship Network. What makes a problem graph-like?

An Investigation of Synchrony in Transport Networks

Predict the Popularity of YouTube Videos Using Early View Data

Evolution of a Location-based Online Social Network: Analysis and Models

Bioinformatics: Network Analysis

Beyond Social Graphs: User Interactions in Online Social Networks and their Implications

Collaborative Blog Spam Filtering Using Adaptive Percolation Search

Are You Moved by Your Social Network Application?

A box-covering algorithm for fractal scaling in scale-free networks

Structure and evolution of online social relationships: Heterogeneity in unrestricted discussions

NETWORK SCIENCE DEGREE CORRELATION

The Shape of the Network. The Shape of the Internet. Why study topology? Internet topologies. Early work. More on topologies..

School of Computer Science Carnegie Mellon Graph Mining, self-similarity and power laws

A Measurement-driven Analysis of Information Propagation in the Flickr Social Network

The Structure of Growing Social Networks

The Network Structure of Hard Combinatorial Landscapes

Grade 6 Mathematics Performance Level Descriptors

Business Intelligence and Process Modelling

Strong and Weak Ties

On the Impact of Route Monitor Selection

Simulating the Effect of Privacy Concerns in Online Social Networks

Efficient Search in Gnutella-like Small-World Peerto-Peer

Part 2: Community Detection

The Topology of Large-Scale Engineering Problem-Solving Networks

The Structure of an Autonomic Network

Routing on a weighted scale-free network

A dynamic model for on-line social networks

A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION

A SURVEY OF PROPERTIES AND MODELS OF ON- LINE SOCIAL NETWORKS

Practical Graph Mining with R. 5. Link Analysis

Random graph models of social networks

Statistical theory of Internet exploration

The Connectivity and Fault-Tolerance of the Internet Topology. Christopher R. Palmer Georgos Siganos Michalis Faloutsos

MINING COMMUNITIES OF BLOGGERS: A CASE STUDY

Theory of Aces: Fame by chance or merit?

Some Examples of Network Measurements

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis

arxiv:cs.dm/ v1 30 Mar 2002

Social Networks and Social Media

STATISTICAL ANALYSIS OF THE INDIAN RAILWAY NETWORK: A COMPLEX NETWORK APPROACH

Information Diffusion in Mobile Social Networks: The Speed Perspective

Zipf s law and the Internet

Transcription:

Impact of snowball sampling ratios on networ characteristics estimation: A case study of Cyworld Haewoon Kwa, Seungyeop Han, Yong-Yeol Ahn, Sue Moon, and Hawoong Jeong KAIST CS/TR-2006-262 November 2006 K A I S T Department of Computer Science

Impact of snowball sampling ratios on networ characteristics estimation: A case study of Cyworld Haewoon Kwa, Seungyeop Han, Yong-Yeol Ahn, Sue Moon, and Hawoong Jeong Korea Advanced Institute of Science and Technology {haewoon, syhan}@an.aist.ac.r, yongyeol@gmail.com, sbmoon@cs.aist.ac.r, hjeong@aist.ac.r Abstract Today s social networing services have tens of millions of users, and are growing fast. Their sheer size poses a significant challenge in capturing and analyzing their topological characteristics. Snowball sampling is a popular method to crawl and sample networ topologies, but requires a high sampling ratio for accurate estimation of certain metrics. In this wor, we evaluate how close topo-logical characteristics of snowball sampled networs are to the complete networ. Instead of using a synthetically generated topology, we use the complete topology of Cyworld ilchon networ. The goal of this wor is to determine sampling ratios for accurate estimation of ey topological charac-teristics, such as the degree distribution, the degree correlation, the assortativity, and the clustering coefficient. I. INTRODUCTION Social networing service (SNS) has been proliferating on the Internet and gaining popularity [13]. The number of users of MySpace 1 exceeds more than 100 million users, and other popular SNSs including orut 2 have also more than 10 million users. Much research on SNS has been done [2], [14]. However, the size of SNS is too large to handle, and most of the research has depended on networ sampling methods. Snowball sampling is one of the popular networing sampling methods, such as node sampling, lin sampling, etc. All sampling methods tae a part of the complete networ by some rules. As the sampling ratio increases, the networ more accurately describes the complete networ. In other words, the higher sampling ratio, the more accurate estimation results one. However, for the high sampling ratio, more space and time are required. The trade-off lies between accuracy and resource. Lee et al. has studied the impact of sampling ratios on parameter estimation for some networ sampling methods, such as snowball sampling [7]. However, the size of the networs they have analyzed, including the Internet at the autonomous systems (AS) level, is two to three orders of magnitude smaller than todays commercially available SNSs. Characteristics of a networ topology are often not straightforward to capture by a metric or two, and the difficulty of accurate estimation lies in the diversity of metrics and their sensitivity to topology. Palmer et al. have showed that their ANF algorithm outperforms the other sampling methods including snowball sampling in estimating neighbourhood function, N(h), which is the number of pairs of nodes within distance h [11]. However, they focused only on the neighbourhood function. Recently, Lesovec et al. have suggested using densification laws to reject bad sampled subgraphs [8]. In this paper, we investigate the appropriate snowball sampling ratio for accurate estimation of topological characteristics, such as the degree distribution, the degree correlation, the assortativity, and the clustering coefficient. We compare the complete topology of the Cyworld ilchon networ to sampled networs with varying sampling ratios. Our contributions lie in the methodological approach in the evaluation of the snowball sampling methods with Cyworld. A. Cyworld II. BACKGROUND Cyworld is the largest online social networing service in South Korea [12], launched in September 2001. By the end of November 2005, the number of Cyworld users sur-passed 12 million, 27% of the total population of South Korea. Lie other SNSs, Cyworld provides the means to build and manage human relationships. In Cyworld, 1 http://myspace.com 2 http://orut.com

a member may invite other members to establish a relationship called ilchon between them simply by clicing the add a friend button. Then, if the invited member accepts it, an ilchon relationship is established. Online ilchon relationships reflect various inds of real-world relationships: close friends and relatives, entertainment business worers and their fans, hob-byists of an interest, etc. From a topological view, ilchon relationships form an undirected networ. We get a graph structure by mapping a member into a node and an ilchon relationship into an edge. As an ilchon relationship is established with mutual acceptances, edges in an ilchon networ are undirected. The number of nodes in an ilchon networ thus represents the total number of Cyworld users, and the number of edges the number of established ilchon relationships. B. Snowball Sampling Method The process of snowball sampling is very similar to the breadth-first search. First, we randomly choose a seed node. Next, we visit all the nodes directly connected to the seed node. Here, we define a layer as the nodes which have the same hop count to reach a seed node. In other words, all nodes directly connected to the seed node are said to be on the first layer. Then, we get the second layer by visiting all nodes directly connected to the nodes on the first layer. This process continues until the number of visited nodes reaches a limit determined by the sampling ratio. In order to control the number of visited nodes, we stop sampling even if not all the nodes on the last layer are visited. By definition, hub nodes are connected to many nodes, and liely to be visited in few layers in snowball sampling. Snowball sampled networs have a bias towards hub nodes. While this bias leads overestimation of some metrics lie degree distribution, whether the initial node is a hub or not does not mae a noticeable difference in characterizing the sampled networ [7]. Nevertheless, the snowball sampling method is the one of only a few available solutions when we crawl the web. When we crawl an SNS site for topology information gathering, we can only follow the lins from a node each time. We do not have random access to the overall topology of the networ, and cannot directly access nodes or lins which are not adjacent to each other. This is the reason why the lin and node sampling methods are infeasible in crawling. III. CHARACTERISTICS OF SAMPLED NETWORKS ON CYWORLD In this section, we compare the degree distribution, degree correlation, and assortativity of the complete Cyworld networ with those of sampled networs with various sampling ratios. We use an anonymized snapshot of the entire ilchon networ topology in November 2005. From this complete networ topology, we construct sampled networs with varying sampling ratios. The sampling ratios vary from 0.1% to 1.0% in steps of 0.1%. Below the sampling ratio of 0.1% and beyond the sampling ratio of 1.0%, we use the following sampling ratios only: 0.01%, 0.05%, 0.25%, 0.75%, and 2.0 A. Degree Distribution The degree of a node is defined as the number of edges incident to the node. Many real networs, including online SNSs, are nown to have a power-law degree distribution p() r, where p() is the empirical probability of a node having degree [1], [4], [10]. The degree distribution is often plotted as a complementary cumulative probability function (CCDF), () P ( )d α (γ 1). Figure 1 plots the complementary cumulative degree distribution of the complete Cyworld ilchon networ. The plot exhibits a multi-scaling behavior: the first part below the degree of 500 has an estimated exponent of -5 and the second part has -2. The well-nown Hill estimator is used to estimate the tail behavior of a distribution [6]. Crovella et al. have proposed an empirically sound method to estimate the scaling exponent of a heavy tail distribution [3]. However, both approaches are not directly applicable to distributions of multi-scaling behavior. We resort to manual identification of two different scaling regions and line-fitting in our exponent estimation. We plot the degree distributions of sample networs with sampling ratios of 0.1%, 0.2%, 0.25%, and 0.5% in Figure 2. The multi-scaling behavior observed in the complete networ appears only in 1 out of 5 sampled networs with the sampling ratio of 0.1%, 3 out of 5 with the sampling ratio of 0.2%, and in all 5 with the sampling ratio of 0.25% and 0.5%. Although one sampled networ in Figure 2(c) has a different scaling exponent from four other sampled networs, it basically comes from the sampling methodology which taes a different part of the complete networ.

() Fig. 1. -4-1 10-7 10-8 10 6 Degree distribution of the complete Cyworld Networ () () 10 6 10 6 (a) ratio = 0.1% (b) ratio = 0.2% -2.5 () () -0.8 10 6 (c) ratio = 0.25% 10 6 (d) ratio = 0.5% Fig. 2. Degree distributions from sampled networs of Cyworld From Figure 2, we observe a multi-scaling behavior from all five sampled networs when the sampling ratio is larger than 0.25%. In contrast, we note that the multi-scaling behavior is present in only one or two sampled networs with ratio below 0.25%. When the sampling ratio is 0.5% or higher, we observe the multi-scaling behavior with exponents of -3.5 and -1.8, different from -5 and -2 of the complete networ. It has been reported that the scaling exponent of snowball sampled networs is smaller than that of the complete networ [7]; our results conforms to it. In addition, the scaling exponent of high-degree nodes is closer to that of the complete networ than that of low-degree nodes. In snowball sampling, high-degree nodes are more liely to be selected than others, and thus the scaling exponent of high-degree nodes approaches that of the complete networ more quicly than in the case of low-degree nodes. To detect multi-scaling behavior from sampled networs from any sampled networ, at least 0.25% or higher sam-pling ratio is required. The estimated scaling exponents approach those of the complete networ only when the sampling ratio goes above 0.5%. For scaling exponent values within the range of 1 the those of the complete

networ, a sampling ratio of 0.5% or higher is needed. The multi-scaling behavior has been observed in Internet bacbone traffic [5], [15], but not in any online networs, such as bloggers networs and other SNSs lie MySpace. To our best nowledge, the Cyworld ilchon networ is the first to exhibit this ind of behavior among SNSs. We suspect a heterogeneous maeup of Cyworld users to be attributable to this multi-scaling behavior, as one-time commercial events and famous entertainers attract anomalously large numbers of users than most friends and family relationships. However, further wor is needed to confirm our conjecture and explicate the causes. B. Degree Correlation nn () 10 6 Fig. 3. Degree correlation from complete networ The degree correlation is a mapping between a node degree and the mean degree of neighbors of nodes having degree. We plot the degree correlation of the complete networ in Figure 3. Although a complex pattern emerges, it is clear that the overall correlation is negative. It is nown that if the trend of a degree correlation of a networ is negative, a hub node tends to be connected non-hub nodes [9]. Therefore, the plot in Figure 3 tells that the hub nodes of Cyworld is more liely to be con-nected to non-hub nodes than to hub nodes. We plot the degree correlation of sampled networs with sampling ratio of 0.05%, 0.1%, 0.2%, and 0.3% in Figure 4. All plots show complex patterns, and it is very hard to tell which is the 4(b) closest to Figure 3. In close examination of Figures 4(a) and, we observe that some graphs do not have a negative trend for 1 < < 100. Thus, from these sampling experiments, we recommend a sampling ratio of 0.2% or higher for degree correlation estimation. C. Assortativity Assortativity is a measure to summarize the distribution of degree correlation. Lie the overall trend of degree correlation in Section III-B, a negative value of assorativity represents a tendency of a hub node to connect to non-hub nodes. The assortativity, r, can be written as below, M 1 i r = j i i [M 1 i 1 2 (j i + i )] 2 M 1 i 1 2 (j2 i + 2 i ) [M 1 i 1 2 (j i + i )], (1) 2 where j i and i are the degrees of the nodes at the ends of the ith edge, with i = 1, 2,, and M, and M is the number of all edges [9]. Figure 5 shows the changes in assorativity over sampling ratios of 0.01% to 2% sampled networs. The horizontal line is the assortativity of the complete Cyworld networ. At the sampling ratio larger than 0.9%, the assorativity values of sampled networs fall within the range of complete networs. In [7], sampled networs from snowball sampling were shown to be more disassortative than the original networ whether the original networ is assortative or disassortative, but our results are very different. We cannot find some evidences which support that sampled networs are more disassortative than the original networ in our results. It

10 6 10 6 nn () nn () (a) ratio = 0.05% (b) ratio = 0.1% 10 6 10 6 nn () nn () (c) ratio = 0.2% (d) ratio = 0.3% Fig. 4. Degree correlation from sampled networs of Cyworld 0.6 0.5 sampled networ full networ 0.4 0.3 Assortativity 0.2 0.1 0-0.1-0.2-0.3 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 sampling ratio Fig. 5. Assortativity of sampled networs of Cyworld. The red line is from complete networ of Cyworld. seems that the assortativity of sampled networs has nothing to do with that of the complete networ. We are not sure why disassortative and assortative sampled networs are mixed up in our results. D. Clustering Coefficient The clustering coefficient C i of node i is the ratio of the total number of the existing lins between its neighbors, y, to the total number of all possible lins between its neighbors [4], and can be written as below, C i = 2y i ( i 1), (2) where i is the degree of node i. Figure 6 depicts average clustering coefficients of the sampled Cyworld networs with various sampling ratio. The minimum sampling ratio is 0.15%, and the maximum sampling ratio is 2.0%. We observe a gap between the sampled networ and complete networ closes as the sampling ratio increases.

clustering coefficient 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 sampling ratio sampled networ full networ Fig. 6. Average clustering coefficients of sampled networs of Cyworld. The red line is from complete networ of Cyworld In addition, all clustering coefficient values of sampled networs are larger than that of the complete networ. It means that the sampled networs from snowball sampling are more liely to be clustered than the complete networ. IV. DISCUSSION We have described the characteristics of sampled networs derived from the Cyworld ilchon networ. From the results, we could get some insights on snowball sampling. However, the results in this paper come from only one real networ, Cyworld, and may not apply to other networs. To generalize our results, we need to investigate if Cyworld is a representative of other social networs. In future, we plan to compare other SNSs with Cyworld. A simple qualitative analysis gave us some intuition about this problem. In Section III-A, Cyworld has two inds of ilchon networs corresponding to different scaling behavior. From the relationship between members perspective, there are two inds of ilchon in Cyworld. We conjecture that one reflects offline friendship, the other does online community. Except the difference of cultures of different countries, basically other SNSs support the similar feature with ilchon called friend, etc. Therefore, we expect that the friend networ would be similar to ilchon networ of Cyworld, because the members of other SNSs also mae their offline friends of friends in online, and the other SNSs support many features to meet attractive member online, for example, conditional search and form a new online community. Therefore, members in other SNSs also have on-line and off-line friends. This is the reason why we conjecture our results can be generalized. V. SUMMARY In this wor, we have studied the appropriate snowball sampling ratio of real social networing service, Cyworld. For good representation of multi-scaling behavior the degree distribution, we need a sampling ratio of 0.25% or larger. For accurate estimation of a negative trend of the degree correlation from all sampled networs, the sampling ratio of 0.2% or larger is required. For precise estimation of assortativity, we need a sampling ratio of 0.9% or larger. Finally, we can observe a gap between the sampled networ and complete networ closes as the sampling ratio increases for estimation of clustering coefficient. VI. ACKNOWLEDGEMENT We than Jeongsu Hong and Jaehyun Lim of SK Communications, Inc. for providing us with the Cyworld data. REFERENCES [1] R. Albert and A.-L. Barabasi. Statistical mechanics of complex networs. Reviews of Modern Physics, 74:47, 2002. [2] W. Bachni, S. Szymczy, P. Leszczynsi, R. Podsiadlo, E. Rymszewicz, L. Kurylo, D. Maowiec, and B. Byowsa. Quantitive and sociological analysis of blog networs. ACTA PHYSICA POLONICA B, 36:2435, 2005. [3] M. E. Crovella and M. S. Taqqu. Estimating the heavy tail index from scaling properties. Methodology and Computing in Applied Probability, 1(1):55 79, July 1999. [4] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networs. Advances in Physics, 51:1079, 2002.

[5] A. Feldmann, A. C. Gilbert, and W. Willinger. Data networs as cascades: Investigating the multifractal nature of internet WAN traffic. In SIGCOMM, pages 42 55, 1998. [6] B. M. Hill. A simple general approach to inference about the tail of a distribution. The Annals of Statistics, 3(5):1163 1174, 1975. [7] S. H. Lee, P.-J. Kim, and H. Jeong. Statistical properties of sampled networs. Phys. Rev. E, 73:016102, 2006. [8] J. Lesovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrining diameters and possible explanations. In SIGKDD, pages 177 187. ACM Press, 2005. [9] M. E. J. Newman. Assortative mixing in networs. Phys. Rev. Lett., 89:208701, 2002. [10] M. E. J. Newman. The structure and function of complex networs. SIAM Review, 45:167, 2003. [11] C. Palmer, P. Gibbons, and C. Faloutsos. ANF: A fast and scalable tool for data mining in massive graphs. In SIGKDD, 2002. [12] Raney.com. http://www.raney.com. [13] Techweb. http://www.techweb.com. [14] S. Wasserman and K. Faust. Social networ analysis. Cambridge University Press, Cambridge, 1994. [15] Z. Zhang, V. Ribeiro, S. Moon, and C. Diot. Small-Time scaling behaviors of internet bacbone traffic: an empirical study. In INFOCOM, 2003.