Sampling online social networks by random walk




Jianguo Lu¹, Dingding Li²
¹ School of Computer Science, University of Windsor
² Department of Economics, University of Windsor
Email: {jlu, dli}@uwindsor.ca
401 Sunset Avenue, Windsor, Ontario N9B 3P4, Canada

ABSTRACT

This paper proposes to use simple random walk, a sampling method supported by most online social networks (OSNs), to estimate a variety of properties of large OSNs. We show that, due to the scale-free nature of OSNs, the estimators derived from the random walk sampling scheme are much better than those based on uniform random sampling, even when uniform random samples are available and the notoriously high cost of obtaining them is disregarded. The paper first proposes to use the harmonic mean to estimate the average degree of OSNs. The accurate estimation of the average degree leads to the discovery of other properties, such as the population size, the heterogeneity of the degrees, the number of friends of friends, the threshold value for messages to reach a large component, and the Gini coefficient of the population. The method is validated on the complete Twitter data of 2009, which contains 42 million nodes and 1.5 billion edges.

Keywords

OSN, Online Social Network, Hansen-Hurwitz, Estimator, Scale-free network, Harmonic mean.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2012 ACM 978--4503-549-4...$5.00.

1. INTRODUCTION

The properties of online social networks are of great interest to the general public as well as to IT professionals. Yet the raw data are usually not available to the public, and the summaries released by the service providers are sketchy. Thus sampling is needed to reveal the hidden properties or structure of the underlying data [5, 20, 13]. For instance, we may want to learn the average number of friends in a network, or the average degree of a graph. One obvious but often impractical method is to select randomly a set of users $\{U_1, U_2, \ldots, U_n\}$, count the degree of each user $\{d_1, \ldots, d_n\}$, and calculate the sample mean as the estimate of the population mean:

$$\hat{d}_{SM} = \frac{1}{n}\sum_{i=1}^{n} d_i \tag{1}$$

The sample mean estimator $\hat{d}_{SM}$ is an unbiased estimator of the population mean if the users can be selected randomly with equal probability. Unfortunately, this is rarely the case in practice. When micro bloggers are selected, they are often not picked uniformly at random because of the limited access methods provided by OSN sites. Rather, more popular bloggers tend to have a higher probability of being sampled if users are crawled by following the links.

There are studies on sampling methods for OSNs [5, 20] and in related areas such as social networks [22, 26], graphs [13, 25], web URLs [8], and search engine indexes and the deep web [1, 17, 16]. The typical underlying techniques include Metropolis-Hastings Random Walk (MHRW) [18] for uniform sampling and Random Walk (RW) [14] for unequal probability sampling.

A random walk on a graph follows one of the links of the current node, chosen with equal probability among all its links. A blogger with more followers therefore has a higher probability of being sampled. It is well known that the asymptotic probability of a node being sampled is proportional to its degree [14]. Consequently, the sample mean of a random walk sample tends to overestimate the population average degree.

MHRW is reported to be rather good at obtaining a uniform random sample of random networks. However, in the sampling process many nodes are retrieved, examined, and rejected. The cost is rather high, especially for OSNs, where each node retrieval requires network traffic and there is usually a quota on daily accesses.

Even when uniform random samples are obtained, the sample mean estimator has a high variance because the degree distribution of OSNs usually follows a power law. Many nodes have small degrees, while some nodes have very large degrees. The inclusion or exclusion of a single very large node in a sample makes the estimates diverge.

When uniform random samples are hard to obtain, it is rather common to use PPS (Probability Proportional to Size) sampling and Hansen-Hurwitz related estimators [7]. In particular, the harmonic mean instead of the arithmetic mean of the sample can be used as the estimator of the average degree of an OSN:

$$\hat{d}_{H} = n\left[\sum_{i=1}^{n}\frac{1}{d_i}\right]^{-1} \tag{2}$$

Here the subscript H indicates that it is the harmonic mean; the estimator can be derived from the traditional Hansen-Hurwitz estimator as described in the next section. For this estimator the sample is obtained by simple random walk, so that the node selection probability is proportional to the node degree.

This estimator was first derived and studied in depth by Salganik et al. [22] to estimate the properties of hidden populations such as drug addicts. In that setting the true values are unknown and the assumptions, such as the sampling probability, are flimsy, so the veracity of the estimator is impossible to evaluate. In the context of OSNs, Kurant et al. [11, 5, 6] studied various sampling methods, including random walk, to discover network properties such as the distribution of node degrees. [5] studied the sampling of Facebook, in particular the Re-Weighted Random Walk, which can also be traced back to the Hansen-Hurwitz estimator. [11] mentioned the harmonic mean estimator, but stopped short of analyzing and comparing it. Rasti et al. [21] studied re-weighted random walk sampling in peer-to-peer networks. Both [5] and [21] compare their methods with Metropolis-Hastings random walk, not with uniform random samples. The comparison to uniform random samples was conducted in [10] for the estimation of the population size, not the average degree.

This is the first paper to show that, in a real large network, the harmonic mean estimator is much better than the sample mean estimator over uniform random samples, even ignoring the cost of obtaining the uniform samples.
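The claim that a simple random walk visits nodes with probability proportional to their degree, and that the plain sample mean of walked degrees therefore overestimates the average degree, can be checked on a toy graph. The sketch below is an illustration only: the four-node graph `TOY_GRAPH` and the function name `stationary_distribution` are assumptions made for this example and do not come from the paper. It iterates the walk's transition rule exactly rather than simulating random steps, so the output is deterministic.

```python
def stationary_distribution(adj, steps=200):
    """Iterate the simple-random-walk transition rule on an undirected graph.

    adj maps each node to the list of its neighbours. Starting from the
    uniform distribution, each step moves the probability mass of a node
    in equal shares to its neighbours.
    """
    nodes = sorted(adj)
    prob = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(steps):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            share = prob[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        prob = nxt
    return prob


# Hypothetical toy graph: a triangle 0-1-2 plus a pendant node 3 attached to 0.
# It is connected and not bipartite, so the walk converges.
TOY_GRAPH = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}

pi = stationary_distribution(TOY_GRAPH)
total_degree = sum(len(nbrs) for nbrs in TOY_GRAPH.values())

# Visit probabilities next to degree / (sum of degrees).
for node in sorted(TOY_GRAPH):
    print(node, round(pi[node], 6), len(TOY_GRAPH[node]) / total_degree)

# The true average degree versus the mean degree seen by the walk.
true_mean = total_degree / len(TOY_GRAPH)
walk_mean = sum(pi[u] * len(TOY_GRAPH[u]) for u in TOY_GRAPH)
print(true_mean, round(walk_mean, 6))
```

On this toy graph the walk sees a mean degree of 2.25 although the true average is 2.0, which is the size bias that the harmonic mean of Equation 2 is meant to remove.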
In practice, as demonstrated on the Twitter network, the random walk sample can be thousands of times smaller than the uniform random sample needed to achieve similar accuracy. In theory, the improvement is unbounded as the network size grows.

The contributions of this paper are:
1) The properties of the estimator (bias and variance) are analyzed and empirically verified in a large real network.
2) The advantage over uniform random samples is analyzed and compared. In particular, we found that in the Twitter data the estimator is much better: it has a very small bias, and its variance is orders of magnitude smaller than that of the sample mean estimator.
3) The cause is identified as the heterogeneity of the data induced by the scale-free nature of the network. The coefficient of variation is proposed to quantify the heterogeneity.
4) The accurate estimation of the average degree can lead to the discovery of a string of other network properties, such as the network size, the heterogeneity of the degrees, the threshold value for message diffusion, and the inequality of friendships in the network.

We want to emphasize that our method is not limited to the estimation of direct connections between users in an OSN. The average degree can be the average number of friends in the case of Facebook or LinkedIn, or the average number of followers and followees in Twitter and Weibo. In addition to such explicit graphs, where edges represent the following (or friend) relationships, OSNs contain implicitly derived graphs where an edge exists if two nodes share messages, groups, and so on, resulting in message networks and group networks. In a message network, two persons are linked if they shared a message. In a group network, two persons are connected if they belong to the same group. Thus, the degree can represent the direct connections to friends, the number of message reposts on the network, or the number of groups people are associated with.

2. ESTIMATORS

2.1 Sample mean estimator

Suppose that there are N users in the population.
Each user has a property $Y_i$, $i \in \{1, 2, \ldots, N\}$, which can be age, number of friends, number of messages, etc. Let the population total be $\tau = \sum_{i=1}^{N} Y_i$ and the population mean be $\bar{Y} = \tau/N$. Our task is to estimate $\bar{Y}$ using a sample. In particular, this paper focuses on the degree property, i.e., estimating the average degree $\bar{d}$ using a sample $\{d_1, d_2, \ldots, d_n\}$.

If a uniform random sample $Y_1, \ldots, Y_n$ is obtained, the sample mean is an unbiased estimator, defined as:

$$\hat{Y}_{SM} = \frac{1}{n}\sum_{i=1}^{n} Y_i \tag{3}$$

When $Y_i$ is the degree of node i, i.e., $Y_i = d_i$, the above equation becomes the sample mean estimator for degrees:

$$\hat{d}_{SM} = \frac{1}{n}\sum_{i=1}^{n} d_i \tag{4}$$

The variance of the estimator $\hat{d}_{SM}$ is [24]

$$\mathrm{var}(\hat{d}_{SM}) = \frac{N-n}{N}\,\frac{\sigma^2}{n} \tag{5}$$

where $\sigma^2$ is the population variance of the degrees, which can be calculated by

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} d_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} d_i\right)^2 = \frac{1}{N}\sum_{i=1}^{N} d_i^2 - \bar{d}^2 \tag{6}$$

where $\bar{d}$ is the arithmetic mean of all the degrees in the total population. The estimated variance of the estimator $\hat{d}_{SM}$ is

$$\widehat{\mathrm{var}}(\hat{d}_{SM}) = \frac{N-n}{N}\,\frac{s^2}{n} \tag{7}$$

where $s^2$ is the sample variance of $d_1, d_2, \ldots, d_n$.

The problem with the sample mean estimator is that a uniform sample is not easy to obtain. Moreover, the population variance $\sigma^2$, and consequently the estimator variance, are large due to the scale-free nature of the network. The degree distribution in online social networks follows a power law, or Zipf's law. That is, if we rank all the nodes according to their degrees in decreasing order $(d_1, d_2, \ldots, d_N)$, then

$$d_i = \frac{A}{i^{\alpha}}, \tag{8}$$

where A and α are constants. α is called the exponent or slope, and is typically around one in various scale-free networks. With such a degree distribution the population variance is very large, leading to a large variance of the sample mean estimator.

Suppose that α = 1, which is typical for many scale-free networks [9], including the Twitter network [12]. Then $\sigma^2$ can be approximated as follows by combining Equations 8 and 6:

$$
\begin{aligned}
\sigma^2 &= E(X^2) - E^2(X) = \left(\frac{E(X^2)}{E^2(X)} - 1\right)E^2(X) \\
&= \left(\frac{\frac{1}{N}\sum_{i=1}^{N} d_i^2}{\left(\frac{1}{N}\sum_{i=1}^{N} d_i\right)^2} - 1\right)\bar{d}^2
 = \left(\frac{N\sum_{i=1}^{N} i^{-2}}{\left(\sum_{i=1}^{N} i^{-1}\right)^2} - 1\right)\bar{d}^2 \\
&\approx \left(\frac{N}{\ln^2 N} - 1\right)\bar{d}^2
\end{aligned}
\tag{9}
$$

The last step uses $\sum_{i=1}^{N} 1/i \approx \ln N$ and treats $\sum_{i=1}^{N} 1/i^2$, which is bounded by $\pi^2/6$, as a constant of order one. Equation 9 shows that the variance does not converge when the network size N grows to the limit.

Note that there are two ways to describe a power law: the Zipfian (rank-based) approach used here, and the frequency of the degrees. The two are equivalent except that the exponent of the latter is greater by one.

2.2 Harmonic mean estimator

When the sampling probability is not equal for each unit, a common approach is to use Hansen-Hurwitz estimators. One of them estimates the population total [24]:

$$\hat{\tau}_{HH} = \frac{1}{n}\sum_{i=1}^{n}\frac{Y_i}{p_i}, \tag{10}$$

where $p_i$ is the selection probability of unit i, $\tau = \sum_{i=1}^{N} Y_i$ is the population total, and $\sum_{i=1}^{N} p_i = 1$. The selection probability of unit i is the probability that it is selected in one draw of the sample elements.
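The Hansen-Hurwitz total estimator of Equation 10 can be sketched in a few lines. This is a minimal illustration, assuming the selection probabilities of the sampled units are known; the function names and the three-unit toy population are hypothetical and not taken from the paper.

```python
def hansen_hurwitz_total(values, probs):
    """Equation 10: the mean of Y_i / p_i over a with-replacement sample.

    values[k] is the property Y of the k-th sampled unit and probs[k]
    is its single-draw selection probability p.
    """
    if len(values) != len(probs) or not values:
        raise ValueError("values and probs must be non-empty and equally long")
    return sum(y / p for y, p in zip(values, probs)) / len(values)


def expected_single_draw(population_values, population_probs):
    """Exact expectation of Y/p for one draw: sum over units of p_i * (Y_i / p_i)."""
    return sum(p * (y / p) for y, p in zip(population_values, population_probs))


# Hypothetical population of three units with total tau = 100.
population = [10.0, 20.0, 70.0]
unequal_probs = [0.5, 0.3, 0.2]

# One draw is already unbiased for the total, whatever the probabilities are.
print(expected_single_draw(population, unequal_probs))

# A with-replacement sample in which the third unit happened to be drawn twice.
print(hansen_hurwitz_total([10.0, 70.0, 70.0], [0.5, 0.2, 0.2]))
```

The first printed value recovers the total of 100 (up to floating-point rounding) because each term $p_i \cdot Y_i/p_i$ collapses to $Y_i$. The second value differs from 100, which illustrates that a single sample is only unbiased on average and that its spread depends on how the probabilities relate to the values.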
Note that the Hansen-Hurwitz estimator is used when sampling with replacement, i.e., a unit can be sampled multiple times, just as in random walk sampling. When $Y_i = 1$ for all $i \in \{1, 2, \ldots, N\}$, the above estimator reduces to another version of the Hansen-Hurwitz estimator, which estimates the total number of nodes $N = \sum_{i=1}^{N} Y_i$:

$$\hat{N}_{HH} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{p_i} \tag{11}$$

In our OSN case, samples are often obtained by random walk. It is well known that random walk obtains a biased sample. Asymptotically, the probability of a user being visited in a random walk is proportional to the user's degree, i.e., in the case of random walk,

$$p_i = \frac{d_i}{\sum_{j=1}^{N} d_j} = \frac{d_i}{\tau} \tag{12}$$

Therefore an estimator for the mean degree, $\hat{d}_H$, can be derived from the unbiased Hansen-Hurwitz estimator for N as follows:

$$\hat{d}_H = \frac{\tau}{\hat{N}_{HH}} = \tau\left[\frac{1}{n}\sum_{i=1}^{n}\frac{\tau}{d_i}\right]^{-1} = n\left[\sum_{i=1}^{n}\frac{1}{d_i}\right]^{-1} \tag{13}$$

The estimator for the arithmetic mean degree turns out to be the harmonic mean of the degrees in the sample. Salganik et al. [22] gave a similar derivation using the ratio of two estimators in the setting of respondent-driven sampling.

Although $\hat{N}_{HH}$ is an unbiased estimator, its inverse may not be unbiased. Cochran [3] showed that the bias is on the order of 1/n. Since the sample size in social network sampling is generally rather large, the bias is negligible.
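A minimal sketch of Equation 13 next to the plain sample mean is given below. The function names and the idealised sample are assumptions for illustration: the sample contains a node of degree d exactly d times, which mimics the degree-proportional selection of Equation 12 without any randomness.

```python
def harmonic_mean_degree(sampled_degrees):
    """Equation 13: n divided by the sum of reciprocal sampled degrees."""
    if not sampled_degrees:
        raise ValueError("sample must not be empty")
    return len(sampled_degrees) / sum(1.0 / d for d in sampled_degrees)


def sample_mean_degree(sampled_degrees):
    """Equations 1 and 4: the arithmetic mean of the sampled degrees."""
    if not sampled_degrees:
        raise ValueError("sample must not be empty")
    return sum(sampled_degrees) / len(sampled_degrees)


# Hypothetical population of four nodes with degrees 1, 2, 3 and 4
# (true average degree 2.5). In the idealised degree-proportional
# sample each node appears as many times as its degree.
population_degrees = [1, 2, 3, 4]
idealised_sample = [d for d in population_degrees for _ in range(d)]

print(sum(population_degrees) / len(population_degrees))  # true mean
print(harmonic_mean_degree(idealised_sample))             # harmonic mean of the biased sample
print(sample_mean_degree(idealised_sample))               # arithmetic mean of the biased sample
```

On this idealised sample the harmonic mean recovers the true average of 2.5 (up to rounding), while the arithmetic mean of the same sample gives 3.0, the inflated value a random walk would report if the bias were ignored.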

The variance of $\hat{N}_{HH}$ is

$$\mathrm{var}(\hat{N}_{HH}) = \frac{1}{n}\sum_{i=1}^{N} p_i\left(\frac{1}{p_i} - N\right)^2 \tag{14}$$

It can be estimated from a sample using

$$\widehat{\mathrm{var}}(\hat{N}_{HH}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{1}{p_i} - \hat{N}_{HH}\right)^2 \tag{15}$$

Using the Delta method, the variance of the estimator $\hat{d}_H$ is

$$\mathrm{var}(\hat{d}_H) = \frac{s_v^2}{n\,\bar{v}^4} \tag{16}$$

where $v_i = 1/d_i$, and $\bar{v}$ and $s_v^2$ are the sample mean and sample variance of the $v_i$'s. This equation will be used in calculating the error bound in Figure 2.

3. EXPERIMENTS

3.1 Data

The estimator is verified on the Twitter network data provided by Kwak et al. [12], which characterize the complete Twitter network as of July 2009. The data contain about 1.47 billion edges and 41.7 million nodes (users), occupying around 20 gigabytes of hard drive space. Since they are too large to fit into the memory of commodity computers, we index them using Lucene, a popular indexing engine. The random walk and the uniform random sampling are then performed on the index stored on the hard drive. Since our method is better suited to undirected graphs, we remove the edge directions in the Twitter data.

[Figure 1: Comparison of $\hat{d}_{SM}$ in UR (Uniform Random) sampling and $\hat{d}_H$ in RW (Random Walk) sampling. Panels A (for UR) and B (for RW) show how the estimate (vertical axis) fluctuates as the sample size (horizontal axis) increases. Panels C (for UR) and D (for RW) show the box plots of 100 estimations for sample sizes ranging between 500 and 8000.]

Table 1: Empirical bias and standard error of the two estimators over 100 runs for various sample sizes.

    Sample size    Bias (UR)    Bias (RW)    Standard error (UR)    Standard error (RW)
    500            2.6295       .444         08.054                 2.0539
    1000           -4.52        0.06         53.8785                8.7383
    2000           -0.8226      -0.0320      36.2923                5.6482
    4000           4.0328       -0.2842      45.0989                4.57
    8000           2.037        -0.0674      25.908                 2.7238

3.2 Results

Two estimators, $\hat{d}_{SM}$ in Equation 1 and $\hat{d}_H$ in Equation 13, are tested on the data for five different sample sizes: 500, 1000, 2000, 4000, and 8000.
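The delta-method variance of Equation 16, which supplies the error bounds used in these experiments, can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the sample variance uses the usual n - 1 divisor, and the 95% interval is built with the normal quantile z = 1.96, which the paper does not spell out.

```python
import math


def harmonic_mean_with_error(sampled_degrees, z=1.96):
    """Harmonic mean estimate with a delta-method error bound (Equation 16).

    Returns (estimate, estimated variance, (lower, upper)). The default
    z = 1.96 assumes a normal approximation for a 95% interval.
    """
    n = len(sampled_degrees)
    if n < 2:
        raise ValueError("at least two sampled degrees are needed")
    v = [1.0 / d for d in sampled_degrees]               # v_i = 1 / d_i
    v_bar = sum(v) / n                                   # sample mean of v
    s2_v = sum((x - v_bar) ** 2 for x in v) / (n - 1)    # sample variance of v
    estimate = 1.0 / v_bar                               # equals n / sum(1/d_i)
    variance = s2_v / (n * v_bar ** 4)                   # Equation 16
    half_width = z * math.sqrt(variance)
    return estimate, variance, (estimate - half_width, estimate + half_width)


# Tiny hypothetical sample, chosen so the arithmetic is exact in floating point.
estimate, variance, interval = harmonic_mean_with_error([1, 2, 4, 4])
print(estimate, variance, interval)
```

For the sample [1, 2, 4, 4] the reciprocals are 1, 0.5, 0.25 and 0.25, so $\bar{v} = 0.5$, $s_v^2 = 0.125$, the estimate is 2.0 and the estimated variance is $0.125 / (4 \cdot 0.5^4) = 0.5$. A sample of four is far too small for the normal approximation to be trusted; the numbers only trace the formula.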
For each sample size, 100 samples are selected using uniform random sampling and random walk sampling respectively. Their biases and standard errors are tabulated in Table 1. The table shows that $\hat{d}_H$ indeed has a very small bias, as expected. What is striking is that its standard error is much smaller than that of $\hat{d}_{SM}$.

We use Figure 1 to explain the result further. Panel C is the box plot for $\hat{d}_{SM}$ using uniform sampling. It shows that the estimate fluctuates considerably and can go as high as 1000 when n = 500, whereas the true mean is 70.5. The large variance is ameliorated slightly as the sample size grows, but remains large. On the other hand, the box plot for $\hat{d}_H$ in Panel D shows a much smaller variance. We also ran five large samples, each of size $4 \times 10^5$, as depicted in panels A (for UR) and B (for RW). Note that in the case of uniform random sampling the estimate jumps from time to time even when the sample size is rather large. Figure 2 shows four estimations bounded by the 95% confidence interval calculated by Equation 16.

[Figure 2: 95% confidence interval and four RW (Random Walk) estimation processes using the $\hat{d}_H$ estimator, plotted against sample size. The error bound is drawn from Equation 16.]

3.3 Discussions

This paper shows that biased sampling is much better than uniform sampling for the estimation of average degrees. In the past, people tried to obtain uniform samples whenever possible, and resorted to biased sampling such as PPS (Proportional To Size) sampling only when uniform sampling was impossible [22] or costly. The results of this paper suggest that, in the context of online social networks, random walk sampling should be used instead of uniform sampling, even when uniform random samples are readily accessible.

It is easy to understand that the variance of the uniform random estimator $\hat{d}_{SM}$ is large because online social networks are mostly scale-free, as shown in Equation 9. The smaller variance of $\hat{d}_H$ can be explained as follows. Let $d^W$ be the random variable for the degrees sampled by random walk. First we draw its empirical distribution and compare it with that of uniformly sampled degrees in Figure 3. Uniform random (UR) samples resemble the distribution of the total population [23], which obeys a power law with exponent around one. On the other hand, in the random walk (RW) sampling scheme $d^W$ has a flatter starting section and a drooping tail, which can be approximated by the Mandelbrot law:

$$d_i^W = \frac{B}{(a + i)^{b}} \tag{17}$$

where b is the exponent, B is a normalization constant, and a is a constant that corresponds to the position where the curve droops down.

[Figure 3: The degree distributions of the samples obtained from UR (Uniform Random) and RW (Random Walk) sampling, n = 500,000. The nodes, including the ones sampled repeatedly, are ranked in decreasing order of their degrees and drawn with degree against rank on log-log axes.]

Let

$$\bar{v} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{d_i^W}. \tag{18}$$

The variance of the reciprocal of the variable is

$$
\begin{aligned}
\mathrm{var}(1/d^W) &= \left(\frac{\frac{1}{n}\sum_{i=1}^{n}(i+a)^{2b}}{\left(\frac{1}{n}\sum_{i=1}^{n}(i+a)^{b}\right)^2} - 1\right)\bar{v}^2
 = \left(\frac{n\sum_{i=1}^{n}(i+a)^{2b}}{\left(\sum_{i=1}^{n}(i+a)^{b}\right)^2} - 1\right)\bar{v}^2 \\
&\approx \left(n\left[\frac{n^{2b+1}}{2b+1}\right]\left[\frac{n^{b+1}}{b+1}\right]^{-2} - 1\right)\bar{v}^2
 = \left(\frac{(b+1)^2}{2b+1} - 1\right)\bar{v}^2
\end{aligned}
$$

Thus $\mathrm{var}(1/d^W)$, relative to $\bar{v}^2$, is a constant that does not grow with the population size as $\sigma^2$ does.

4. IMPLICATIONS

The average degree plays a pivotal role in discovering other properties of a large network.
Its accurate estimation has ramifications for a string of other hidden properties of large networks. One immediate result is the total number of edges in the graph when the number of users is known. The more profound consequence, however, is that we can discover the heterogeneity of the entire network, measured by the CV (Coefficient of Variation), from a small sample using the average degree. The CV in turn yields other properties, such as the total number of users and the inequality of degrees (friends of friends and the Gini coefficient).

4.1 Estimate heterogeneity

$\hat{d}$ can be used to estimate the CV, or Coefficient of Variation (denoted as γ), which is an important metric of the heterogeneity of the degree distribution. It is defined as the standard deviation normalized by the average degree, so that $\gamma^2 = \sigma^2/\bar{d}^2$. Expanding the definition of the variance, we have

$$
\gamma^2 + 1 = \frac{\overline{d^2} - \bar{d}^2}{\bar{d}^2} + 1
 = \frac{1}{N}\sum_{i=1}^{N} d_i^2\left[\frac{1}{N}\sum_{i=1}^{N} d_i\right]^{-2}
 = N\left[\sum_{i=1}^{N} d_i\right]^{-2}\sum_{i=1}^{N} d_i^2
$$

On the other hand, the sample mean of the degrees obtained by random walk is

$$\bar{d}_W = \frac{1}{n}\sum_{i=1}^{n} d_i^W \approx \sum_{i=1}^{N} p_i d_i = \frac{1}{N\bar{d}}\sum_{i=1}^{N} d_i^2 \tag{19}$$

Combining the two equations, we derive the estimator for the CV as follows:

$$\hat{\gamma}^2 + 1 = \frac{\bar{d}_W}{\hat{d}}, \tag{20}$$

where $\bar{d}_W$ is the sample (arithmetic) mean of the degrees obtained by random walk, and $\hat{d}$ is the estimate of the average degree given by the harmonic mean of the same data. The convenience of the method is that only one random walk is needed. Figure 4 shows 5 estimates that converge quickly as the sample size grows.

[Figure 4: 5 estimation processes of $\gamma^2$ in the Twitter data using Equation 20, plotted against sample size. The red dotted line is the true value.]

4.2 Population estimation

Once $\gamma^2$ is available, it can be used to estimate the population size as follows, which is a special case of Eq 3.20 in [2]:

$$\hat{N} = (\gamma^2 + 1)\binom{n}{2}\frac{1}{C}, \tag{21}$$

where n is the sample size, C is the number of collisions, and the sample is obtained by random walk.²

In the area of capture-recapture research [2, 17, 15], population estimation for heterogeneous data whose capture probabilities are unequal has been a perplexing problem, mainly because of the difficulty of estimating the heterogeneity. In the setting of OSNs, the problem is now solved thanks to the estimator $\hat{d}_H$. Because of the accurate prediction of the heterogeneity of the data ($\gamma^2$), the estimation of the population size is rather good, as shown in Figure 5.

[Figure 5: 5 estimation processes of the number of Twitter accounts N using Equation 21, plotted against sample size. The red dotted line is the true value.]

Since this estimator hinges on the number of collisions, extra caution should be taken to avoid spurious collisions caused by the random walk.
For instance, if a node A is connected only to node B, a visit to A will cause node B to be visited twice. To avoid such loops, we take samples spaced a few steps apart.

4.3 Other properties

4.3.1 Friends of friends

$\gamma^2$ can also be used to measure the ratio between the number of friends of your friends and the number of your friends. As the saying goes, your friends have more friends than you do. To be more precise, your friends have $\gamma^2 + 1$ times as many friends as you do. The mean number of friends of friends is [4]

$$\frac{\sum_{i=1}^{N} d_i^2}{\sum_{i=1}^{N} d_i} = \bar{d} + \frac{\sigma^2}{\bar{d}} \tag{22}$$

² Here is a simple derivation of the estimator in Equation 21. The expected number of collisions is

$$E(C) = \binom{n}{2}\sum_{i=1}^{N} p_i^2 = \binom{n}{2}\frac{1}{\tau^2}\sum_{i=1}^{N} d_i^2 = \binom{n}{2}\frac{\gamma^2 + 1}{N}$$
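Equations 20 to 22 can be sketched together, since all three need only the node identifiers and degrees seen along one random walk. The code below is an illustration under assumptions: the function names are hypothetical, a collision is counted as a pair of sample positions holding the same node (matching the $\binom{n}{2}\sum p_i^2$ expectation in the footnote), and the toy sample is idealised rather than drawn from a real walk.

```python
from collections import Counter


def estimate_gamma2(rw_degrees):
    """Equation 20: gamma^2 = (arithmetic mean / harmonic mean) - 1 on one RW sample."""
    n = len(rw_degrees)
    arithmetic = sum(rw_degrees) / n
    harmonic = n / sum(1.0 / d for d in rw_degrees)
    return arithmetic / harmonic - 1.0


def count_collisions(sampled_ids):
    """Number of unordered pairs of sample positions that hold the same node."""
    return sum(c * (c - 1) // 2 for c in Counter(sampled_ids).values())


def estimate_population_size(sampled_ids, rw_degrees):
    """Equation 21: (gamma^2 + 1) * C(n, 2) / collisions."""
    n = len(sampled_ids)
    collisions = count_collisions(sampled_ids)
    if collisions == 0:
        raise ValueError("no collisions observed; a larger sample is needed")
    pairs = n * (n - 1) / 2
    return (estimate_gamma2(rw_degrees) + 1.0) * pairs / collisions


def friends_of_friends_mean(mean_degree, gamma2):
    """Equation 22 rewritten: d + sigma^2 / d = d * (1 + gamma^2)."""
    return mean_degree * (1.0 + gamma2)


# Idealised toy sample: nodes a, b, c, d with degrees 1, 2, 3, 4,
# each appearing as many times as its degree.
ids = ["a", "b", "b", "c", "c", "c", "d", "d", "d", "d"]
degrees = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

gamma2 = estimate_gamma2(degrees)
print(gamma2)                                  # heterogeneity estimate
print(count_collisions(ids))                   # repeated-node pairs
print(estimate_population_size(ids, degrees))  # size estimate
print(friends_of_friends_mean(2.5, gamma2))    # mean friends of friends
```

On this toy sample the heterogeneity estimate is 0.2, which equals the true $\sigma^2/\bar{d}^2 = 1.25/6.25$ of the four degrees, and the mean number of friends of friends comes out as 3.0. The size estimate is not meaningful for ten draws; in practice the samples would also be spaced several steps apart, as described above, before counting collisions.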

Equation 22 shows that, on average, your friends have no fewer friends than you have, since $\sigma^2/\bar{d} \ge 0$. Simply rearranging the equation results in:

$$\frac{\sum_{i=1}^{N} d_i^2 \big/ \sum_{i=1}^{N} d_i}{\bar{d}} = 1 + \frac{\sigma^2}{\bar{d}^2} = 1 + \gamma^2 \tag{23}$$

In words, the equation says that on average your friends have $1 + \gamma^2$ times as many friends as you do. In a homogeneous network where everybody has the same number of connections, γ = 0, and your friends have the same number of friends as you do. In the Twitter society, $\gamma^2$ is around 1000, so your friends have about a thousand times as many friends as you do.

4.3.2 Message diffusion

Along the same lines, $\gamma^2$ can be used to quantify the diffusion of messages, using results borrowed from epidemiology. In particular, it can be derived that the threshold for the occurrence of a large component, or the occurrence of epidemics [9] (Eq 7.8), is

$$\pi = \frac{(\gamma^2 + 1)\bar{d} - 2}{(\gamma^2 + 1)\bar{d} - 1}, \tag{24}$$

where π is the proportion of the nodes that are immunized uniformly at random in the network.

4.3.3 Clustering Coefficient

Some structural network properties can also be derived using $\gamma^2$. For instance, one important network property is the Clustering Coefficient, which indicates the probability that a friend of your friend is also your friend. It is hard to calculate directly for a large network, but can be estimated [19] (Eq 13.47) by

$$\bar{d}\,\gamma^4 / N. \tag{25}$$

4.3.4 Gini coefficient

The Gini index is used to measure the inequality of wealth. It can also be used to measure the inequality of friendships in OSNs. Using $\hat{d}$, the Gini coefficient can be approximated by

$$\hat{G} = \frac{1}{2n(n-1)\hat{d}}\sum_{i=1}^{n}\sum_{j=1}^{n}\left|d_i - d_j\right| \tag{26}$$

The classic problem of Gini coefficient estimation is that the mean is hard to obtain. Thanks to the estimation of the average degree, we find that the Gini coefficient of the Twitter network is around 0.70-0.82.

5. CONCLUSIONS

This paper proposes to use random walk to sample a network and to use the harmonic mean to estimate the average degree. The empirical experiments show that the estimator is much better than the sample mean, even when the latter is computed on uniform random samples. The method is very practical in that thousands or even hundreds of steps of random walk suffice to learn the average degree of a large network containing tens of millions of nodes and billions of edges.
The method works well because of the scale-free nature of the underlying network, where the variance tends to be very large and is potentially unbounded as the network size becomes infinitely large. For such networks, we showed analytically that the harmonic mean estimator removes the large variance problem. Therefore the estimator works not only for online social networks, but also for any scale-free networks, which are ubiquitous and more common than random networks. For instance, we also validated the estimator on a document-term graph, where documents and terms are nodes and a document is connected to a term if it contains that term.

The method relies on the assumption that random walk produces samples whose selection probability is proportional to their degrees. Theoretically this is true only asymptotically, so the samples taken before the mixing time should be thrown away. Our experiments show little difference whether or not the first batch of samples in the random walk is included.

The degree estimation is not only important by itself but also crucial for discovering other network properties. The successful estimation of the average degree can lead to the discovery of the heterogeneity of the underlying data, the number of users and links, etc. The method is not restricted to the degrees of explicit networks where the edges are friendship relations. Instead, the edges can be formed by other implicit relations, such as sharing the same message.

6. ACKNOWLEDGEMENTS

We thank the reviewers for their detailed comments, and acknowledge the support from NSERC (Natural Sciences and Engineering Research Council of Canada) and SSHRC (Social Sciences and Humanities Research Council of Canada).

7. REFERENCES

[1] Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. Journal of the ACM (JACM), 55(5):24, 2008.
[2] A. Chao, S. Lee, and S. Jeng. Estimating population size for capture-recapture data when capture probabilities vary by time and individual animal. Biometrics, pages 201-216, 1992.
[3] W. Cochran. Sampling techniques. Wiley-India, 2007.
[4] S. Feld. Why your friends have more friends than you do. American Journal of Sociology, pages 1464-1477, 1991.
[5] M. Gjoka, M. Kurant, C. Butts, and A. Markopoulou. A walk in Facebook: Uniform sampling of users in online social networks. arXiv preprint arXiv:0906.0060, 2009.
[6] M. Gjoka, M. Kurant, C. Butts, and A. Markopoulou. Practical recommendations on crawling online social networks. Selected Areas in Communications, IEEE Journal on, 29(9):1872-1892, 2011.
[7] M. Hansen and W. Hurwitz. On the theory of sampling from finite populations. The Annals of Mathematical Statistics, 14(4):333-362, 1943.
[8] M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. Computer Networks, 33(1-6):295-308, 2000.
[9] M. Jackson. Social and economic networks. Princeton Univ Pr, 2008.
[10] L. Katzir, E. Liberty, and O. Somekh. Estimating sizes of social networks via biased sampling. In Proceedings of the 20th international conference on World wide web, pages 597-606. ACM, 2011.
[11] M. Kurant, A. Markopoulou, and P. Thiran. Towards unbiased BFS sampling. Selected Areas in Communications, IEEE Journal on, 29(9):1799-1809, 2011.
[12] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, pages 591-600. ACM, 2010.
[13] J. Leskovec and C. Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 631-636. ACM, 2006.
[14] L. Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty, 2(1):1-46, 1993.
[15] J. Lu. Efficient estimation of the size of text deep web data source. In Proceeding of the 17th ACM conference on Information and knowledge management, pages 1485-1486, Napa Valley, California, USA, 2008. ACM.
[16] J. Lu. Ranking bias in deep web size estimation using capture recapture method. Data & Knowledge Engineering, 69(8):866-879, 2010.
[17] J. Lu and D. Li. Estimating deep web data source size by capture-recapture method. Information Retrieval, 13(1):70-95, 2010.
[18] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21:1087, 1953.
[19] M. Newman. Networks: an introduction. Oxford University Press, Inc., 2010.
[20] M. Papagelis, G. Das, and N. Koudas. Sampling online social networks. Knowledge and Data Engineering, IEEE Transactions on, (99):1, 2011.
[21] A. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger, and D. Stutzbach. Respondent-driven sampling for characterizing unstructured overlays. In INFOCOM 2009, IEEE, pages 2701-2705. IEEE, 2009.
[22] M. Salganik and D. Heckathorn. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology, 34(1):193-240, 2004.
[23] M. Stumpf, C. Wiuf, and R. May. Subnets of scale-free networks are not scale-free: sampling properties of networks. Proceedings of the National Academy of Sciences of the United States of America, 102(12):4221, 2005.
[24] S. Thompson. Sampling. Wiley, 2012.
[25] T. Wang, Y. Chen, Z. Zhang, T. Xu, L. Jin, P. Hui, B. Deng, and X. Li. Understanding graph sampling algorithms for social network analysis. In the 3rd ICDCS Workshop on Simplifying Complex Networks for Practitioners, 2011.
[26] C. Wejnert and D. Heckathorn. Web-based network sampling. Sociological Methods & Research, 37(1):105-134, 2008.