Online Appendix to Social Network Formation and Strategic Interaction in Large Networks




Euncheol Shin
Recent Version: http://people.hss.caltech.edu/~eshin/pdf/dsnf-oa.pdf
October 3, 25

Abstract

In this online appendix, I present a continuous time approach to my model with the parametric form $\Phi(d) = d^{\alpha}$. The resulting degree distribution is the Weibull distribution (Weibull, 1951). I fit my model to four empirical degree distributions by maximum likelihood estimation, and I compare its goodness-of-fit with that of three other prominent degree distributions, generated by the models of Erdős and Rényi (1959), Barabási et al. (1999), and Jackson and Rogers (2007). I use the Kolmogorov-Smirnov statistic to identify the best-fitting distribution.

1 Models and Distributions

1.1 Continuous Time Approach

To obtain a tractable form of the resulting degree distribution, I apply a continuous time approach to my model. (See Chapter 5 in Jackson (2010) and references therein for examples and discussions of this approach.) In this approach, a node's degree changes deterministically at a rate proportional to its expected change. Specifically, the degree of node $i$ changes according to the following ordinary differential equation:

$$\frac{d\,d_i(t)}{dt} = \frac{d_i(t)}{\mu t} \cdot \frac{1}{d_i(t)^{\alpha}}, \tag{1.1}$$

where $d_i(t)$ is the degree of node $i$ at time $t$, $\mu > 0$ is the parameter that represents the rate of new nodes entering the network, and $\alpha > 0$ is the parameter that captures the cost
of forming new links. The first term shows that higher degree nodes have a greater chance of being identified by newborn nodes, and the second term shows that higher degree nodes are more selective about additional links. When $\alpha = 1$, the probability that node $i$ forms additional links is independent of its degree. When $\alpha > 1$, however, node $i$ is less likely to form additional links as its degree increases.

Solving the differential equation (1.1) gives

$$d_i(t)^{\alpha} - d_i(i)^{\alpha} = \frac{\alpha}{\mu} \log\left(\frac{t}{i}\right).$$

To find an expression for the complementary cumulative degree distribution, I identify the proportion of nodes whose degrees exceed a given degree $d$ at time $t$. For this, it suffices to find the node that has exactly degree $d$ at time $t$, because all nodes born before that node have larger degrees. Let $i_t(d)$ be the node that has degree $d$ at time $t$: $d_{i_t(d)}(t) = d$. In addition, let $d_0$ be node $i_t(d)$'s initial degree: $d_{i_t(d)}(i_t(d)) = d_0$. Then, I have

$$\frac{i_t(d)}{t} = e^{-\left((\lambda d)^{\alpha} - (\lambda d_0)^{\alpha}\right)}, \quad \text{where } \lambda = \left(\frac{\mu}{\alpha}\right)^{1/\alpha}.$$

Finally, by setting $d_0 = 0$, the complementary cumulative degree distribution takes the form of a Weibull distribution:

$$1 - F(d; \lambda, \alpha) = e^{-(\lambda d)^{\alpha}}.$$

The hazard rate function of the Weibull distribution is increasing if and only if the parameter $\alpha$ is larger than one. The following proposition summarizes the results.

Proposition 1. The resulting complementary cumulative degree distribution has the form $1 - F(d; \lambda, \alpha) = e^{-(\lambda d)^{\alpha}}$, where $\lambda = (\mu/\alpha)^{1/\alpha}$. The hazard rate function is strictly increasing if and only if $\alpha > 1$.

1.2 Other Models and Distributions

Poisson model. The uniform random network formation model by Erdős and Rényi (1959) generates the Poisson distribution as the resulting degree distribution. In their model, a link between two nodes is formed with a fixed probability, independently of all other pairs of nodes. The resulting degree distribution is approximated by the Poisson distribution when the
network size is sufficiently large. Specifically, the Poisson distribution with parameter $\rho > 0$ has the form

$$f(d; \rho) = \frac{e^{-\rho} \rho^{d}}{d!}.$$

The hazard rate of the Poisson distribution is strictly increasing for any value of $\rho$.

Preferential attachment model. The scale-free distribution is another prominent degree distribution, generated by the preferential attachment (PA) model (Barabási et al., 1999). In the PA model, a newborn node forms links with existing nodes probabilistically, and the probability of selecting a particular existing node is proportional to its degree. The resulting degree distribution is the scale-free distribution, which has the form $f(d; \gamma) = c\,d^{-\gamma}$, where $\gamma > 1$ is the scale parameter, and $c > 0$ is the normalization factor, calculated as $c = 1 / \sum_{d = d_{\min}}^{\infty} d^{-\gamma}$ when $d_{\min}$ is the minimum degree of a network. [2] Assuming that the minimum degree is one, the complementary cumulative degree distribution has the form

$$1 - F(d; \gamma) = \frac{\zeta(\gamma, d)}{\zeta(\gamma)},$$

where $\zeta(\gamma) = \sum_{k=1}^{\infty} k^{-\gamma}$ is the Riemann zeta function, and $\zeta(\gamma, d) = \sum_{k=1}^{\infty} (k + d)^{-\gamma}$ for $d \geq 1$. Note that the Riemann zeta function is well-defined if and only if $\gamma > 1$.

[2] Its name originates from the fact that for any two degrees of a fixed ratio, their probability ratios are independent of the scale of degrees: $f(d)/f(d') = f(sd)/f(sd')$ for all $d, d' \in \mathbb{N}$ and $s \in \mathbb{R}_{+}$. A scale-free distribution can be defined either over continuous real numbers or over discrete positive integers. Although a continuous distribution intuitively explains its various properties, estimation based on it generates systematic errors (Clauset et al., 2009).

Network-based search model. The network-based search model by Jackson and Rogers (2007) (henceforth, the JR model) produces another prominent degree distribution. In the JR model, nodes form links in two ways: (i) uniformly at random, and (ii) by searching locally through the given structure of the network. There are two parameters, $m > 0$ and $r > 0$: $m$ is the expected number of links that a new node forms, and $r$ is the ratio of the number of links formed uniformly at random to the number formed through network-based meetings.
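The closed-form solution of equation (1.1) and the hazard-rate claims for the Weibull and Poisson distributions can be checked numerically. The following is a minimal sketch, not part of the original appendix: the function names, the RK4 integrator, the truncated Poisson tail, and the parameter values are all illustrative assumptions. The Weibull hazard $\alpha \lambda^{\alpha} d^{\alpha - 1}$ used here is the standard hazard implied by the complementary cumulative distribution $e^{-(\lambda d)^{\alpha}}$.

```python
import math


def closed_form_degree(t, i, d_init, mu, alpha):
    """Degree at time t of a node born at time i, from the solved ODE:
    d(t)**alpha - d(i)**alpha = (alpha / mu) * log(t / i)."""
    return (d_init ** alpha + (alpha / mu) * math.log(t / i)) ** (1.0 / alpha)


def rk4_degree(t, i, d_init, mu, alpha, steps=20000):
    """Integrate dd/dt = d**(1 - alpha) / (mu * t) from time i to time t
    with a classical fourth-order Runge-Kutta scheme (requires d_init > 0)."""
    def rate(s, d):
        return d ** (1.0 - alpha) / (mu * s)

    h = (t - i) / steps
    s, d = float(i), float(d_init)
    for _ in range(steps):
        k1 = rate(s, d)
        k2 = rate(s + h / 2, d + h * k1 / 2)
        k3 = rate(s + h / 2, d + h * k2 / 2)
        k4 = rate(s + h, d + h * k3)
        d += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        s += h
    return d


def weibull_hazard(d, lam, alpha):
    """Hazard rate implied by the CCDF exp(-(lam * d)**alpha)."""
    return alpha * lam ** alpha * d ** (alpha - 1.0)


def poisson_hazard(d, rho, tail_terms=200):
    """Discrete hazard P(D = d) / P(D >= d); the tail sum is truncated."""
    def pmf(k):
        # Evaluated in logs so that large k does not overflow.
        return math.exp(-rho + k * math.log(rho) - math.lgamma(k + 1))

    return pmf(d) / sum(pmf(k) for k in range(d, d + tail_terms))


if __name__ == "__main__":
    mu, alpha = 2.0, 1.5  # illustrative values, not estimates from the paper
    print("closed form:", closed_form_degree(50.0, 1.0, 1.0, mu, alpha))
    print("RK4        :", rk4_degree(50.0, 1.0, 1.0, mu, alpha))
    print("Weibull hazard, alpha=1.5:", [weibull_hazard(d, 0.3, 1.5) for d in (1, 2, 4)])
    print("Weibull hazard, alpha=0.7:", [weibull_hazard(d, 0.3, 0.7) for d in (1, 2, 4)])
    print("Poisson hazard, rho=5    :", [poisson_hazard(d, 5.0) for d in (0, 5, 10)])
```

The integration starts from a strictly positive initial degree because the right-hand side of (1.1) is singular at zero when $\alpha > 1$. With $\alpha = 1$ the Weibull hazard in this sketch is constant, which is the exponential case.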

Notice that $r$ can be infinity, which means that the network formation is purely random. When $r < \infty$, the resulting complementary cumulative degree distribution has the form

$$1 - F(d; r, m) = \left(\frac{d_0 + rm}{d + rm}\right)^{1 + r},$$

where $d_0$ is the parameter that gives the number of links that each node forms upon its birth. Following Jackson and Rogers (2007), I set $d_0 = 0$ for the later parameter estimation.

2 Fitting Models to Datasets

2.1 Datasets

I consider the following four real network datasets: (a) the 75 social networks of rural Indian villages, (b) a collaboration network of jazz musicians, (c) an online friendship network of Facebook users, and (d) the network among webpages at Notre Dame University. [3] Figure 1 plots their empirical hazard rate functions. I present the hazard rate function of village #58 as an illustrative example. The empirical hazard rates exhibit increasing patterns for datasets (a) and (b), but decreasing patterns for datasets (c) and (d).

[3] I obtain dataset (a) from Banerjee et al. (2013), dataset (b) from Gleiser and Danon (2003), dataset (c) from McAuley and Leskovec (2012), and dataset (d) from Jackson and Rogers (2007). All datasets are publicly available from the corresponding authors' webpages.

[Figure 1: Different patterns of the hazard rate function across datasets. Panels: (a) Indian village #58, (b) Jazz musicians, (c) Facebook users, (d) Webpages. Vertical axis: hazard rate.]

2.2 Estimation Strategy and Comparison

Estimation strategies. I fit the four theoretical degree distributions generated by network formation models to the empirical degree distributions of the above four network datasets. I
explain how I estimate the model parameters. First, I estimate the parameters of the Weibull distribution by maximum likelihood estimation, which is well-documented in the statistics literature (e.g., Rinne, 2008). Specifically, for a given dataset $\{d_1, \ldots, d_n\}$, where $d_i$ is the degree of node $i \in \{1, \ldots, n\}$, the maximum likelihood estimator of $\alpha$ is a solution of

$$\frac{1}{\alpha} = \frac{\sum_{i=1}^{n} d_i^{\alpha} \log d_i}{\sum_{i=1}^{n} d_i^{\alpha}} - \frac{1}{n} \sum_{i=1}^{n} \log d_i.$$

One can find a solution of the above equation by a standard iterative procedure (e.g., the Newton-Raphson method). The solution is unique because the left-hand side is strictly decreasing in $\alpha$, whereas the right-hand side is strictly increasing in $\alpha$.

Second, I estimate the parameter $\rho > 0$ of the Poisson distribution by maximum likelihood estimation. For a given dataset $\{d_1, \ldots, d_n\}$, the maximum likelihood estimator $\hat{\rho}$ is the average degree:

$$\hat{\rho} = \frac{1}{n} \sum_{i=1}^{n} d_i.$$

Third, I estimate the scale parameter $\gamma$ by maximum likelihood estimation (Clauset et al., 2009), which solves the following equation:

$$-\frac{\zeta'(\gamma)}{\zeta(\gamma)} = \frac{1}{n} \sum_{i=1}^{n} \log d_i.$$

A solution of the above equation always exists and is unique because the left-hand side is strictly decreasing in $\gamma$, whereas the right-hand side is constant.

Finally, for the JR model, I estimate the two parameters $m$ and $r$ according to the iteration method suggested by Jackson and Rogers (2007). To begin with, estimate $m$ as the average degree from a dataset. To estimate $r$, note first that

$$\log\left(1 - F(d; r, m)\right) = (1 + r) \log(rm) - (1 + r) \log(d + rm).$$

I use a method of iterative least squares as follows. Start with an initial value of $r$, say $r^{0}$, and plug in this value to obtain $\log(d + r^{0} m)$. Estimate $(1 + r)$ by least squares to obtain $r^{1}$. Repeat the same procedure until a fixed point $r^{*}$ is achieved. If
the procedure converges, the fixed point is unique regardless of the initial value $r^{0}$; otherwise, set $r = \infty$. When $r = \infty$, use the exponential distribution as the limit degree distribution.

Goodness-of-fit. To compare the data-fitting performance of the models, I use the Kolmogorov-Smirnov (KS) statistic (Smirnov, 1944). [4] The KS statistic is calculated as follows. Let $\{d_1, \ldots, d_n\}$ be a given dataset. The empirical cumulative degree distribution $F_n(\cdot)$ is

$$F_n(d) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{d_{\min} \leq d_i \leq d\}.$$

For each model, let $F(\cdot; \hat{\theta})$ be the estimated cumulative degree distribution under the estimated parameter $\hat{\theta}$. The KS statistic is defined as

$$\sup_{d} \left| F_n(d) - F(d; \hat{\theta}) \right|.$$

By the Glivenko-Cantelli theorem, the KS statistic converges almost surely to zero as the dataset size $n$ becomes large if the dataset is generated according to the given model (Mood et al., 1974). Thus, a smaller value of the KS statistic indicates a better fit of the model to the data. Notice that this convergence is independent of the increasing hazard rate property of the true underlying degree distribution.

[4] See the corresponding chapter in Mood et al. (1974) for details of this test.

2.3 Results

I first introduce the key parameters of the models. In my model, $\alpha > 0$ is the key parameter that distinguishes my model from the other models; the other parameter $\lambda > 0$ provides no information beyond the average degree of the network, which is also available from the other three models. Recall that $\alpha$ indicates the behavior of the hazard rate function. In the Poisson model, there is only one parameter $\rho > 0$. There is also a single parameter $\gamma > 1$ in the PA model. In the JR model, the key parameter is $r \in (0, \infty]$, the ratio of uniformly random meetings to network-based meetings; the other parameter $m$ determines the average degree of the network. When the estimate for $r$ is infinity, I report the KS statistic
by using an exponential distribution as the limit distribution. [5] In Table 1, I report the estimates for the above four parameters and the KS statistics across datasets. The first and second lowest KS statistics are marked separately in Table 1. [6] Since there are 75 networks in the Indian villages dataset, I present the minimum and the maximum values of the estimates and the KS statistics. Figure 2 illustrates the goodness-of-fit of the models across the datasets.

Table 1: Parameter estimates and the goodness-of-fit of the models across datasets (markers for the lowest KS statistics not preserved)

Dataset          Row            My model     Poisson model   PA model      JR model
                                (α > 0)      (ρ > 0)         (γ > 1)       (r > 0)
Indian villages  Estimate       [.4, 2.39]   [3.8, 7.4]      [.49, 2.39]   [2.38, ∞]
                 KS statistic   [.5, .59]    [.35, .5]       [.3, .443]    [.98, .456]
Jazz musicians   Estimate       .57          27.7            .28
                 KS statistic   .42          .38             .39           .38
Facebook users   Estimate       .96          43.25           .27           2.54
                 KS statistic   .53          .538            .343          .33
Websites         Estimate       .74          4.2             .98           .57
                 KS statistic   .248         .53             .6            .23

[Figure 2: An illustration of the goodness-of-fit across datasets and models. Panels: (a) Indian village #58, (b) Jazz musicians, (c) Facebook users, (d) Webpages; each panel shows my model, the Poisson model, the JR model, and the PA model. In each plot, the horizontal axis represents degrees, and the vertical axis represents the empirical (the red circles) and estimated (the black line) cumulative degree distributions. For each model, the KS statistic measures the maximum difference between the two distributions.]

[5] The exponential distribution is a special case of the Weibull distribution with $\lambda = m/2$ and $\alpha = 1$.

[6] The estimate for the network of websites using the PA model differs from the value reported by Barabási and Albert (1999). The difference arises because they use ordinary least squares estimates assuming a continuous scale-free distribution, whereas I use maximum likelihood estimation assuming a discrete scale-free distribution.

First of all, I find that all the estimates for the parameter $\alpha$ are consistent with simple eyeball tests using Figure 1. Specifically, the estimates for the networks of Indian villages and jazz musicians are all strictly greater than one, which confirms that their hazard rates are increasing. Since the KS statistics of my model are very low for these networks, the inference about increasing hazard rates is well supported. In contrast, for the networks of Facebook users and websites, the estimates are all strictly smaller than one.

For the 75 social networks of Indian villages, my model and the Poisson model fit the datasets better. Although the lowest KS statistic of the JR model is .98, the second lowest KS statistic is .7, which is strictly greater than the largest KS statistic of my
model. Moreover, for 71 of the social networks, the estimates for $r$ are infinity. [7] This implies that the JR model loses its explanatory power, because $r = \infty$ corresponds to the random attachment model.

[7] Only villages #3, #9, #3, and #36 return finite estimates for $r$.

The difference between my model and the Poisson model becomes clearer when I compare their KS statistics for the networks of jazz musicians and Facebook users. My model still fits both datasets well (the KS statistics are about .5), but the Poisson model returns very large KS statistics that are greater than .3. This difference is also visible in the shape of the fitted cumulative distributions in Figure 2(b) and 2(c). The distributions fitted by the Poisson model are S-shaped, but the empirical distributions show different patterns. In contrast, the shape of the distributions fitted by my model is flexible, and my model performs well for the networks of jazz musicians and Facebook users.

The main source of this flexibility is that my model contains two parameters, $\lambda$ and $\alpha$, which determine the scale and the shape of the distribution. For example, the jazz musician network is large in that its average degree is about 27. In my model, the scale parameter $\lambda$ explains the average degree, and the shape parameter $\alpha$ explains the shape of the empirical distribution. However, there is only one parameter $\rho$ in the Poisson model, which indicates only the scale of the distribution. Thus, when the network is large, my model outperforms the Poisson model even when the empirical hazard rate function exhibits increasing patterns.

Finally, for the network of webpages, the PA model performs significantly better than any other model. Its KS statistic is very small (.6), which is the lowest value among all the KS statistics across datasets and models. In comparison to my model, this dominant performance of the PA model is not surprising: my model tries to explain networks with bilateral and costly link formation, but the network of webpages is based on unilateral and costless link formation. Therefore, I conclude that my model fits empirical degree distributions very well whenever link formation is bilateral and costly.
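The Weibull estimation step and the KS comparison described in Section 2.2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's code: the function names are hypothetical, the likelihood equation for $\alpha$ is solved by bisection instead of Newton-Raphson, the scale uses the standard first-order condition $\lambda^{\alpha} = n / \sum_i d_i^{\alpha}$ (not stated in the text above), the supremum in the KS statistic is taken over real $d$ for a continuous model distribution, and the data are synthetic draws, not any of the four network datasets.

```python
import math
import random


def fit_weibull(degrees, lo=1e-3, hi=50.0, tol=1e-10):
    """Return (alpha_hat, lambda_hat) for the CCDF exp(-(lambda * d)**alpha).

    alpha_hat solves 1/alpha = sum(d^a log d) / sum(d^a) - mean(log d)
    by bisection. All degrees must be strictly positive.
    """
    logs = [math.log(d) for d in degrees]
    n = len(logs)
    mean_log = sum(logs) / n
    max_log = max(logs)

    def gap(alpha):
        # Right-hand side minus left-hand side of the likelihood equation.
        # Weights d_i**alpha are rescaled by the largest one to avoid overflow.
        weights = [math.exp(alpha * (x - max_log)) for x in logs]
        weighted_mean = sum(w * x for w, x in zip(weights, logs)) / sum(weights)
        return weighted_mean - mean_log - 1.0 / alpha

    if gap(lo) >= 0 or gap(hi) <= 0:
        raise ValueError("no root of the likelihood equation in [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)

    # Scale from lambda**alpha = n / sum(d_i**alpha), evaluated in logs.
    log_sum = alpha * max_log + math.log(
        sum(math.exp(alpha * (x - max_log)) for x in logs)
    )
    lam = math.exp((math.log(n) - log_sum) / alpha)
    return alpha, lam


def ks_statistic(degrees, cdf):
    """sup_d |F_n(d) - F(d)| for a continuous model CDF.

    The supremum is attained at a data point, just before or at the jump
    of the empirical distribution, so both sides of each jump are checked.
    """
    xs = sorted(degrees)
    n = len(xs)
    worst = 0.0
    for k, x in enumerate(xs, start=1):
        model = cdf(x)
        worst = max(worst, k / n - model, model - (k - 1) / n)
    return worst


if __name__ == "__main__":
    random.seed(0)
    # random.weibullvariate takes (scale, shape); scale = 1 / lambda here.
    sample = [random.weibullvariate(1 / 0.5, 1.5) for _ in range(5000)]
    alpha_hat, lam_hat = fit_weibull(sample)
    ks = ks_statistic(sample, lambda d: 1.0 - math.exp(-((lam_hat * d) ** alpha_hat)))
    print("alpha_hat:", alpha_hat, "lambda_hat:", lam_hat, "KS:", ks)
```

Bisection is used because the difference between the two sides of the likelihood equation is monotone in $\alpha$, so a sign change brackets the unique root. For integer-valued degree data and discrete model distributions, the KS routine would need the left limit of the model distribution as well, which this sketch does not handle.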

References

Banerjee, A., Chandrasekhar, A. G., Duflo, E., and Jackson, M. O. (2013). The diffusion of microfinance. Science, 341(6144).

Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 285.

Barabási, A.-L., Albert, R., and Jeong, H. (1999). Mean-field theory for scale-free random networks. Physica A, 272:173-187.

Clauset, A., Shalizi, C. R., and Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4):661-703.

Erdős, P. and Rényi, A. (1959). On random graphs I. Publicationes Mathematicae Debrecen, 6:290-297.

Gleiser, P. M. and Danon, L. (2003). Community structure in jazz. Advances in Complex Systems, 6(565).

Jackson, M. O. (2010). Social and economic networks. Princeton University Press.

Jackson, M. O. and Rogers, B. W. (2007). Meeting strangers and friends of friends: how random are social networks? American Economic Review, 97(3):890-915.

McAuley, J. and Leskovec, J. (2012). Learning to discover social circles in ego networks. Advances in Neural Information Processing Systems 25, pages 539-547.

Mood, A. M., Graybill, F. A., and Boes, D. C. (1974). Introduction to the theory of statistics. McGraw-Hill series in probability and statistics. McGraw-Hill College.

Rinne, H. (2008). The Weibull distribution: A handbook. Chapman and Hall/CRC, 1st edition.

Smirnov, N. V. (1944). Approximate laws of distributions of random variables from empirical data. Uspekhi Mat. Nauk, 10:179-206.

Weibull, W. (1951). A statistical distribution function of wide applicability. Journal of Applied Mechanics, 18(3):293-297.