Collecting Network Data in Surveys

Arun Advani and Bansi Malde
September 2013
We gratefully acknowledge funding from the ESRC-NCRM Node Programme Evaluation for Policy Analysis, Grant reference RES-576-25-0042.

Objectives
- Lots of interest in collecting data on social networks, or on measures related to social environments (e.g. social norms, social pressure, etc)
- Interest in answering questions of the following type:
  - What features of social networks influence the adoption and usage of new technologies?
  - Does the effectiveness of policy interventions depend on the underlying social structure?
  - Does information on new health practices given to a small subset of the village diffuse to the rest of the village? How does this diffusion take place?
- Want to make sure we collect the best possible data for this purpose, but within budget constraints

Meeting Structure
- What do we know about collecting networks data?
- Findings from Advani and Malde (2013), which surveys the literature on empirical methods for identifying and estimating social effects using network data
  - By network data, we mean information on the exact mapping of connections (which allows you to construct a network graph)
- Special focus on:
  - Options for collecting network data
  - Measurement error in network measures

Advani and Malde (2013): Network Methods Review
Brings together and reviews the literature on empirical methods for identifying and estimating social effects using network data.
1. First considers how networks data can be collected, and common sampling methods
2. Methods for estimating social effects, taking the network to be pre-determined or exogenously formed
3. Dealing with endogenous network formation
4. Measurement error in network data (sampling induced and other)
Will only present findings related to (1) and (4).

Why we may want to collect detailed network data
- Eases identification
  - Can get around the reflection problem quite easily
- Can get more detailed measures of features of the underlying social structure that influence outcomes
  - For example, network centrality measures offer a way of identifying key households in a network
  - Expansiveness of a network measures its fragility (i.e. how likely it is to break into two or more factions)
- Impose weaker assumptions on the underlying interaction structure than with more aggregated data
  - E.g. the assumption that all members of a peer group are uniformly linked with one another

Collecting Networks Data
- Involves collecting information on two inter-related objects, nodes and edges (or links), within a pre-defined network
  - Nodes = individuals, households, firms (i.e. the economic agent of interest)
  - Edges/Links = connections between nodes (e.g. financial transactions, family links, friendships, neighbours, etc)
- First need to decide on the network to measure
  - Depends on the social dimension of interest: friends, family, households one transacts/chats with, etc
  - Careful thinking needed on whether the measured network is the one most relevant to the question of interest
- In practice, researchers need to impose geographic boundaries on the scope of the network
  - No clear guidance on how to choose this

Collecting Networks Data
Common ways of collecting networks data include:
- Direct elicitation from nodes
- Collection from existing sources
- Imputing links from existing information on, e.g., group memberships, naming conventions, etc
- Spatial networks constructed from GIS data

Direct Elicitation from Nodes
- Ask nodes to list all nodes they interact with on a given dimension (e.g. borrow/lend rice)
  - A variant asks them to list all interactions with nodes on a specified list
  - A related method asks nodes for information on their links, and on the connections of their links
- A few other things to decide on:
  - What network to elicit information on
  - Actual or potential interactions, e.g. who a household borrowed rice from, or who it could borrow rice from if needed
  - Existence of links only, or strength of links
    - Existence, e.g. "Do you chat with [X]?"
    - Strength, e.g. "How often do you chat with [X]?"
  - Recall duration: trade-off between recall error and frequency of interaction

Collecting Networks Data
Types of data that one could have, in terms of network coverage, include:
- Census of all nodes and links
- Census of all nodes, and sample of links from each node
- Random sampling
  - Random sample of nodes and all links between sampled nodes (induced subgraph)
  - Random sample of nodes and all links of sampled nodes (star subgraph)
  - Random sample of links and nodes of sampled links
- Link tracing and snowball sampling
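The difference between the induced and star subgraphs can be illustrated with a minimal sketch. The network, the sampled nodes and the function names below are hypothetical, chosen only for illustration; links are treated as undirected and stored as pairs (i, j) with i < j.

```python
def induced_subgraph(edges, sampled):
    """Keep only links whose two endpoints were both sampled."""
    return {(i, j) for (i, j) in edges if i in sampled and j in sampled}


def star_subgraph(edges, sampled):
    """Keep every link that has at least one sampled endpoint."""
    return {(i, j) for (i, j) in edges if i in sampled or j in sampled}


# Hypothetical undirected network on nodes 1..6.
edges = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)}
sampled = {1, 3}  # hypothetical sample of nodes

induced = induced_subgraph(edges, sampled)
star = star_subgraph(edges, sampled)

print(sorted(induced))  # links among sampled nodes only
print(sorted(star))     # all links reported by sampled nodes
```

In this example the star subgraph also reveals non-sampled nodes (2 and 4) through the links of sampled nodes, while the induced subgraph does not.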

Census of all nodes and all links
Collect information on the full network
Pros:
- Obtain accurate picture of the whole network, including local neighbourhoods
- Allows accurate computation of any desired network statistics
Cons:
- Expensive
- Infeasible for large networks

Census of nodes and sample of links
Information on all nodes, but only a (possibly non-random) sample of links
- Arises from censoring of the number of links that can be reported in a survey
- Common in practice
Pros:
- Relatively accurate measure of the network, and typically possible to do some correction for censoring, depending on its extent
Cons:
- Relatively expensive
- Infeasible for large networks
- Some measurement error introduced by censoring

Random Sample of Nodes or Links
Construct network based on links collected from a random sample of nodes
- Similarly for random sampling of links (ignored here since not commonly used in Economics)
- In practice, the random sample of nodes is drawn from some list of all nodes in the network, e.g. a census of households in a village
Pros:
- Cost-effective
Cons:
- Whole network structure not observed, which implies measurement error in network structure
- Generates a non-random sample of links, which implies the measurement error is non-classical
- Standard statistical results for inference may not hold

Snowball Sampling
Used to obtain data on hard-to-reach populations. Data are collected via the following process:
- Start with an initial (possibly random) sample of nodes from the sample of interest, N_0
- Elicit all or a sample of the links of these nodes, denoted L_1. New nodes are called N_1
- Then interview the nodes the initially sampled nodes (N_0) report being linked to (N_1) and elicit their links, L_2
- And so on until a specific target is reached
Pros:
- Cheap way of getting accurate information on a node's local network neighbourhood
- Useful for hard-to-reach populations
Cons:
- Even if you start with a random initial sample of nodes, you end up with a non-random sample of nodes and links
- Whole network usually not sampled, which implies measurement error in network structure
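The wave-by-wave process can be sketched as follows. This is a minimal illustration, not a survey protocol: the network, the seed node and the stopping rule (a fixed number of waves, `max_waves`) are assumptions, and all links of each interviewed node are elicited.

```python
def snowball_sample(neighbours, seeds, max_waves):
    """Return the node sets [N_0, N_1, ...] reached in successive waves."""
    visited = set(seeds)
    frontier = set(seeds)
    node_waves = [set(seeds)]
    for _ in range(max_waves):
        # Elicit the links of the nodes interviewed in the previous wave.
        named = set()
        for node in frontier:
            named.update(neighbours.get(node, ()))
        new_nodes = named - visited  # nodes not reached in any earlier wave
        if not new_nodes:
            break
        node_waves.append(new_nodes)
        visited |= new_nodes
        frontier = new_nodes
    return node_waves


# Hypothetical network: each node maps to the nodes it reports being linked to.
neighbours = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4, 6], 6: [5]}

node_waves = snowball_sample(neighbours, seeds={1}, max_waves=2)
print(node_waves)
```

Nodes 5 and 6 are never reached after two waves, which illustrates why the whole network is usually not sampled.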

Measurement Error and why we should worry about it
Two types of measurement error:
- Sampling induced
  - Measuring two interrelated objects: nodes and edges
  - Collecting links (nodes) from a random sample of nodes (links) generates a non-random sample of links (nodes)
  - So random sampling does not generate an accurate picture of the network structure or of related statistics
  - Missing data generates non-classical measurement error
- Non-sampling measurement error
  - Censoring of the number of links, e.g. surveys may ask individuals to report their 3 best friends
  - Boundary specification, i.e. incorrectly restricting a network to be within a village, for instance
  - Miscoding, misreporting, non-response, etc

Sampling Induced Measurement Error
Broad facts emerging from the literature on using sampled network data to compute network statistics and parameter estimates:
1. Network statistics computed from samples containing moderate (30-50%) and even relatively high (~70%) proportions of a network can be highly biased. Sampling a higher proportion of the network generates more accurate network statistics.
2. Measurement error due to sampling varies with the underlying topology.
3. The magnitude of error in network statistics due to sampling varies with the sampling method. E.g. random sampling of nodes and edges under-estimates the proportion of nodes with high numbers of connections, while snowball sampling over-estimates them (since it misses nodes with a low number of edges).
4. Parameters in economic models using mismeasured network statistics are subject to substantial bias and inconsistency.

Sampling Induced Measurement Error
Bias in network statistics and parameter estimates in star and induced subgraphs

Statistic            Network Statistic          Parameter Estimates
                     Star        Induced        Star        Induced
Avg. degree          -           -              +           +
Avg. path length     not known   +              -           -
Spectral gap         ambiguous   ambiguous      -           -
Clustering coeff.    -           none/small     +           -
Avg. graph span      not known   +              +/-         +/-
In-Degree            small       -              -           +
Out-Degree           -           -              -           +
Degree Centrality    not known   -              not known   not known

- Betweenness centrality and Bonacich centrality distributions are uncorrelated with truth under random sampling of nodes
- Magnitude of bias in parameter estimates can range from -100% to over 20% and may even lead to different signs.

Non-Sampling Measurement Error
Consequences include:
- Top-coding and boundary mis-specification lead to under-estimation of average degree and over-estimation of average path length; no bias in the clustering coefficient
- Missing and spurious nodes and miscoded edges don't bias degree or eigenvector centrality
But the results above come from simulations which assume random errors, which may not match the actual missing data process.
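The effect of top-coding on average degree can be seen in a small sketch. The respondents, their reports and the cap of 3 names are hypothetical, and it is assumed that respondents simply stop after the first `cap` names.

```python
def censor_reports(reports, cap):
    """Keep at most `cap` named links per respondent (top-coding)."""
    return {node: names[:cap] for node, names in reports.items()}


def average_out_degree(reports):
    """Mean number of links named per respondent."""
    return sum(len(names) for names in reports.values()) / len(reports)


# Hypothetical true link reports for four respondents.
true_reports = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c"],
    "c": ["a", "b", "d", "e", "f"],
    "d": ["a"],
}

censored_reports = censor_reports(true_reports, cap=3)

print(average_out_degree(true_reports))      # uncensored average
print(average_out_degree(censored_reports))  # average after top-coding
```

Only respondents with more links than the cap are affected, so the censored average can never exceed the true one.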

Dealing with Measurement Error
Two broad ways of dealing with measurement error:
Analytical Corrections
- Correct network statistics using inverse probability weighting estimators
- Applies to network statistics that can be expressed in terms of sums
- And to sampling designs where the probability of being in the sample can be calculated:
  - Random sampling: can calculate exact sample inclusion probabilities for non-sampled quantity
  - Snowball sampling: exact probability can't be calculated, but there are methods to estimate these
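A minimal sketch of an inverse probability weighting correction for average degree follows. It assumes a specific design that the slides do not spell out: a simple random sample of n out of N nodes drawn without replacement, with links observed only between sampled nodes (induced subgraph). Under that design a link between i and j is observed with probability n(n-1)/(N(N-1)). The network and function names are hypothetical.

```python
from itertools import combinations


def link_inclusion_probability(n_sampled, n_total):
    """Probability that both endpoints of a given link are sampled."""
    return n_sampled * (n_sampled - 1) / (n_total * (n_total - 1))


def ipw_average_degree(edges, sampled, n_total):
    """Weight each observed link by the inverse of its inclusion probability."""
    observed = sum(1 for i, j in edges if i in sampled and j in sampled)
    pi = link_inclusion_probability(len(sampled), n_total)
    estimated_links = observed / pi
    return 2 * estimated_links / n_total


# Hypothetical undirected network on five nodes.
nodes = [1, 2, 3, 4, 5]
edges = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)}
true_average_degree = 2 * len(edges) / len(nodes)

# Average the estimator over every possible sample of three nodes.
estimates = [
    ipw_average_degree(edges, set(sample), len(nodes))
    for sample in combinations(nodes, 3)
]
mean_estimate = sum(estimates) / len(estimates)

print(true_average_degree, mean_estimate)
```

Individual estimates vary from sample to sample, but averaged over all possible samples the weighted estimator recovers the true average degree, which is the sense in which the correction removes the bias.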

Dealing with Measurement Error
Model-Based Corrections
- Involves estimating a statistical network formation model to predict missing links
- Leading models are Exponential Random Graph Models (ERGMs)
- Estimate a separate model per network
- Need information on some characteristics that predict link existence for all nodes (e.g. head age, education, land ownership, wealth, caste, etc)

Summary
- Summarised methods for collecting networks data
  - Direct elicitation: what network to collect, boundary specification, etc
  - Impute from existing data
- Discussed ways of sampling
  - Necessary since it is costly and often infeasible to conduct a complete census
  - But sampling leads to substantial measurement error and bias in network statistics and parameter estimates
- Ways of dealing with measurement error
  - Analytic corrections
  - Model-based corrections: need some information on the non-sampled nodes to be able to predict links

Graph Theory Basics
- Mathematically, networks are represented as graphs
- Graphs contain nodes and edges (links)
- Graphs can be directed or undirected; links can be weighted (e.g. with frequency of interaction)
- Directed, unweighted graph $G = (N, E)$:
  - $j \in N$ if $j$ is a node in the network
  - $ij \in E$ if $i$ and $j$ are linked

Graph Theory Basics
Network links can also be represented by an adjacency matrix
[Figure: example adjacency matrices, one unweighted and one weighted; the individual entries are not recoverable from the extracted text]
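As an illustration of the adjacency matrix representation, the sketch below builds one from a list of links. The four-node directed network and the function name are hypothetical; entry (i, j) holds the weight of the link from i to j, with weight 1 for an unweighted graph and 0 where no link exists.

```python
def adjacency_matrix(nodes, links):
    """Build a matrix A with A[i][j] = weight of the link from i to j."""
    index = {node: k for k, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for i, j, weight in links:
        matrix[index[i]][index[j]] = weight
    return matrix


# Hypothetical directed, unweighted network: (from, to, weight).
nodes = [1, 2, 3, 4]
directed_links = [(1, 2, 1), (2, 3, 1), (3, 1, 1), (4, 1, 1)]

A = adjacency_matrix(nodes, directed_links)
out_degree = [sum(row) for row in A]        # row sums
in_degree = [sum(col) for col in zip(*A)]   # column sums

for row in A:
    print(row)
print(out_degree, in_degree)
```

For a weighted graph the same function can be used by passing, for example, the frequency of interaction as the third element of each link.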

Network Representation and Measures
- Work in other disciplines has shown that features of the graph provide measures for concepts such as power, etc
- Well-known graph measures include:
  - Degree: number of other nodes that a node is linked with
  - Clustering: if i is linked with j and k, what is the probability that j and k are also linked with one another?
  - Geodesic: shortest path (number of links) that i has to travel through to get to j
  - Diameter: largest geodesic between any 2 nodes in the network
  - Centrality: captures a node's position in a network
    - Degree centrality: how connected a node is
    - Closeness centrality: how easily a node can reach others
    - Betweenness: the importance of a node in connecting others
    - Eigenvector centrality: how connected a node is to other important nodes
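Several of these measures can be computed directly from a list of each node's neighbours. The sketch below is a minimal illustration on a hypothetical undirected, connected network; the function names are chosen for this example, and clustering is computed as the share of pairs of a node's neighbours that are themselves linked.

```python
from collections import deque


def degree(graph, i):
    """Number of other nodes that node i is linked with."""
    return len(graph[i])


def clustering(graph, i):
    """Share of pairs of i's neighbours that are linked with one another."""
    nbrs = list(graph[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if nbrs[b] in graph[nbrs[a]]
    )
    return linked / (k * (k - 1) / 2)


def geodesics_from(graph, source):
    """Shortest path length from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(graph):
    """Largest geodesic between any two nodes (assumes a connected graph)."""
    return max(max(geodesics_from(graph, s).values()) for s in graph)


# Hypothetical undirected network: node -> set of neighbours.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

print(degree(graph, 3))
print(clustering(graph, 3))
print(geodesics_from(graph, 1)[5])
print(diameter(graph))
```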