The Computer Experiment in Computational Social Science

Greg Madey, Yongqin Gao
Computer Science & Engineering, University of Notre Dame
http://www.nd.edu/~gmadey
Eighth Annual Swarm Users/Researchers Conference, University of Michigan, Ann Arbor, Michigan, USA, May 9-11, 2004
This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829.

Outline
- Background
- The epistemological questions
- Example research question
- Simulation
- Computer experiments
- Discussion

Background
- Two NSF projects using agent-based simulation
  1) Molecules and microbes as agents
  2) Free/Open Source Software developers as agents
- Primarily scientific investigations with IT tool building and simulation support
- How do you justify the use of simulation?
- From a philosophy of science perspective (not engineering), what do simulation results tell us?

Why Agent-based Approach for Molecules
[Figure: modeling approaches plotted by scale (size, temporal), from small (one molecule, nanoseconds) to large (large ecosystem, years), against detail (structure), from low (atom number percentage) to high (forces between atoms, electron density). Approaches shown: Elemental Cycling, Daisy, NOM1.0, StochSim, Connectivity Maps. Image copyright 1998, Thomas M. Terry, The University of Conn]

The Epistemological Questions
- How do we come to know social science knowledge?
- What do we (or should we) accept as support for propositions in social science research?
- Often real experiments are not possible
  - Only one real history
  - Ethical issues
- What role can simulation play in answering the above?
  - Does simulation have a role beyond fishing expeditions?
  - Does simulation just discover phenomena for real experiments?

Classical Scientific Method
1. Observe the world
   a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the goal of disproving the hypothesis
   a) If the experiment fails to disprove it, then the hypothesis is accepted (until replaced)
   b) If the experiment succeeds, then reject the hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated
4. Then add to the body of theory
   a) A new axiom/law
   b) A new model
   c) Then derive new deductions or model conclusions
(Note: Realism vs Instrumentalism)

The Computer Experiment

Agent-Based Simulation as a Component of the Scientific Method
[Diagram linking Observation (interesting phenomenon), Hypothesis (conceptual model), and Computer Experiment (agent-based simulation)]

Open Source Software (OSS)
GNU, Linux, Savannah
- Free
  - to view source
  - to modify
  - to share
  - of cost
- Examples
  - Apache
  - Perl
  - GNU
  - Linux
  - Sendmail
  - Python
  - KDE
  - GNOME
  - Mozilla
  - Thousands more

Example: F/OSS Study
- Online data
  - Screen scraping
  - Database dumps
- Modeling
  - Social network theory
  - Evolutionary assumptions
- Simulation
  - Verification and validation
  - Computer experiments
- Variation of Classical Scientific Method

Collaborative Social Networks
- Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001)
- Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003)
- Interlocking corporate directorships
- Terrorist networks
- Open-source software developers (Madey et al, AMCIS 2002)
Collaborators are nodes in a graph, and collaborative relationships are the edges of the graph => a framework to model data/phenomena

SourceForge
- VA Software
- Part of OSDN
- Started 12/1999
- Collaboration tools
- 70,000 Projects
- 90,000 Developers
- 800,000 Registered Users

Observations
- Web mining
  - Web crawler (scripts): Python, Perl, AWK, Sed
  - Monthly since Jan 2001
- ProjectID, DeveloperID
- Almost 2 million records
- Relational database

PROJ   DEVELOPER
8001   dev378
8001   dev8975
8001   dev9972
8002   dev27650
8005   dev31351
8006   dev12509
8007   dev19395
8007   dev4622
8007   dev35611
8008   dev8975
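The PROJ/DEVELOPER records can be turned into a collaboration network in a few lines. The sketch below is only an illustration, not the authors' tooling: it assumes the rows are available as (project, developer) pairs and links two developers whenever they share a project. The function name build_developer_network is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows copied from the PROJ/DEVELOPER listing in the slide.
RECORDS = [
    (8001, "dev378"), (8001, "dev8975"), (8001, "dev9972"),
    (8002, "dev27650"), (8005, "dev31351"), (8006, "dev12509"),
    (8007, "dev19395"), (8007, "dev4622"), (8007, "dev35611"),
    (8008, "dev8975"),
]

def build_developer_network(records):
    """Return {developer: set of co-developers} from (project, developer) rows."""
    members = defaultdict(set)
    for project, developer in records:
        members[project].add(developer)

    graph = defaultdict(set)
    for devs in members.values():
        for dev in devs:
            graph[dev]  # touching the key keeps solo developers as isolated nodes
        # Every pair of developers on the same project becomes an edge.
        for a, b in combinations(sorted(devs), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

network = build_developer_network(RECORDS)
```

Here dev8975 works on projects 8001 and 8008, so it ends up linked to dev378 and dev9972, while a developer alone on a project (such as dev27650) stays an isolated node.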

F/OSS Developers - Collaboration Social Network
Developers are nodes / Projects are links
24 Developers, 5 Projects, 2 Linchpin Developers, 1 Cluster
[Figure: network diagram of one cluster of developers, with edges labelled by Projects 6882, 7028, 7597, 9859 and 15850]

Topological Analysis of the Data
- Statistics inspected
  - Diameter
  - Average degree
  - Clustering coefficient
  - Degree distribution
  - Cluster size distribution
  - Relative size of major cluster
  - Fitness and life cycle
  - Evolution of these statistics
- Dual networks: developer network and project network
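Several of the statistics listed above have short textbook definitions. The following sketch computes four of them on a toy graph using only the standard library. It assumes the network is stored as a dict mapping each node to a set of neighbours, and all function names are illustrative rather than taken from the study.

```python
from collections import deque

def average_degree(graph):
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)

def clustering_coefficient(graph):
    """Mean local clustering; nodes with fewer than two neighbours count as 0."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Each edge among the neighbours is seen twice, hence the division by 2.
        links = sum(1 for a in nbrs for b in graph[a] if b in nbrs) / 2
        total += 2 * links / (k * (k - 1))
    return total / len(graph)

def bfs_distances(graph, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def clusters(graph):
    """Connected components, each returned as a set of nodes."""
    seen, result = set(), []
    for node in graph:
        if node not in seen:
            comp = set(bfs_distances(graph, node))
            seen |= comp
            result.append(comp)
    return result

def diameter_of_major_cluster(graph):
    largest = max(clusters(graph), key=len)
    return max(max(bfs_distances(graph, n).values()) for n in largest)

def relative_size_of_major_cluster(graph):
    return max(len(c) for c in clusters(graph)) / len(graph)

# Toy graph: a triangle a-b-c, a pendant node d on c, and an isolated node e.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c"}, "e": set()}
```

The all-pairs breadth-first search used for the diameter is quadratic in the number of nodes, so a network of SourceForge size would likely need sampling or a dedicated graph library.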

Degree Distribution: Developers

An Example Research Question
What processes can explain the evolution of the developer social networks?
- Randomly growing network (Erdos-Renyi, 1960)?
- Evolving network with preferential attachment (Barabasi-Albert, 1999)?
- Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)?
- Evolving network with preferential attachment and dynamic fitness (Madey et al, 2003)?
Can we use the computer experiment to test (falsify?) hypotheses about possible processes in the formation of the F/OSS developer network?
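The difference between the first two candidate processes can be shown with a toy growth model. This is a minimal sketch, not the Swarm model used in the study: it assumes one new node arrives per step and attaches to a single existing node, chosen either uniformly at random (a stand-in for the random-growth hypothesis, not the classical static Erdos-Renyi graph) or with probability proportional to degree (preferential attachment).

```python
import random

def grow_network(n_nodes, preferential, seed=0):
    """Grow a network one node at a time and return the list of node degrees."""
    rng = random.Random(seed)
    degrees = [1, 1]      # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]    # each node appears once per edge it belongs to
    for new in range(2, n_nodes):
        if preferential:
            # Picking a random edge endpoint selects a node with
            # probability proportional to its current degree.
            target = rng.choice(endpoints)
        else:
            target = rng.randrange(new)  # uniform over existing nodes
        degrees.append(1)
        degrees[target] += 1
        endpoints.extend([new, target])
    return degrees

random_growth = grow_network(5000, preferential=False)
preferential_growth = grow_network(5000, preferential=True)
```

Comparing max(random_growth) with max(preferential_growth) shows the qualitative difference: preferential attachment produces a few very highly connected hubs, which is the signature of a heavy-tailed degree distribution.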

Computer Experiments
- Agent-based simulations
  - Java programs using Swarm class libraries
  - Validation (docking) exercises using Java/Repast
- Artificial SourceForges (Epstein & Axtell, 1996)
- Parameterized with observed data, e.g., developer behaviors
  - Join rates
  - New project additions
  - Leave projects
- Evaluation of multiple models (hypotheses)
  - Grow artificial SourceForge
- Verification/falsification (simulation and hypothesis)
  - Ensemble averages of time series data
  - Distributions
  - Chi-squared tests
  - t-tests
  - Kolmogorov-Smirnov tests
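Two of the comparison tools named above are easy to state in code. The sketch below gives the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical cumulative distributions) and an ensemble average over repeated runs. It is an illustration under the assumption that empirical and simulated values are plain lists of numbers; the slides do not describe the authors' actual test code, and turning the statistic into a p-value is left to a statistics library.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the largest gap.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ensemble_average(runs):
    """Element-wise mean of equally long time series, one per simulation run."""
    return [sum(values) / len(values) for values in zip(*runs)]
```

A statistic near 0 means the simulated and empirical distributions are hard to tell apart, while a value near 1 means they barely overlap.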

Cycles of Modeling & Simulation
- Modeling (Hypothesis): social network models, ER => BA => BA + Fitness => BA + Dynamic Fitness
- Observation: analysis of SourceForge data (degree distribution, average degree, diameter, clustering coefficient, cluster size distribution)
- Agent-Based Simulation (Experiment): grow artificial SourceForge

Model for SourceForge
- ABM: collaborative social network
- Model description
  - Agent: developer
  - Behaviors: create, join, abandon and idle
  - Preference: developer's and project's
  - Fitness
- Four models in iterations: ER, BA, BA with constant fitness and BA with dynamic fitness
- Comparison of empirical and simulated data
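The four developer behaviors can be sketched as one simulation tick. This is an illustrative Python outline, not the authors' Java/Swarm code. The probabilities are placeholders (the slides say the real rates were parameterized from observed data), and joining picks a project uniformly at random, so the preferential and fitness-based variants would replace that single choice with a weighted one.

```python
import random

class Developer:
    def __init__(self, name):
        self.name = name
        self.projects = set()

def step(developers, projects, rng, probs=(0.1, 0.3, 0.1)):
    """One tick: each developer creates, joins, abandons or idles.

    probs = (create, join, abandon); the remainder is the idle probability.
    These values are placeholders, not rates measured from SourceForge.
    """
    p_create, p_join, p_abandon = probs
    for dev in developers:
        roll = rng.random()
        if roll < p_create:
            project_id = len(projects)  # projects are never deleted, so ids stay unique
            projects[project_id] = {dev.name}
            dev.projects.add(project_id)
        elif roll < p_create + p_join:
            if projects:
                project_id = rng.choice(list(projects))  # uniform choice (random model)
                projects[project_id].add(dev.name)
                dev.projects.add(project_id)
        elif roll < p_create + p_join + p_abandon:
            if dev.projects:
                project_id = rng.choice(sorted(dev.projects))
                dev.projects.discard(project_id)
                projects[project_id].discard(dev.name)
        # otherwise the developer idles this tick

rng = random.Random(42)
developers = [Developer(f"dev{i}") for i in range(50)]
projects = {}  # project id -> set of member names
for _ in range(20):
    step(developers, projects, rng)
```

After the run, the projects dict holds the same kind of project-to-developer membership as the observed PROJ/DEVELOPER records, so simulated and empirical networks can be analysed with the same statistics.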

ER Model
- Degree distribution: the simulated degree distribution is a normal distribution, while the empirical data follow a power law
- Fit fails!

BA Model
- Degree distribution: power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data)
- For developer distribution: simulated data has R² of 0.9798 and empirical data has R² of 0.9714
- For project distribution: simulated data has R² of 0.6650 and empirical data has R² of 0.9838
- Partial fit!
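The slides do not say how the R² values were obtained. One common reading, assumed here, is a least-squares line fitted to the degree distribution on log-log axes, where a power law appears as a straight line. The function name power_law_r_squared is hypothetical.

```python
import math
from collections import Counter

def power_law_r_squared(degrees):
    """R^2 of a least-squares line through the points (log k, log count(k)).

    Assumes at least two distinct positive degrees whose counts are not all equal.
    """
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    # For simple linear regression, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)

# Synthetic degrees whose counts follow count(k) = 64 / k^2 exactly.
exact_power_law = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
```

A value close to 1 means the points lie near a straight line on log-log axes. A high R² alone does not establish a power law, which is consistent with the slides listing chi-squared, t and Kolmogorov-Smirnov tests as additional checks.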

BA Model with Constant Fitness
- Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data)
- For developer distribution: simulated data has R² of 0.9742 and empirical data has R² of 0.9714
- For project distribution: simulated data has R² of 0.7253 and empirical data has R² of 0.9838
- Improved fit!

Discovery: Project Life Cycle

BA Model with Dynamic Fitness
- Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data)
- For developer distribution: simulated data has R² of 0.9695 and empirical data has R² of 0.9714
- For project distribution: simulated data has R² of 0.8051 and empirical data has R² of 0.9838
- Somewhat better fit!

Models of the F/OSS Social Network (Alternative Hypotheses)
- General model features
  - Agents are nodes on a graph (developers or projects)
  - Behaviors: create, join, abandon and idle
  - Edges are relationships (joint project participation)
  - Growth of network: random or types of preferential attachment, formation of clusters
  - Fitness
  - Network attributes: diameter, average degree, degree distribution, clustering coefficient
- Four specific models
  - ER (random graph) - (1960)
  - BA (preferential attachment) - (1999)
  - BA (+ constant fitness) - (2001)
  - BA (+ dynamic fitness) - (2003)

Discussion
- Is simulation better for falsification, but weaker at confirmation of hypotheses?
- Under what conditions can simulation results be accepted as confirmation of a hypothesis?
  - Need more validation/verification of simulations
  - Confidence in results
  - Case of computer proofs (four color problem in mathematics)
- Need for open source/open data
  - For replication of results?
  - For docking and model-to-model comparisons
- Or is the real value of the simulation for fishing around for developing new hypotheses? Discovery?
  - Hidden relationships/rules-of-operations
  - Hidden features of components
  - Black-box, grey-box, white-box models
  - Discovery by reverse engineering

Summary
- Why Agent-Based Modeling and Simulation?
  - Can be used as components of the Scientific Method
  - A research approach for studying socio-technical systems
- Case study: F/OSS - Collaboration Social Networks
  - SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness
  - Simulations
  - Computer experiments rejected some hypotheses and confirmed the plausibility of one
  - Provided insight into the phenomenon under study and guided data mining of collected observations
  - Provided focus for additional data collection and real experiments

Thank you