A TOPOLOGICAL ANALYSIS OF THE OPEN SOURCE SOFTWARE DEVELOPMENT COMMUNITY



Similar documents
Open Source Software Developer and Project Networks

ModelingandSimulationofthe OpenSourceSoftware Community

Computational Discovery in Evolving Complex Networks

THE OPEN SOURCE SOFTWARE DEVELOPMENT PHENOMENON: AN ANALYSIS BASED ON SOCIAL NETWORK THEORY

COMPUTATIONAL DISCOVERY IN EVOLVING COMPLEX NETWORKS. A Dissertation. Submitted to the Graduate School. of the University of Notre Dame

Discovering Determinants of Project Participation in an Open Source Social Network

On Clusters in Open Source Ecosystems

A Framework to Represent Antecedents of User Interest in. Open-Source Software Projects

A Social Network Approach to Free/Open Source Software Simulation

Modeling and Simulating Free/Open Source Software Development Processes

Bug Fixing Process Analysis using Program Slicing Techniques

Social Network Dynamics for Open Source Software Projects

Introduction to Networks and Business Intelligence

Graph Mining Techniques for Social Media Analysis

Chapter 29 Scale-Free Network Topologies with Clustering Similar to Online Social Networks

Mining Archives and Simulating the Dynamics of Open-Source Project Developer Networks

Network/Graph Theory. What is a Network? What is network theory? Graph-based representations. Friendship Network. What makes a problem graph-like?

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

A discussion of Statistical Mechanics of Complex Networks P. Part I

Scientific Collaboration Networks in China s System Engineering Subject

Master Thesis. The influence of social network structure on the chance of success of Open Source software project communities

Intelligent Analysis of User Interactions in a Collaborative Software Engineering Context

Proposed Application of Data Mining Techniques for Clustering Software Projects

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

MINFS544: Business Network Data Analytics and Applications

How the FLOSS Research Community Uses Archives

An Interest-Oriented Network Evolution Mechanism for Online Communities

Complex Networks Analysis: Clustering Methods

Distance Degree Sequences for Network Analysis

Emergence of New Project Teams from Open Source Software Developer Networks: Impact of Prior Collaboration Ties

PUBLIC TRANSPORT SYSTEMS IN POLAND: FROM BIAŁYSTOK TO ZIELONA GÓRA BY BUS AND TRAM USING UNIVERSAL STATISTICS OF COMPLEX NETWORKS

DESIGN FOR QUALITY: THE CASE OF OPEN SOURCE SOFTWARE DEVELOPMENT

Open Source ERP for SMEs

Bug management in open source projects

The Impact of Release Management and Quality Improvement in Open Source Software Project Management

Graph models for the Web and the Internet. Elias Koutsoupias University of Athens and UCLA. Crete, July 2003

Data Mining Techniques Chapter 6: Decision Trees

Graph Mining and Social Network Analysis

Effects of node buffer and capacity on network traffic

Course Syllabus. BIA658 Social Network Analytics Fall, 2013

Social Network Mining

Graph theoretic approach to analyze amino acid network

The mathematics of networks

Transcription:

A TOPOLOGICAL ANALYSIS OF THE OPEN SOURCE SOFTWARE DEVELOPMENT COMMUNITY Jin Xu,Yongqin Gao, Scott Christley & Gregory Madey Department of Computer Science and Engineering University of Notre Dame Notre Dame, IN 46556 HICSS 38, Hawaii January 2005 Partially Supported by NSF, CISE/IIS-Digital Science and Technology

INTRODUCTION FLOSS: Complex/Self-organizing System Characteristic Structural/Topological Properties Size distributions Connectivity Social Networks Small-world Models Kevin Bacon Number Erdos Numbers Collaboration Networks Quantitative Structural/Topological Study of Project Communities Most developers participate on one project Some developers participate on more than one project Extension on prior study, using newer expanded data

PREVIOUS RESEARCH Social Networks Wasserman, S. Faust, K., Social Network Analysis, 1994. Watts, D., Small Worlds, 1999. Newman, M.E.J. "Clustering and Preferential Attachment in Growing Networks," 2001. Barabasi, A. L., Linked, 2003 Gao Y. Q., Vincent F., Madey G., "Analysis and Modeling of the Open Source Software Community", North American Association for Computational Social and Organizational Science (NAACSOS 2003), Pittsburgh, PA, 2003. Madey G., Freeh V., and Tynan R., "Modeling the F/OSS Community: A Quantitative Investigation," in Free/Open Source Software Development, ed., Stephan Koch, Idea Publishing, 2004.

PREVIOUS RESEARCH (cont.) Web Data Mining Xu J., Huang Y., Madey G., "A Research Support System Framework for Web Data Mining", Workshop on Applications, Products and Services of Web-based Support Systems at the Joint International Conference on Web Intelligence (2003 IEEE/WIC) and Intelligent Agent Technology, Halifax, Canada, October 2003. Howison J. and Crowston K.", "The perils and pitfalls of mining SourceForge", Proceedings of Mining Software Repositories Workshop, International Conference on Software Engineering (ICSE 2004), Edinburgh, Scotland, 2004.

PREVIOUS RESEARCH (cont.) Roles Classification Nakakoji K., Yamamoto Y., Kishida K., Ye Y., "Evolution Patterns of Open-source Software Systems and Communities", Proceedings of The International Workshop on Principles of Software Evolution, Orlando Florida, May 19-20, 2002. Xu N., An Exploratory Study of Open Source Software Based on Public Project Archives, Thesis, The John Molson School of Business, Concordia University, Canada, 2003.

OSS COMMUNITY User Group Passive Users: no direct attributable contribution in the data (downloads, user base, word-of-mouth publicity, etc.) Active Users: bug reports, patch submissions, feature requests, help requests, etc. Developer Group Peripheral Developer: irregularly contribute Central Developer: regularly contribute Core Developer: extensively contribute, manage CVS releases and coordinate peripheral developers and central developers. Project Leader: guide the vision and direction of the project.

OSS DEVELOPMENT COMMUNITY Project Leaders Active Users Core Developers Co-developers

OSS SOCIAL NETWORK Two Entities: developer (community member) & project Project-Developer Network (bipartite graph) Nodes: project, developer Edges: developers in a project are connected to that project

PROJECT-DEVELOPER NETWORK

OSS SOCIAL NETWORK (cont.) Project Network (unipartite graph) Node: project Edge: two projects are connected if there is a developer participating on both. Developer Network (unipartite graph) Node: developer Edge: developers participating in a common project are connected to each other

OSS DEVELOPER NETWORK

DATA COLLECTION & EXTRACTION Data Source: SourceForge 2003 database dump, plus earlier web-mined data SourgeForge.net approaching: 100,000 registered projects 1,000,000 registered users Project Leaders & Core Developers Identification explicitly stored in data dump Co-developers & Active Users Identification indirectly available Forums: ask and/or answered questions Artifacts: bug reports, patch submissions, feature requests, help requests, etc.

SUBSET OF THE DATABASE SCHEMA

Analysis: SourceForge.net Level

MEMBER DISTRIBUTION Project Size Project Count Project Leaders Core Developers Codevelopers Active Users! 88 64847 47.8% 20.6% 19.8% 11.8% <! 88 279 193 2.1% 5.7% 60.3% 31.7% < 279 70 0.9% 2.7% 55.8% 40.6%

TOPOLOGICAL PROPERTIES Degree Distribution The total number of links connected to a node Relative frequency of each index value Power law distribution Diameter The maximum longest shortest-path The average longest shortest path Cluster A social network consists of connected nodes Clustering Coefficient The ratio of the number of links to the total possible number of links among its neighbors

FOUR SETS Subset A = { project leaders} Subset B = {project leaders} U {core developers} Subset C = {project leaders} U {core developers} U {co-developers} Subset D = {project leaders} U {core developers} U {co-developers} U {active users} Prior work looked at Subset B only

DEVELOPER DEGREE DISTRIBUTION Subset A Subset B

DEVELOPER DEGREE DISTRIBUTION Subset C Subset D

REGRESSION PARAMETERS A B C D Project- Network R-squared 0.9396 0.9704 0.6905 0.7221 Slope -3.5841-2.6968-1.3020-1.2220 Developer -Network R-squared 0.9870 0.9846 0.9469 0.9830 Slope -3.3747-3.4676-3.7793-3.2743

DEGREE DISTRIBUTION Prior research suggests that OSS community network is a scale free network growing by two rules: Sequential addition of new developers Preferential attachment Related research on mechanisms that could plausibly generate observed topologies All degree distributions are skewed Most community members participate on 1 project Linchpin members join multiple projects The largest number of communities a member joins increases from 29 (A) to 95 (D)

Diameter length Subset A N/A DIAMETER Subset B 10.2 out of 83118 members Subset C 2.7 out of 139570 members Subset D 2.7 out of 161691 members Project leader network is highly disconnected The degree of separation is significantly decreased with the participation of codevelopers and active users.

CLUSTERS & CLUSTERING COEFFICIENT Property A B C D Largest cluster 737 15091 30794 40175 2 nd largest cluster 197 34 20 20 # of clusters 43826 34280 27983 21659 Clustering coefficient 0.8406 0.8078 0.8867 0.8297 All linked projects form a project cluster The largest cluster is much bigger than the 2 nd largest cluster The largest cluster grows from A to D High clustering coefficient on all 4 subsets because members are fully collected in each project

A PROJECT CLUSTER

DISCUSSION & FUTURE WORK Small World Phenomenon Small diameter & high clustering coefficient Scale Free Power law distribution Effect of Co-developers & Active Users Agent-based Simulation More OSS Web Sites: Apache, Bio.org, Savannah, etc.

THANK YOU