The Efficiency of Open Source Software Development

Stefan Koch
Department of Management, Bogazici Üniversitesi, Istanbul

Koch, S. (2008) "Effort Modeling and Programmer Participation in Open Source Software Projects", Information Economics and Policy, 20(4): 345-355.

Contents
1. Introduction
2. Methodology and Data Set
3. Effort Estimation and Comparison
   1. Participation-based estimation
   2. Product-based estimation
   3. Comparison and results
4. Conclusions and Future Work

Introduction
"It would cost over $1 billion to develop Red Hat 7.1 by conventional proprietary means."
David A. Wheeler (More Than a Gigabuck: Estimating GNU/Linux's Size, http://www.dwheeler.com/sloc/redhat71-v1/redhat71sloc.html, 2001)
But what did it really cost, and what does that mean?

Introduction
Question of the efficiency of open source development
- efficiency denotes the relation between input and output
- in software development, input is effort, output is software
- How much software did we get for our effort? More or less than investing the same effort in conventional development?
- Is OS development a waste of resources? Should we adopt the methodology for other areas?
Discussion without much empirical basis
- OS development is fast and cheap, provides high quality
- finding bugs late in the life-cycle is inefficient
- huge effort for OS development is spread and hidden
- ... (e.g. IEEE Software 1999)

Introduction
Idea: comparison between two estimated efforts
- effort based on participation data
- effort based on product, using commercial development assumption
- comparison
[Figure: input, process, output diagram; a source code fragment illustrates the output]

Methodology and Data Set
Empirical research in open source
- analysis needs to be based on quantitative information from a variety of project forms and sizes
- case studies of well-known projects are helpful
- a general assessment is still outstanding
Mining software repositories
- source code versioning systems, bug reporting systems, mailing lists, ... contain a wealth of information on the underlying software and the associated development processes
- cost-effective (no additional instrumentation necessary)
- does not influence the process
- longitudinal data available
- especially suited to OSS development (open, all information included)

Methodology and Data Set
SourceForge.net
- software development and hosting site, variety of services offered to hosted projects
- statistics published (e.g. activity) are not detailed enough
Method
- automatically extract data from web pages
- generate CVS commands for download and logging
- parsing of results
- storage in a database (e.g. size, date, programmer of each checkin)
- mostly using Perl scripts
- analyses with a statistical package

Methodology and Data Set
Data from
- 8,621 projects
- 7,734,082 commits
- 663,801,121 LOCs added
- 87,405,383 deleted
- 2,474,175 single files
- 12,395 distinct programmers
Distribution
- distribution of assets available, i.e. programmers, and resulting outcome very skewed
- distribution of effort within projects very skewed (top decile: 79% of code base, second decile: 11%)
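The decile figures above can be computed from per-programmer LOC totals along these lines. This is a minimal sketch; the function name `decile_shares` and the input layout are assumptions, and the numbers in the usage note are made up to mirror the reported shares.

```python
def decile_shares(loc_by_programmer):
    """Share of the total code contributed by each decile of programmers.

    Programmers are ranked from most to least productive; element 0 of the
    result is the top decile. Assumes a non-empty list with a positive total.
    """
    values = sorted(loc_by_programmer, reverse=True)
    total = sum(values)
    n = len(values)
    shares = []
    for d in range(10):
        start = d * n // 10
        end = (d + 1) * n // 10
        shares.append(sum(values[start:end]) / total)
    return shares
```

For an illustrative project with ten programmers contributing `[79, 11, 3, 2, 1, 1, 1, 1, 1, 0]` lines, the first two shares are 0.79 and 0.11.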

Participation-based Estimation
Effort estimation based on actual participation
- active programmers as base
- number of persons with >0 commits in a time interval (e.g. one month)
- high correlation with LOC added in the month
Approaches
- cumulation of active programmer months
- manpower function modeling: function of people in the project over time
Conversion necessary in both cases
- an open source person-month is not equal to a commercial person-month
- several studies, e.g. 18.4 hours per week spent on OSS development (Linux kernel developers)
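The cumulation approach can be sketched as follows. The 18.4 hours per week comes from the slide; the 40-hour commercial week used for the conversion, the function names, and the `(author, month)` input layout are assumptions made for illustration.

```python
from collections import defaultdict


def active_programmers_per_month(commits):
    """Count persons with at least one commit per month.

    `commits` is an iterable of (author, "YYYY-MM") pairs (assumed layout).
    """
    active = defaultdict(set)
    for author, month in commits:
        active[month].add(author)
    return {month: len(authors) for month, authors in sorted(active.items())}


def cumulated_effort_years(active_counts, hours_per_week=18.4, full_time_hours=40.0):
    """Cumulate active programmer months and convert to commercial person-years.

    The conversion factor hours_per_week / full_time_hours is an assumption;
    only the 18.4 hours figure is taken from the slides.
    """
    programmer_months = sum(active_counts.values())
    return programmer_months * (hours_per_week / full_time_hours) / 12.0
```

With two active programmers in one month and one in the next, this gives three active programmer months, or 3 × 0.46 / 12 = 0.115 person-years under the assumed conversion.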

Participation-based Estimation
Manpower function modeling
Norden-Rayleigh model (first published 1960) for R&D projects in general
- a project is a set of problems to be solved
- the number of problems is unknown but finite
- finding and solving problems is accomplished by the creative effort of applied manpower
- the event of solving a problem is independent and random (follows a Poisson distribution)
- the number of persons gainfully employed in a project depends on the number of problems known but not solved
- linear learning rate over time: $p(t) = 2at$
- manpower function: $m(t) = C'(t) = 2Kat\,e^{-at^{2}}$
- Rayleigh-type curve (increasing from 0 up to the peak point, where $K$ can be computed, and back to 0 again)
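A small numerical sketch of the Norden-Rayleigh curve may help. It uses the standard cumulative form $C(t) = K(1 - e^{-at^2})$, whose derivative is the manpower function above. Setting $m'(t) = 0$ gives the peak at $t = 1/\sqrt{2a}$, and substituting back yields $K = m_{peak} \cdot t_{peak} \cdot \sqrt{e}$, which is how $K$ can be computed at the peak. The function names are illustrative.

```python
import math


def manpower(t, K, a):
    """Norden-Rayleigh manpower m(t) = 2*K*a*t*exp(-a*t^2)."""
    return 2.0 * K * a * t * math.exp(-a * t * t)


def cumulative_effort(t, K, a):
    """Cumulative effort C(t) = K*(1 - exp(-a*t^2)); approaches K for large t."""
    return K * (1.0 - math.exp(-a * t * t))


def peak_time(a):
    """Time of maximum manpower, from m'(t) = 0: t = 1/sqrt(2a)."""
    return 1.0 / math.sqrt(2.0 * a)


def total_effort_from_peak(t_peak, m_peak):
    """Recover total effort K from the observed peak: K = m_peak * t_peak * sqrt(e)."""
    return m_peak * t_peak * math.sqrt(math.e)
```

For example, with `K = 100` and `a = 0.02` the peak falls at `t = 5`, and feeding the peak manpower back into `total_effort_from_peak` returns 100 again.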

Participation-based Estimation
Example: GNOME
- 2 variants with different starting times
- no decline visible
- introduction of new requirements?

Participation-based Estimation
Manpower function modeling revisited
- include the introduction of new requirements
- several possible new functions proposed
  - number of additional problems introduced proportional to the number of problems in the starting set
  - number of problems introduced proportional to the number of problems already solved
  - with different learning rates
  - plus a linear and a quadratic model
- overall 9 models
- all fitted for SourceForge.net projects (and 2 subsets) and compared one against the other (Wilcoxon signed-rank test with Bonferroni correction)
- modified functions outperform; best is quadratic learning rate and proportional to starting set
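The pairwise model comparison can be sketched in plain Python. This is not the statistical package used for the study: it is a simplified Wilcoxon signed-rank test using the normal approximation, without continuity or tie correction of the variance, followed by a Bonferroni threshold. The function names are assumptions, and at least one non-zero paired difference is assumed.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples (normal approximation).

    Returns (W_plus, p_value). Zero differences are dropped; tied absolute
    differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the normal
    return w_plus, p_value


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests that remain significant after dividing alpha by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With 9 models compared pairwise there are 36 tests, so under a 0.05 level each individual test would have to reach p < 0.05 / 36 to count as significant.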

Product-based Estimation
Estimation models from commercial context
- COCOMO 81 and COCOMO II (with different parameters)
- function points (several different estimation formulae)
- overall 6 different estimations per project
Product estimated
- measured in lines-of-code at end of observation
- function points derived from lines-of-code (using backfiring conversion rate per programming language)
- current status seen as end product of commercial software development effort
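As an illustration of a product-based estimate, the sketch below applies the basic COCOMO 81 equation, effort = a × KLOC^b person-months, with Boehm's published basic-model coefficients. Which COCOMO variant, mode, and parameters were used for the slides is not stated, so the choice here is an assumption, as are the function names. The backfiring helper takes the LOC-per-function-point rate as a parameter because the rate depends on the programming language.

```python
# Basic COCOMO 81 coefficients (a, b) per development mode.
COCOMO81_BASIC = {
    "organic": (2.4, 1.05),
    "semidetached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def cocomo81_effort_years(loc, mode="organic"):
    """Basic COCOMO 81 effort in person-years for a product of `loc` lines of code."""
    a, b = COCOMO81_BASIC[mode]
    person_months = a * (loc / 1000.0) ** b
    return person_months / 12.0


def backfire_function_points(loc, loc_per_function_point):
    """Derive function points from lines of code via a backfiring conversion rate."""
    return loc / loc_per_function_point
```

A 1,000-line product in organic mode comes out at 2.4 person-months, i.e. 0.2 person-years. The resulting function point count would feed one of the function point effort formulae, which are not reproduced here.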

Comparison and Results
Comparison COCOMO and Norden-Rayleigh
- analytically possible, Norden-Rayleigh used by COCOMO
- find parameter values within COCOMO to generate the Norden-Rayleigh curve from participation
- for COCOMO 81 not possible, for COCOMO II very favourable parameters necessary
- hints at high efficiency
Comparison of all estimations
- all estimation models applied to projects
- note different time frames
  - product-based assumes end-product, as does cumulation
  - manpower function modeling does not

Comparison and Results

| Estimation | N | Min. | Max. | Mean | Std. Dev. | Median | Sum over all projects |
|---|---|---|---|---|---|---|---|
| Norden-Rayleigh (equation-based) | 8,620 | 0.06 | 200 | 0.69 | 3.72 | 0.19 | 5,965 |
| Norden-Rayleigh (regression) | 2,192 | 0.10 | 11,337.49 | 29.07 | 299.47 | 0.74 | 63,711 |
| Norden-Rayleigh mod. equation (4) | 944 | -2,996.65 | 1,001.53 | -117.62 | 210.23 | 0.11 | -111,037 |
| Norden-Rayleigh mod. equation (6) | 944 | -508.47 | 59,692.56 | 396.11 | 2,422.63 | 0.46 | 373,925 |
| Norden-Rayleigh mod. equation (8) | 649 | -295,987.41 | 14,288.15 | -2,197.98 | 21,352.57 | 0.35 | -1,426,491 |
| Norden-Rayleigh mod. equation (10) | 1,743 | -14.01 | 25.77 | 3.06 | 4.47 | 3.07 | 5,341 |
| Norden-Rayleigh mod. equation (12) | 755 | -23.44 | 35.00 | 1.44 | 3.31 | 0.66 | 1,089 |
| Active programmer effort years | 8,620 | 0.10 | 55 | 0.56 | 1.39 | 0.12 | 2,229 |
| COCOMO 81 | 8,621 | 0.00 | 4,159 | 18.56 | 133.66 | 2.02 | 160,020 |
| COCOMO II (nominal parameters) | 8,621 | 0.00 | 8,156 | 31.84 | 254.73 | 2.76 | 135,112 |
| COCOMO II (realistic parameters) | 8,621 | 0.00 | 3,539 | 15.67 | 113.55 | 1.69 | 274,482 |
| Function Point (Albrecht and Gaffney) | 8,621 | -7.34 | 3,808 | 12.36 | 126.51 | -4.67 | 106,528 |
| Function Point (Kemerer) | 8,621 | -10.17 | 3,650 | 8.73 | 121.35 | -7.61 | 75,249 |
| Function Point (Matson et al., linear) | 8,621 | 0.32 | 1,069 | 5.84 | 35.42 | 1.07 | 50,316 |
| Function Point (Matson et al., log.) | 8,559 | 0.00 | 869 | 4.52 | 28.93 | 0.62 | 38,695 |

Comparison and Results
Results
- series of Wilcoxon signed-rank tests (also 2 subsets)
- Norden-Rayleigh significantly below all product-based estimates
- cumulation significantly below (but different time frame)
Interpretation
- OS development more efficient by using self-selection for task management etc.
- extremely high amount of non-programmer participation (1:7 relation, chief programmer team organisation)

Conclusions and Future Work
Efficiency of open source development
- first results promising
- but several explanations (efficiency, non-programmer participation and special organisation, ...)
- effort estimation problematic, development of new models necessary
- effort logging of participants would be interesting
- significance of results for method adoption in both directions
- developers would not abandon OS development because of different motivations, but do not want to waste their time either

Conclusions and Future Work
Future Work
- inclusion of quality indicators
- application of Data Envelopment Analysis (DEA)
  - non-parametric efficiency comparison method
  - multi-input, multi-output systems with factors on different scales
  - possibility to include different input factors (programmers, posters, commercial sponsorship, ...) and output factors (size, quality, downloads, ...)
- compare OSS projects for influences on efficiency (inequality, licence, tools, ...)
- compare OSS with closed-source projects