Data Mining Project History in Open Source Software Communities




Yongqin Gao (ygao@nd.edu), Yingping Huang (yhuang3@nd.edu), Greg Madey (gmadey@nd.edu)

Abstract

Understanding the Open Source Software (OSS) movement has come into focus for many researchers because of the recent rapid expansion of OSS communities. SourceForge, the data source for this research, is one of the largest OSS communities. While most existing research on OSS communities focuses on the community itself, our research focuses on one component of the community: the project. With a full database dump from SourceForge, we are able to represent project history using monthly statistics. Our goal is to find the critical statistical features in the history of individual projects; based on these features, we can then generate rules to predict the future of a project. In this paper, we present the method in three steps. First, we prepare the data for data mining, which includes data formatting and feature extraction. Then, we use the non-negative matrix factorization (NMF) method to select independent, significant features. In particular, we reduce the size of the feature set from 63 to 10. Feature selection dramatically improves the performance of the following step without sacrificing accuracy. Finally, we use data mining techniques (clustering and summarization) to find the rules and the related features. This research uses the Oracle Data Mining toolkit. Using the method described in this paper, we are able to predict the future of a project with up to 78% confidence.

Contact: Yongqin Gao, Dept. of Computer Sciences & Engineering, Notre Dame, IN 455. Tel: -574-3-759 Fax: -574-3-920 Email: ygao@nd.edu

Key Words: Open Source Software, Data Mining, Clustering, Feature Selection

Open Source Software (OSS) development [Drummond, 1999] is a classic example and prototype of collaborative social networks. Open source software ("Open source" is a certification mark owned by the Open Source Initiative, http://www.opensource.org), usually created by volunteer programmers dispersed worldwide, now competes with software developed by commercial firms. This is due in part to the risks that closed-source proprietary software poses to users. These risks may be summarized as systems taking too long to develop, costing too much, and not working very well when delivered. Open source software offers an opportunity to surmount those disadvantages of proprietary software.

SourceForge (http://sourceforge.net) is one of the best-known OSS hosting web sites; it offers features such as bug tracking, project management, forum service, mailing list distribution, CVS and more. It had more than eighty thousand developers and fifty thousand projects as of spring 2003, which makes it a good sample for studying the underlying mechanisms of OSS communities.

The rest of the paper is structured as follows. Section 2 presents the problem definition and objectives. Section 3 surveys the state of the art of related research. Section 4 discusses the data preparation. Section 5 delves into the feature extraction and data mining, along with some discussion. Finally, Section 6 is the conclusion.

Problem Definition and Objectives

Although there is more and more research about the OSS movement, most of it is focused on the community itself, as we describe in the following section. In this paper, we focus on the history of individual projects.
By inspecting the project history, we are able to discover the critical features that determine the attractiveness of a project and the rules for judging its success, and eventually to predict the future of a project based on this information. This research into the project history of Open Source Software can also help us understand other collaborative networks and improve other kinds of software development. We use data mining techniques to investigate the project history; specifically, the techniques are data clustering and data summarization. Our research can be divided into three steps:

1. Data preparation: Since we cannot apply the data mining methods directly to the data stored in the database, we have to process the data first. The two main tasks in data preparation are data formatting and feature extraction.
2. Feature selection: The data mining techniques we use are all feature-based, but too many features can degrade performance dramatically. Fortunately, many of the features are dependent or insignificant, so in this step we use feature selection to keep the independent and most significant features.
3. Data mining: We use clustering methods to generate clusters of projects based on the features of their histories. Summarization is then used to generate the rules.

Related Works

Research about Open Source Software has been growing rapidly. There are four major perspectives:

1. Social science: Recent research includes Chan & Lee [Chan & Lee, 2004] and Hemetsberger [Hemetsberger, 2004]. Their papers focus on economic and business research about Open Source Software.
2. Software engineering: Recent research includes Scacchi [Scacchi, 2003] and Elliott [Elliott & Scacchi, 2004]. Their research mainly addresses the evolution of Open Source Software and its influence on traditional software development.
3. Topology analysis: Recent research includes Gao [Gao & Madey, 2003] and Xu [Xu & Gao, 2003]. Their studies focus on understanding the evolution of the open source software community from the network perspective.
4. Data mining: Recent research includes Chawla [Chawla, 2003], which mines open source software data using an ARN (Association Rules Network).

Data Preparation for Data Mining

The original project history table includes descriptive and statistical attributes; each record stands for one project in one month. There are 24 attributes in every record of the table. We selected 21 statistical attributes for this research, which are listed and explained in Table 1. The total number of records in the table is 098777.

Attribute           Description
Developers          Number of core developers in the group
Downloads           Number of downloads
Site_views          Number of views of the website for the group
Subdomain_views     Number of views of the subdomain for the group
Page_views          Number of views of pages for the group
File_releases       Number of file releases for the group
Msg_posted          Number of messages posted in the group forum
Bug_opened          Number of bugs opened for development
Bug_closed          Number of bugs closed for development
Support_opened      Number of developing jobs opened for support (feature request)
Support_closed      Number of developing jobs closed for support (feature request)
Patches_opened      Number of jobs opened for patch developing
Patches_closed      Number of jobs closed for patch developing
Artifacts_opened    Number of artifacts opened
Artifacts_closed    Number of artifacts closed
Tasks_opened        Number of tasks assigned to developers
Tasks_closed        Number of tasks finished by developers
Help_requests       Number of help requests made by users
CVS_checkouts       Number of CVS checkout activities
CVS_commits         Number of CVS commit activities
CVS_adds            Number of CVS add activities

Table 1: Project history statistical attributes

We extracted the first six monthly values of every attribute for every project that has at least six months of history in the table. Specifically, we collected six values of every attribute for every project. This six-month time series of every attribute for every project is the history we investigate. In time series analysis, there are two general aspects: trend and seasonality. The former represents a general systematic linear or nonlinear component that changes over time and does not repeat within the time range. The latter represents a pattern that repeats itself at systematic intervals over time.
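The extraction step described above (keep only projects with at least six months of history, then take the first six monthly values of each attribute) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record layout `(project_id, month, values)`, the function name `first_months`, and the toy data are all hypothetical and do not reflect the actual SourceForge schema.

```python
from collections import defaultdict

MONTHS = 6  # window length used in the text: the first six months


def first_months(records, months=MONTHS):
    """Return {project_id: {attribute: [first `months` values]}}.

    `records` is an iterable of (project_id, month, values) tuples, where
    `month` is any sortable month key and `values` maps attribute names to
    numbers. This layout is hypothetical, not the SourceForge schema.
    """
    by_project = defaultdict(list)
    for project_id, month, values in records:
        by_project[project_id].append((month, values))

    histories = {}
    for project_id, rows in by_project.items():
        if len(rows) < months:
            continue  # fewer than six months of history: project is skipped
        rows.sort(key=lambda row: row[0])  # chronological order
        window = rows[:months]
        histories[project_id] = {
            name: [values[name] for _, values in window]
            for name in window[0][1]
        }
    return histories


# Toy data: project "a" has seven monthly records, project "b" only three.
toy = [("a", m, {"Downloads": 10 * m, "Developers": 2}) for m in range(1, 8)]
toy += [("b", m, {"Downloads": m, "Developers": 1}) for m in range(1, 4)]

histories = first_months(toy)
print(histories["a"]["Downloads"])  # [10, 20, 30, 40, 50, 60]
```

Project "b" is dropped because it has fewer than six monthly records, and the seventh month of project "a" is ignored.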
In our research, each time series has just six values, so we investigate only the trend aspect of the time series. Since data mining is a feature-based analysis, we need to generate features that describe the trends of the time series. No single value in a time series is suitable as a feature for describing the trend, so instead of using the monthly values of an attribute as features, we extracted three features that describe the trend of the time series for every attribute in every project. These features are the targets of the subsequent data mining. The equations for calculating these features are as follows:

\[ v_{k,i} = \frac{V_{k,i}}{V_{k,1}}, \qquad F_{k,1} = V_{k,1} \]

\[ F_{k,2} = (v_{k,2} - v_{k,1}) + (v_{k,3} - v_{k,2}) + (v_{k,4} - v_{k,3}) + (v_{k,5} - v_{k,4}) + (v_{k,6} - v_{k,5}) \]

\[ F_{k,3} = \frac{\sum_{i=1}^{6} (i - \bar{\imath}) (v_{k,i} - \bar{v}_k)}{\sqrt{\sum_{i=1}^{6} (i - \bar{\imath})^2} \, \sqrt{\sum_{i=1}^{6} (v_{k,i} - \bar{v}_k)^2}}, \qquad \bar{v}_k = \frac{1}{6} \sum_{j=1}^{6} v_{k,j}, \quad \bar{\imath} = \frac{1}{6} \sum_{j=1}^{6} j \]

where k = 1..21 indexes the attributes, i = 1..6 and j = 1..6 index the months, V_{k,i} is the original value of attribute k in the ith month, and v_{k,i} is the normalized value of attribute k in the ith month for a given project. The baseline of the normalization is the value of attribute k in the first month, V_{k,1}. There are thus 3 features for every attribute in every project. Feature 1 is the starting point of the time series, the original value of the first month. Feature 2 is the accumulated difference between normalized adjacent months. Feature 3 is Pearson's correlation coefficient between the normalized time series and the X coordinate (the month index). In total there are 63 features for every project.

Feature selection and data mining

After data preparation, we use the NMF (Non-Negative Matrix Factorization) method to select independent and significant features. Feature selection reduced the size of the feature set from 63 to 10, which significantly improved program performance without sacrificing accuracy. The most significant attributes include file_releases, developers, help_requests and task activities.

With the features selected, we can start the data mining procedures. First, we use a clustering method to group the projects by the similarity of their histories. The clustering method we used is k-means. The result is a cluster tree with 10 leaf clusters, which represent the fine-grained clusters of the projects. The sizes of these leaf clusters are listed in Table 2.

Cluster ID Size 320 2 322 3 5 4 5 2 422 7 2 8 5 9 0 22 Total 55723

Table 2: Cluster distribution

From these clusters, we can summarize the representative feature sets of the different project clusters and generate rules that describe these clusters. Here are two of the rules:

1. if F_{file_releases,2} in [.050, 20.040] and F_{task_closed,2} in [.238, 2.590] and F_{help_requests,2} in [.385, 9.908] then
2. if F_{file_releases,2} in [20.040, 39.030] and F_{task_closed,2} in [.238, 2.590] and F_{help_requests,2} in [35.477, 44] then 2

where the value after "then" is a cluster ID as in Table 2.

To evaluate the correctness of these rules, we also manually classified part of the data into 4 classes (FAIL, TOP0, TOP500 and NORMAL). FAIL represents the projects that had failed (number of developers is 0) by the end of the sixth month; TOP0 and TOP500 represent the top 0 and top 500 projects at the end of the sixth month; NORMAL represents the other projects. We evaluate the rules by comparing the automatic clusters with the manual classes, which gives the support and confidence of each rule. For the sample rules above, the support for rule 1 is 0.995 and its confidence is 0.78; the support for rule 2 is 0.2 and confidence is.

One interesting discovery from the evaluation is that the rules for TOP0 and TOP500 all have very small support (smaller than 0.00). This is because very few projects actually belong to these classes; for example, just 0 projects belong to TOP0, while the total number of projects is over 50,000. This method is therefore not well suited to discovering successful projects, because they are so rare. We should use an outlier detection method to investigate these projects. However, we also find that this method can predict failed projects from the project history with good confidence.

References

[Drummond, 1999] G. Drummond, 1999, "Open Source Software and Documents: A Literature and Online Resource Review", http://www.omar.org/opensource/litreviewa.

[Chan & Lee, 2004] Tzu-Ying Chan & Jen-Fang Lee, 2004, "A Comparative Study of Online Communities Involvement in Product Innovation and Development", National Cheng Chi University working paper, Taiwan.

[Hemetsberger, 2004] Andrea Hemetsberger, 2004, "Fostering Cooperation on The Internet: Social Exchange Processes in Innovative Virtual Consumer Communities", University of Innsbruck.

[Scacchi, 2003] Walt Scacchi, 2003, "When is Free/Open Source Software development faster, better and cheaper than software engineering", Working paper, Institute for Software Research, UC Irvine.

[Elliott & Scacchi, 2004] Margaret S. Elliott & Walt Scacchi, 2003, "Free Software Development: Cooperation and Conflict in a Virtual Organizational Culture", Free/Open Source Software Development, Idea Publishing.

[Gao & Madey, 2003] Yongqin Gao & Greg Madey, 2003, "Analysis and Modeling of the Open Source Software Community", NAACSOS, Pittsburgh, 2003.

[Xu & Gao, 2003] Jin Xu & Yongqin Gao, 2003, "A Docking Experiment: Swarm and Repast for Social Network Modeling", Agent, Chicago, 2003.

[Chawla, 2003] Sanjay Chawla, 2003, "Mining Open Source Software (OSS) Data Using Association Rules Network", University of Sydney, NSW, Australia, 2003.
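As a supplementary illustration, the three trend features from the data preparation section can be sketched in Python for a single attribute's monthly series. This is only one reading of the text, under stated assumptions: feature 2 is taken as the plain sum of differences between adjacent normalized months, feature 3 as Pearson's correlation between the normalized values and the month numbers 1..n, a zero first-month value is rejected because the normalization would be undefined, and a perfectly flat series is assigned a correlation of 0.0. The function name `trend_features` and these edge-case choices are hypothetical, not taken from the paper.

```python
import math


def trend_features(values):
    """Return (F1, F2, F3) for one attribute's monthly series.

    F1: original first-month value (the normalization baseline).
    F2: sum of differences between adjacent normalized months.
    F3: Pearson correlation between normalized values and month numbers.
    """
    first = values[0]
    if first == 0:
        # Assumption: the text does not say how a zero baseline is handled.
        raise ValueError("first-month value is zero; normalization undefined")

    v = [x / first for x in values]            # v_i = V_i / V_1
    f1 = first
    f2 = sum(b - a for a, b in zip(v, v[1:]))  # adjacent differences, summed

    months = range(1, len(v) + 1)              # X coordinate: 1, 2, ..., n
    mean_m = sum(months) / len(v)
    mean_v = sum(v) / len(v)
    cov = sum((m - mean_m) * (x - mean_v) for m, x in zip(months, v))
    var_m = sum((m - mean_m) ** 2 for m in months)
    var_v = sum((x - mean_v) ** 2 for x in v)

    # Assumption: a flat series has no trend, so report correlation 0.0.
    f3 = cov / math.sqrt(var_m * var_v) if var_v > 0 else 0.0
    return f1, f2, f3


f1, f2, f3 = trend_features([10, 20, 30, 40, 50, 60])
print(f1, f2, round(f3, 6))  # 10 5.0 1.0
```

Under this reading, feature 2 telescopes to the last normalized value minus the first, and the first normalized value is 1 by construction, so it reduces to the relative growth over the six months.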