Computational Discovery in Evolving Complex Networks
Yongqin Gao
Advisor: Greg Madey
Dissertation Defense, December 2006
Outline Background Methodology for Computational Discovery Problem Domain OSS Research Process I: Data Mining Process II: Network Analysis Process III: Computer Simulation Process IV: Research Collaboratory Contributions Conclusion and Future Work
Background Network research is gaining more attention: the Internet, communication networks, social networks, software developer networks, biological networks. Understanding the evolving complex network. Goal I: Search. Goal II: Prediction. Computational scientific discovery.
Computational Discovery Our Methodology [Diagram: the researcher initializes and revises a discovery cycle of data mining, network analysis, computer simulation and research collaboratory; discoveries feed assessment and feedback, and community members contribute to and reference the results.]
Problem Domain Open Source Software Movement. What is OSS: free to use, modify and distribute, with source code available and modifiable. Potential advantages over commercial software: potentially high quality, fast development, low cost. Why study OSS (goals): software engineering (new development and coordination methods); open content (a model for other forms of open, shared collaboration); complexity (a successful example of self-organization/emergence).
Glory of OSS Number of Active Apache Hosts
Problem Domain SourceForge.net community The largest OSS development community: 134,751 registered projects, 1,439,773 registered users.
Our Data Set Problem Domain 25 monthly dumps since January 2003, totaling 460 GB and growing at 25 GB/month. Every dump has about 100 tables; the largest table has up to 30 million records. Experiment Environment: dual Xeon 3.06 GHz, 4 GB memory, 2 TB storage; Linux 2.4.21-40.ELsmp with PostgreSQL 8.1.
Related Research OSS research W. Scacchi, Free/open source software development practices in the computer game community, IEEE Software, 2004. K. Crowston, H. Annabi and J. Howison, Defining open source software project success, 24th International Conference on Information Systems, Seattle, 2003. Complex networks L.A. Adamic and B.A. Huberman, Scaling behavior of the world wide web, Science, 2000. M.E.J. Newman, Clustering and preferential attachment in growing networks, Physical Review E, 2001.
Process I: Data Mining Related Research: S. Chawla, B. Arunasalam and J. Davis, Mining open source software (OSS) data using association rules network, PAKDD, 2003. D. Kempe, J. Kleinberg and E. Tardos, Maximizing the spread of influence through a social network, SIGKDD, 2003. C. Jensen and W. Scacchi, Data mining for software process discovery in open source software development communities, Workshop on Mining Software Repositories, 2004.
Process I: Data Mining Algorithm Application Feature Selection Relevant data Data Purging Database Data Preparation Raw data
Process I: Data Mining Data Preparation: data discovery (locating the information); data characterization (activity features: user categorization; network features); data assembly. Data Purging: treatment of data inconsistency (unifying date representations by loading into a single repository); treatment of data pollution (removing inactive projects). Feature Selection: used to remove dependent or insignificant features; NMF (Non-negative Matrix Factorization).
Result I Process I: Data Mining Significant features By feature selection, we can identify the significant feature set describing the projects. Activity features: file_releases, followup_msg, support_assigned, feature_assigned and task related features Network features: degrees, betweenness and closeness
Process I: Data Mining Distribution-based clustering (Christley, 2005) Clustering according to the distribution of features instead of the values of individual features. We assume every entity (project) has an underlying distribution over the feature set (activity features). A non-parametric statistical hypothesis test is used: Fisher's contingency-table test. Joachim Krauth, Distribution-free statistics: an application-oriented approach, Elsevier Science Publishers, 1988.
Process I: Data Mining Procedure:
  While (still unclustered entities):
    Put all unclustered entities into one cluster
    While (some entities not yet pairwise compared):
      A = pick entity from cluster
      For each other entity B in cluster not yet compared to A:
        Run statistical test on A and B
        If significant result: remove B from cluster
  Worst-case complexity: O(n^2)
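The procedure above can be sketched in a few lines of Python. This is an illustrative sketch, not the dissertation's implementation: it substitutes a chi-square contingency test with hard-coded 0.05 critical values (covering up to 15 activity categories) for the Fisher contingency-table test, and each pass compares a single representative against the rest.

```python
# Chi-square critical values at alpha = 0.05, indexed by degrees of freedom,
# standing in for the p-value computation of the contingency-table test.
CRIT95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592,
          7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307, 11: 19.675,
          12: 21.026, 13: 22.362, 14: 23.685}

def same_distribution(a, b):
    """Chi-square test on the 2 x k contingency table built from two
    activity-count vectors (assumes both have nonzero totals)."""
    cols = [(x, y) for x, y in zip(a, b) if x + y > 0]  # drop empty categories
    n = sum(x + y for x, y in cols)
    ra = sum(x for x, _ in cols)
    rb = n - ra
    stat = 0.0
    for x, y in cols:
        ea = ra * (x + y) / n        # expected counts under independence
        eb = rb * (x + y) / n
        stat += (x - ea) ** 2 / ea + (y - eb) ** 2 / eb
    return stat < CRIT95[len(cols) - 1]   # non-significant: same cluster

def distribution_cluster(entities):
    """One greedy pass per cluster: keep entities matching the first
    unclustered one, defer the rest (worst case O(n^2) tests)."""
    unclustered, clusters = list(entities), []
    while unclustered:
        seed = unclustered[0]
        kept, deferred = [seed], []
        for entity in unclustered[1:]:
            if same_distribution(seed[1], entity[1]):
                kept.append(entity)
            else:
                deferred.append(entity)
        clusters.append(kept)
        unclustered = deferred
    return clusters
```

With three toy projects whose activity counts are (10,10,10,10), (12,9,11,10) and (100,1,1,1), the first two land in one cluster and the unbalanced third in its own.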
Process I: Data Mining Result II Unsupervised learning The distribution-based method is used to cluster the project history using the activity distribution. Clusters are named by ID; the results are shown in the table. High support and confidence in evaluation.

Cluster ID | 1     | 2    | 3    | Total
Size       | 89709 | 9191 | 2060 | 100960
Process I: Data Mining Two sample distributions from different categories: an unbalanced feature distribution tends to be unpopular, while a balanced feature distribution tends to be popular. [Bar charts: activity counts over activity categories 1-15 for Cluster 1 and Cluster 3.]
Process I: Data Mining Discoveries in Process I Significant feature set selection Network features are important Further inspection in next process Distribution based predictor Based on the activity feature distribution Prediction of the popularity based on the balance of the activity feature distribution Benefit of these discoveries For collaboration based communities, these discoveries can help in resource allocation optimization.
Process II: Network Analysis Why network analysis Assess the importance of the network measures to the whole network and to individual entities in the network; inspect the developing patterns of these network measures. Network analysis: structure analysis, centrality analysis, path analysis.
Process II: Network Analysis Related research: P. Erdős and A. Rényi, On random graphs, Publicationes Mathematicae, 1959. D.J. Watts and S.H. Strogatz, Collective dynamics of small-world networks, Nature, 1998. A.L. Barabási and R. Albert, Emergence of scaling in random networks, Science, 1999. Y. Gao, Topology and evolution of the open source software community, Master's Thesis, 2003.
Process II: Network Analysis Structure Analysis Understanding the influence of the network structure on individual entities in the network. Inspected measures: approximate diameter

    D = 1 + log(N / z1) / log(z2 / z1)

(where z1 and z2 are the mean numbers of first and second neighbors), approximate clustering coefficient for a one-mode projection of a bipartite graph (Newman, Strogatz and Watts)

    C = 1 / (1 + (mu2 - mu1)(nu2 - nu1)^2 / (mu1 nu1 (2 nu1 - 3 nu2 + nu3)))

(where mu_k and nu_k are moments of the two bipartite degree distributions), and the component distribution.
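The two approximations above can be illustrated numerically. All input values below are made up for illustration, not SourceForge measurements, and the clustering formula follows the Newman-Strogatz-Watts form reconstructed above.

```python
import math

def approx_diameter(n, z1, z2):
    """Random-graph estimate D ~= 1 + log(N/z1) / log(z2/z1), where z1 and
    z2 are the mean numbers of first and second neighbors."""
    return 1 + math.log(n / z1) / math.log(z2 / z1)

def approx_bipartite_clustering(mu1, mu2, nu1, nu2, nu3):
    """Clustering estimate for a one-mode projection of a bipartite graph
    (Newman-Strogatz-Watts form); mu_k, nu_k are moments of the two
    degree distributions."""
    return 1.0 / (1.0 + (mu2 - mu1) * (nu2 - nu1) ** 2
                  / (mu1 * nu1 * (2 * nu1 - 3 * nu2 + nu3)))

# Made-up inputs: 150,000 nodes with 12 first and 300 second neighbors;
# every developer on exactly 3 projects (mu) and every project with
# exactly 4 members (nu), so the moments are powers of 3 and 4.
d = approx_diameter(150_000, 12, 300)
c = approx_bipartite_clustering(3, 9, 4, 16, 64)
```

For these inputs D comes out just under 4 hops, consistent with the small diameters reported for D-NET and P-NET.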
Process II: Network Analysis Conversion among C-NET, P-NET and D-NET
Process II: Network Analysis Result I Approximate Diameters D-NET: between (5,7) while network size ranged from 151,803 to 195,744. P-NET: between (6,8) while network size ranged from 123,192 to 161,798. Approximate Clustering Coefficient D-NET: between (0.85, 0.95) P-NET: between (0.65, 0.75)
Process II: Network Analysis Result I
Process II: Network Analysis Centrality Analysis Understanding the importance of individual entities to the global network structure. Inspected measures: average degrees, degree distributions, betweenness

    B(v) = sum over s != v != t in V of sigma_st(v) / sigma_st

and closeness

    C(v) = 1 / sum over t in V of d_G(v, t)
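Closeness as defined above can be computed with a breadth-first search; a minimal sketch on a hypothetical five-node path graph (betweenness would additionally require counting shortest paths, e.g. with Brandes' algorithm, and is omitted here):

```python
from collections import deque

def closeness(adj, v):
    """C(v) = 1 / sum of shortest-path distances from v to every other
    reachable node (BFS on an unweighted graph)."""
    dist, queue = {v: 0}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())      # dist[v] = 0 contributes nothing
    return 1.0 / total if total else 0.0

# Hypothetical 5-node path graph a-b-c-d-e: the center c is closest to all.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
       "d": ["c", "e"], "e": ["d"]}
```

On the path graph, closeness("c") = 1/(2+1+1+2) = 1/6, larger than the endpoints' 1/10, matching the intuition that central nodes score higher.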
Process II: Network Analysis Result II Average Degrees Developer degree in C-NET: 1.4525 Project degree in C-NET: 1.7572 Developer degree in D-NET: 12.3100 Project degree in P-NET: 3.8059
Process II: Network Analysis Result II (Degree distributions in C-NET)
Process II: Network Analysis Result II (Degree distributions in D-NET and P-NET)
Process II: Network Analysis Result II Average Betweenness P-NET: 0.2669e-003. Average Closeness P-NET: 0.4143e-005. These two measures normally yield very small values in large networks (N > 10,000).
Process II: Network Analysis Path Analysis Understanding the developing patterns of the network structure and individual entities in the network Inspected measures: Active Developer Percentage Average Degrees Diameters Clustering coefficients Betweenness Closeness
Process II: Network Analysis Result III (Active entities)
Process II: Network Analysis Result III (Average degrees in C-NET)
Process II: Network Analysis Result III (Average degrees in D-NET and P-NET)
Process II: Network Analysis Result III (Diameters in D-NET and P-NET)
Process II: Network Analysis Result III (Clustering coefficients for D-NET and P-NET)
Process II: Network Analysis Result III (Average betweenness and closeness for P-NET)
Process II: Network Analysis

Measures                           | D-NET | P-NET | C-NET
Average Degree                     | Yes   | Yes   | Yes
Diameter                           | Yes   | Yes   | N/A
Clustering Coefficient             | Yes   | Yes   | N/A
Degree Distribution                | Yes   | Yes   | Yes
Component Distribution             | N/A   | Yes   | N/A
Major Component                    | N/A   | Yes   | N/A
Average Betweenness                | Yes   | Yes   | N/A
Average Closeness                  | Yes   | Yes   | N/A
Active Entity Size Development     | Yes   | Yes   | Yes
Average Degree Development         | Yes   | Yes   | Yes
Diameter Development               | Yes   | Yes   | N/A
Clustering Coefficient Development | Yes   | Yes   | N/A
Average Betweenness Development    | Yes   | Yes   | N/A
Average Closeness Development      | Yes   | Yes   | N/A
Process II: Network Analysis Discoveries in Process II: measures of structure analysis and centrality analysis all indicate very high connectivity of the network; measures of path analysis reveal the developing patterns of these measures (life-cycle behavior). Benefits of these discoveries: high connectivity is an important network property for information propagation and fault tolerance, and understanding it can help us improve our practices in collaboration and communication networks; understanding the developing patterns of these network measures provides a way to monitor network development and to improve the network if necessary.
Process III: Computer Simulation Related Research: P.J. Kiviat, Simulation, technology, and the decision process, ACM Transactions on Modeling and Computer Simulation, 1991. A.L. Barabási and R. Albert, Emergence of scaling in random networks, Science, 1999. R. Axtell, R. Axelrod, J.M. Epstein and M.D. Cohen, Aligning simulation models: A case study and results, Computational and Mathematical Organization Theory, 1996. Y. Gao, Topology and evolution of the open source software community, Master's Thesis, 2003.
Process III: Computer Simulation Iterative simulation method: characterization of the empirical dataset, model description, simulation generation, then verification and validation against empirical data; each iteration adjusts the model and adds more measures and methods.
Process III: Computer Simulation Previous iterated models (master thesis): Adapted ER Model BA Model BA Model with fitness BA Model with dynamic fitness Iterated models in this study Improved Model Four (Model I) Constant user energy (Model II) Dynamic user energy (Model III)
Process III: Computer Simulation Model I Realistic stochastic procedures: new developers arrive each time step according to a Poisson distribution; initial fitness is drawn from a log-normal distribution; an updated procedure maintains the weighted project pool (for preferential selection of projects).
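These stochastic procedures can be sketched as a toy simulation. The parameter values, the creation probability and the exact preferential weight (fitness times degree plus one) are illustrative assumptions, not the dissertation's calibrated Model I.

```python
import math
import random

def sample_poisson(lam):
    """Knuth's method: sample the number of new developers this step."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

def simulate(steps=100, lam=2.0, p_new=0.2, seed=0):
    """Each step, Poisson-many developers arrive; each either creates a
    project with a log-normal initial fitness or joins an existing project
    with probability proportional to fitness * (degree + 1)."""
    random.seed(seed)
    fitness, degree = [1.0], [1]            # one seed project
    for _ in range(steps):
        for _ in range(sample_poisson(lam)):
            if random.random() < p_new:     # create a new project
                fitness.append(random.lognormvariate(0.0, 1.0))
                degree.append(1)
            else:                           # preferential join
                w = [f * (d + 1) for f, d in zip(fitness, degree)]
                r, acc = random.uniform(0, sum(w)), 0.0
                for i, wi in enumerate(w):
                    acc += wi
                    if r <= acc:
                        degree[i] += 1
                        break
    return degree

degrees = simulate()
```

Fitness-weighted preferential attachment of this kind is what produces the skewed (power-law-like) project degree distributions discussed in the results.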
Process III: Computer Simulation Average degrees
Process III: Computer Simulation Diameter and CC
Process III: Computer Simulation Betweenness and Closeness
Process III: Computer Simulation Degree Distributions
Process III: Computer Simulation Deficit in the measures
Process III: Computer Simulation Model II New addition: user energy. User energy is the fitness parameter for the user: every time a new user is created, an energy level is randomly generated for the user, and that energy level decides whether the user takes an action during each time step.
Process III: Computer Simulation Degree distributions for Model II
Process III: Computer Simulation Deficit in the measures
Process III: Computer Simulation Model III New addition: dynamic user energy. Dynamic user energy decays with respect to time and is self-adjustable according to the roles the user takes in various projects.
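The two energy variants can be sketched together. The decay rate, energy range and role bonus below are illustrative assumptions, not the dissertation's parameters.

```python
import random

class User:
    """Sketch of the user-energy mechanism (all constants illustrative).

    Model II: energy is fixed at creation; the user acts in a time step
    with probability equal to its energy.
    Model III: energy additionally decays over time and is nudged by the
    roles the user holds (e.g. a bonus for project administrators)."""

    def __init__(self, decay=0.99):
        self.energy = random.uniform(0.1, 1.0)   # drawn once, at creation
        self.decay = decay

    def acts(self):
        # Model II behavior: act with probability equal to energy.
        return random.random() < self.energy

    def step(self, role_bonus=0.0):
        # Model III only: time decay plus role-dependent self-adjustment.
        self.energy = min(1.0, self.energy * self.decay + role_bonus)
```

Decaying energy lets long-idle users fade out of the active population, which is what pushes the simulated project degree distribution toward the small-tailed power law seen in the data.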
Process III: Computer Simulation Degree distributions (Model III)
Process III: Computer Simulation

Model I (more realistic distributions)
Measure                | Pattern in Data        | Simulated Pattern
Developer Distribution | Power Law (large tail) | Power Law (small tail)
Project Distribution   | Power Law (small tail) | Power Law (large tail)
Average Degrees        | Increasing             | Increasing
Clustering Coefficient | Decreasing             | Decreasing
Diameter               | Decreasing             | Decreasing
Average Betweenness    | Decreasing             | Decreasing
Average Closeness      | Decreasing             | Decreasing

Model II (constant user energy)
Measure                | Pattern in Data        | Simulated Pattern
Developer Distribution | Power Law (large tail) | Power Law (large tail)
Project Distribution   | Power Law (small tail) | Power Law (reasonable tail)
Average Degrees        | Increasing             | Increasing
Clustering Coefficient | Decreasing             | Decreasing
Diameter               | Decreasing             | Decreasing
Average Betweenness    | Decreasing             | Decreasing
Average Closeness      | Decreasing             | Decreasing

Model III (dynamic user energy)
Measure                | Pattern in Data        | Simulated Pattern
Developer Distribution | Power Law (large tail) | Power Law (large tail)
Project Distribution   | Power Law (small tail) | Power Law (small tail)
Average Degrees        | Increasing             | Increasing
Clustering Coefficient | Decreasing             | Decreasing
Diameter               | Decreasing             | Decreasing
Average Betweenness    | Decreasing             | Decreasing
Average Closeness      | Decreasing             | Decreasing
Process III: Computer Simulation Discoveries in Process III Expanding the network models for modeling evolving complex networks (more parameters) Providing a validated model to simulate the community network at SourceForge.net Benefits of these discoveries Expanded network models can benefit other researchers in complex networks. Validated model for SourceForge.net can be used to study other OSS communities or similar collaboration networks.
Process IV: Research Collaboratory Related Research: G. Chin Jr. and C. Lansing, The biological sciences collaboratory, Mathematics and Engineering Techniques in Medicine and Biological Sciences, 2004. L. Koukianakis, A system for hybrid learning and hybrid psychology, Cybernetics and Information Technologies, Systems and Applications, 2003. NCBI, FlyBase, Ensembl, VectorBase
Process IV: Research Collaboratory What is a Collaboratory? An elaborate collection of data, information, analytical toolkits and communication technologies; a new networked organizational form that also includes social processes, collaboration techniques and agreements on norms, principles, values and rules.
Process IV: Research Collaboratory
Process IV: Research Collaboratory Data tier - schema design. Every schema is a database dump from SourceForge.net, arranged along a timeline: SF0103, ..., SF0205, SF0305, SF0405, SF0505, SF0605, SF0705, SF0805.
Process IV: Research Collaboratory Data tier - connection pool. [Diagram: the logic tier issues connection requests to a connection assigner, which serves them from a pool of persistent links to the schemas along the timeline.]
Process IV: Research Collaboratory Presentation Tier Various access methods Documentation and references Community support Wiki interface
Process IV: Research Collaboratory Logic Tier Interactive web query system: authorized users can submit queries to the back-end repository through the web interface, and results are returned as files in various formats. Dynamic web schema browser: authorized users can access the dynamic schema of the repository through the schema browser.
Process IV: Research Collaboratory Utilization reports Monthly statistics (June 2006) Total queries submitted: 16,947 Total data files retrieved: 13,343 Total bytes of query data downloaded: 26,684,556,278 Programmable access method Programmable access method should be provided for complicated access Web services planned
Process IV: Research Collaboratory Results in Process IV Designing, implementing and maintaining a research collaboratory for OSS-related research. Benefits of these results: OSS researchers can access one of the most complete data sets on an OSS community's development; by providing community services to OSS researchers, the collaboratory can help in sparking, improving and promoting research ideas about OSS.
Contributions Designed and demonstrated a computational discovery methodology to study evolving complex networks, using research on OSS as a representative problem domain, and advanced understanding of the OSS movement by applying the methods. Process I: data mining - identifying significant features to describe a project; using distribution-based clustering to build a distribution-based predictor of a project's popularity. Process II: network analysis - introducing more complete analysis to inspect a more complete data set from SourceForge.net; discovering high connectivity and possible life-cycle behaviors in both the network structure and individuals in the network. Process III: computer simulation - introducing more parameters for modeling evolving complex networks; generating a fit model to replicate the evolution of the SourceForge.net community. Process IV: research collaboratory - designing, implementing and maintaining a research collaboratory to host the SourceForge.net data set and provide community support for OSS-related research.
Publications to-date Y. Gao, G. Madey and V. Freeh, Modeling and simulation of the open source software community, ADSC, San Diego, 2005. Y. Gao and G. Madey, Project development analysis of the OSS community using st mining, NAACSOS, Notre Dame, 2005. S. Christley, Y. Gao, J. Xu and G. Madey, Public goods theory of the open source software development community, Agent, Chicago, 2004. Y. Gao, Y. Huang and G. Madey, Data mining project history in open source software communities, NAACSOS, Pittsburgh, 2004. J. Xu, Y. Gao, J. Goett and G. Madey, A multi-model docking experiment of dynamic social network simulations, Agent, Chicago, 2003. Y. Gao, V. Freeh and G. Madey, Analysis and modeling of the open source software community, NAACSOS, Pittsburgh, 2003. Y. Gao, V. Freeh and G. Madey, Conceptual framework for agent-based modeling and simulation, NAACSOS, Pittsburgh, 2003. G. Madey, V. Freeh, R. Tynan and Y. Gao, Agent-based modeling and simulation of collaborative social networks, AMCIS, Tampa, 2003. Y. Gao, V. Freeh and G. Madey, Topology and evolution of the open source software community, SwarmFest, Notre Dame, 2003.
Publication Plan Chapter III (data mining) Journal of Machine Learning Research Journal of Systems and Software Chapter IV (network analysis) Journal of Network and Systems Management Journal of Social Structure Chapter V (computer simulation) Spring Simulation Conference 2007 (under review) IEEE Computing in Science and Engineering Chapter VI (research collaboratory) CITSA 2007 Journal of Computer Science and Applications
Conclusion and Future Work Cyclic computational discovery method for studying evolving complex networks; study of Open Source Software by applying this method. Future work: maintaining and expanding the collaboratory; verifying the discoveries against further accumulated database dumps from SourceForge.net; applying our simulation model to other software development communities; extending our methodology to other evolving complex networks such as the Internet, communication networks and various social networks.
Acknowledgement My advisor: Dr. Madey My committee members: Dr. Flynn Dr. Striegel Dr. Wood My Colleagues: Scott Christley, Yingping Huang, Tim Schoenharl, Matt Van Antwerp, Ryan Kennedy, Alec Pawling and Jin Xu SourceForge.net managers: Jeff Bates, VP of OSTG Inc. Jay Seirmarco, GM of SourceForge.net. US NSF CISE/IIS-Digital Society & Technology, under Grant No. 0222829.
Questions
Case Study II OSS Developer Network (Part). Developers are nodes / projects are links: 24 developers, 5 projects (6882, 7028, 7597, 9859, 15850), 2 hub developers, 1 cluster. [Network diagram: the developer-project edge list is omitted.]
Process I: Data Mining Characteristics of the data set: massive; incomplete, noisy, redundant; complex structures, unstructured. Classic analysis tools are often inadequate and inefficient for analyzing such data, especially in exploratory research. What is DM (data mining): the nontrivial extraction of implicit, previously unknown and potentially useful information from data.
Process I: Data Mining Feature Selection Given a non-negative n x m matrix V, find factors W (n x r) and H (r x m) such that V ~= W * H. This is called the non-negative matrix factorization (NMF) of the matrix V. NMF can be used on multivariate data to reduce the dimension of the data set: from m features to r features.
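A minimal NMF can be written with the Lee-Seung multiplicative update rules; the sketch below factors a small made-up project-by-feature matrix and is not the dissertation's implementation.

```python
import numpy as np

def nmf(V, r, iters=500, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates for V ~= W @ H with non-negative
    factors, reducing m features to r."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(iters):
        # Ratios are non-negative, so W and H stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Made-up 6-projects x 4-features matrix with an obvious rank-2 structure
# (columns 1 and 3 duplicate columns 0 and 2 patterns).
V = np.array([[4, 0, 4, 0],
              [2, 0, 2, 0],
              [0, 3, 0, 3],
              [0, 1, 0, 1],
              [4, 3, 4, 3],
              [2, 1, 2, 1]], dtype=float)
W, H = nmf(V, r=2)
```

Because V is exactly rank 2, the r=2 factorization reconstructs it almost perfectly; the rows of H play the role of the reduced feature set.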
Why NMF? Feature extraction methods: linear methods are simpler and more completely understood; nonlinear methods are more general but more difficult to analyze. Linear methods: ICA (Independent Component Analysis); matrix decompositions: PCA, SVD, NMF. In practice, NMF is popular and simple. Dimensionality reduction is effective if the loss of information due to mapping to a lower-dimensional space is less than the gain from simplifying the problem.
Process I: Data Mining Feature-based Clustering Grouping data into K clusters based on features. The distance metric used is Euclidean-like. Hierarchical K-Means is used; the result is a binary tree whose root is the whole data set and whose leaves are the fine-grained clusters, which are the resulting K clusters.
Process I: Data Mining Case Study Result II Unsupervised learning K-Means is used to cluster the project history using the selected features. Clusters are named by ID; the results are shown in the table. The result is not acceptable by evaluation.

Cluster ID | 1    | 2  | 3     | 4 | 5 | 6     | 7 | 8  | 9 | 10 | Total
Size       | 6201 | 98 | 64824 | 2 | 4 | 29724 | 4 | 10 | 9 | 84 | 100960
Process I: Data Mining User categorization. Records from the artifact, forum, people_job, project_task, doc_data and user_group tables are UNIONed into the user_project_act table. [Decision tree: Admin_flags? yes -> Administrator; else Grantcvs? yes -> Core developer; else Assigned? yes -> Co-developer; else Activities? yes -> Active user; else -> Lurker.]
Process I: Data Mining
Clustering Result Evaluation Evaluation test-set generation: popular/unpopular projects; stratified sampling to make 500 projects. Feature sets used: popularity feature set; activity feature set (Page 34, Table 3.2); network feature set (Page 35, Table 3.3). Generating rules for the test sets; calculating the support and confidence values.
Popularity Definition

Feature         | Description
Developers      | Number of core developers
Downloads       | Number of downloads
Site_views      | Number of views of the website
Subdomain_views | Number of views of the subdomain
Page_views      | Number of views of the pages
Why K-Means? The algorithm has remained extremely popular because it converges extremely quickly in practice; many have observed that the number of iterations is typically much less than the number of points. K-Means is more successful on large data sets (size > 1,000, dimension > 2) than GA and evolutionary approaches. CLIQUE is sensitive to noise; CURE is not scalable (O(n^2 log n)); CLARANS and BIRCH are not good for high-dimensional data. D. Arthur and S. Vassilvitskii, How slow is the k-means method?, Proceedings of the 2006 Symposium on Computational Geometry (SoCG), 2006.
K-Means It maximizes inter-cluster (equivalently, minimizes intra-cluster) variance, but does not ensure that the result is a global minimum of variance; multiple runs are needed. The elbow criterion is used to choose K.
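Multiple restarts and the elbow criterion can be sketched together; the synthetic data and all parameter values below are illustrative, not the dissertation's experiment.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns labels and total within-cluster
    squared distance (inertia); only a local optimum is reached."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(1), d.min(1).sum()

def best_of(X, k, runs=5):
    """Multiple restarts, keeping the lowest-inertia run."""
    return min((kmeans(X, k, seed=s) for s in range(runs)),
               key=lambda r: r[1])

# Elbow criterion: inertia drops sharply until the true k, then flattens.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, (20, 2)) for c in (0.0, 5.0, 10.0)])
inertias = [best_of(X, k)[1] for k in (1, 2, 3, 4)]
```

With three well-separated synthetic clusters, the inertia curve bends at k = 3: the drop from 2 to 3 clusters dwarfs the drop from 3 to 4, which is the elbow one reads off the plot.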
Distribution Categories

Category | Feature
1        | File release
2        | New message
3        | Followup message
4        | Artifact request
5        | Todo request
6        | Support request
7        | Feature request
8        | Patch request
9        | Bug reports
10       | Bug assigned
11       | Patch assigned
12       | Feature assigned
13       | Support assigned
14       | Todo assigned
15       | Artifact assigned
Process III: Computer Simulation Simulation model procedure: new users enter the user list each time step; each user acts on the weighted project pool (idle, drop, create or join), updating the user-project links and the project list; the loop repeats until the end of the simulation.
Process III: Computer Simulation Poisson Process: expresses the probability of a number of events occurring in a fixed period of time, given that these events occur with a known average rate and independently of the time since the last event. PMF:

    F(k; lambda) = lambda^k e^(-lambda) / k!
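The reconstructed PMF can be checked numerically, e.g. that it sums to 1 over k:

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) = lambda**k * exp(-lambda) / k! for the Poisson distribution."""
    return lam ** k * math.exp(-lam) / math.factorial(k)
```

For lambda = 3, summing poisson_pmf(k, 3.0) over k = 0..59 recovers 1 to within floating-point error, and poisson_pmf(0, 1.0) equals exp(-1) as the formula requires.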
Process III: Computer Simulation Log-normal distribution:
Process III: Computer Simulation Kolmogorov-Smirnov test Used to determine whether two underlying one-dimensional distributions differ. The two one-sided K-S test statistics are

    D+_n = max_x (F_n(x) - F(x))
    D-_n = max_x (F(x) - F_n(x))
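The two-sided, two-sample statistic D = max over x of |F_n(x) - F_m(x)|, built from the empirical CDFs of two samples, can be computed directly; a self-contained sketch:

```python
import bisect

def ks_statistic(xs, ys):
    """Two-sample K-S statistic: maximum absolute difference between the
    two empirical CDFs, evaluated at every observed point."""
    xs, ys = sorted(xs), sorted(ys)

    def ecdf(sample, x):
        # Fraction of the sample <= x (right-continuous empirical CDF).
        return bisect.bisect_right(sample, x) / len(sample)

    return max(abs(ecdf(xs, p) - ecdf(ys, p)) for p in xs + ys)
```

Identical samples give D = 0 and fully separated samples give D = 1, the two extremes of the statistic.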
Process III: Computer Simulation
Similar Publications Chapter III (data mining) JMLR: G. Hamerly, E. Perelman, Using machine learning to guide simulation (Feb. 2006). JSS: S. Kim, J. Yoon, Shape-based retrieval in time-series databases (Feb. 2006). Chapter IV (network analysis) JNSM: Special Issue on Self-Managing Systems and Networks. JoSS: The Journal of Social Structure (JoSS) is an electronic journal of the International Network for Social Network Analysis (INSNA). Chapter V (computer simulation) SSC 2007: simulation co IEEE/CSE: E. Luijten, Fluid simulation with Monte Carlo algorithm (2006, Vol. 8, Issue 2). Chapter VI (research collaboratory) CITSA 2007: L. Koukianakis, A system for hybrid learning and hybrid psychology (2005). JCSA: S. Chen, K. Wen, An integrated system for cancer-related genes mining from biomedical literatures (2006).