Computational Discovery in Evolving Complex Networks

1 Computational Discovery in Evolving Complex Networks. Yongqin Gao; Advisor: Greg Madey. Dissertation Defense, December 2006

2 Outline: Background; Methodology for Computational Discovery; Problem Domain (OSS Research); Process I: Data Mining; Process II: Network Analysis; Process III: Computer Simulation; Process IV: Research Collaboratory; Contributions; Conclusion and Future Work

3 Background. Network research is gaining more attention: the Internet, communication networks, social networks, software developer networks, biological networks. Understanding evolving complex networks: Goal I, search; Goal II, prediction. Computational scientific discovery.

4 Computational Discovery: Our Methodology. [Figure: methodology diagram connecting the Researcher, Data Mining, Network Analysis, Computer Simulation, the Research Collaboratory and Community Members; edge labels: Initialization, Discovery, Assessment, Feedback, Revision, Contribution, Reference]

5 Problem Domain: Open Source Software Movement. What is OSS? Free to use, modify and distribute, with source code available and modifiable. Potential advantages over commercial software: potentially high quality; fast development; low cost. Why study OSS (goal)? Software engineering: new development and coordination methods. Open content: a model for other forms of open, shared collaboration. Complexity: a successful example of self-organization/emergence.

6 Glory of OSS Number of Active Apache Hosts

7 Problem Domain: SourceForge.net community. Among the biggest OSS development communities: 134,751 registered projects and 1,439,773 registered users.

8 Problem Domain: Our Data Set. 25 monthly dumps since January. 460 GB in total, growing at 25 GB/month. Every dump has about 100 tables; the largest table has up to 30 million records. Experiment environment: dual Xeon 3.06 GHz, 4 GB memory, 2 TB storage; Linux ELsmp with PostgreSQL 8.1.

9 Related Research. OSS research: W. Scacchi, Free/open source software development practices in the computer game community, IEEE Software; K. Crowston, H. Annabi and J. Howison, Defining open source software project success, 24th International Conference on Information Systems, Seattle. Complex networks: L.A. Adamic and B.A. Huberman, Scaling behavior of the world wide web, Science; M.E.J. Newman, Clustering and preferential attachment in growing networks, Physical Review E, 2001.

10 Process I: Data Mining. Related research: S. Chawla, B. Arunasalam and J. Davis, Mining open source software (OSS) data using association rules network, PAKDD; D. Kempe, J. Kleinberg and E. Tardos, Maximizing the spread of influence through a social network, SIGKDD; C. Jensen and W. Scacchi, Data mining for software process discovery in open source software development communities, Workshop on Mining Software Repositories, 2004.

11 Process I: Data Mining. [Figure: data mining pipeline with stages Data Preparation, Data Purging, Feature Selection and Algorithm Application; data labels: Raw data, Database, Relevant data]

12 Process I: Data Mining. Data preparation: data discovery (locating the information); data characterization (activity features: user categorization; network features); data assembly. Data purging: treating data inconsistency by unifying the date presentation while loading into a single repository; treating data pollution by removing inactive projects. Feature selection: used to remove dependent or insignificant features, here with NMF (Non-negative Matrix Factorization).

13 Process I: Data Mining, Result I: Significant features. By feature selection, we can identify the significant feature set describing the projects. Activity features: file_releases, followup_msg, support_assigned, feature_assigned and task-related features. Network features: degrees, betweenness and closeness.

14 Process I: Data Mining. Distribution-based clustering (Christley, 2005): clustering according to the distribution of features instead of the values of individual features. We assume every entity (project) has an underlying distribution over the feature set (activity features). A statistical hypothesis test is used; the test is non-parametric (Fisher's contingency-table test). Reference: Joachim Krauth, Distribution-free statistics: an application-oriented approach, Elsevier Science Publisher, 1988.

15 Process I: Data Mining. Procedure:
    While (still unclustered entities)
        Put all unclustered entities into one cluster
        While (some entities not yet pairwise compared)
            A = Pick entity from cluster
            For each other entity, B, in cluster not yet compared to A
                Run statistical test on A and B
                If significant result
                    Remove B from cluster
Worst case complexity: O(n^2)
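The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `distribution_clusters` and `toy_differs` are hypothetical, and `toy_differs` is a simple stand-in (total variation distance against a threshold) for the statistical test, not the Fisher contingency-table test named on the previous slide.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Counts = Sequence[float]


def distribution_clusters(
    entities: Dict[Hashable, Counts],
    differs: Callable[[Counts, Counts], bool],
) -> List[List[Hashable]]:
    """Greedy clustering: members of a cluster are pairwise not significantly different."""
    unclustered = list(entities)
    clusters: List[List[Hashable]] = []
    while unclustered:
        cluster = unclustered      # put all unclustered entities into one cluster
        unclustered = []
        i = 0
        while i < len(cluster):    # entities before index i are fully compared
            a = cluster[i]
            kept = cluster[: i + 1]
            for b in cluster[i + 1:]:
                if differs(entities[a], entities[b]):
                    unclustered.append(b)   # significant result: B leaves the cluster
                else:
                    kept.append(b)
            cluster = kept
            i += 1
        clusters.append(cluster)
    return clusters


def toy_differs(p: Counts, q: Counts, threshold: float = 0.25) -> bool:
    """Stand-in test (assumption): total variation distance of normalised counts."""
    sp, sq = sum(p), sum(q)
    tv = 0.5 * sum(abs(x / sp - y / sq) for x, y in zip(p, q))
    return tv > threshold


# Hypothetical activity counts per project (three activity categories).
projects = {
    "p1": [10, 10, 10],
    "p2": [9, 11, 10],
    "p3": [28, 1, 1],
    "p4": [27, 2, 1],
}
clusters = distribution_clusters(projects, toy_differs)
print(clusters)
```

Each pass compares every surviving pair once, which matches the quadratic worst case stated on the slide; a real run would replace `toy_differs` with a proper hypothesis test at a chosen significance level.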

16 Process I: Data Mining, Result II. Unsupervised learning: the distribution-based method was used to cluster the project history using the activity distribution. We named the clusters using IDs, and the results are shown in the table. High support and confidence in evaluation. [Table: Cluster ID, Total Size]

17 Process I: Data Mining. Two sample distributions from different categories: a project with an unbalanced feature distribution could be unpopular; a project with a balanced feature distribution could be popular. [Figure: two panels, each labelled Cluster and Activity Category]

18 Process I: Data Mining. Discoveries in Process I: significant feature set selection (network features are important; further inspection in the next process); a distribution-based predictor, built on the activity feature distribution, which predicts popularity from the balance of that distribution. Benefit of these discoveries: for collaboration-based communities, they can help optimize resource allocation.

19 Process II: Network Analysis. Why network analysis? To assess the importance of the network measures to the whole network and to individual entities in the network, and to inspect the developing patterns of these measures. Network analysis comprises structure analysis, centrality analysis and path analysis.

20 Process II: Network Analysis. Related research: P. Erdős and A. Rényi, On random graphs, Publicationes Mathematicae; D.J. Watts and S.H. Strogatz, Collective dynamics of small-world networks, Nature; R. Albert and A.L. Barabási, Emergence of scaling in random networks, Science; Y. Gao, Topology and evolution of the open source software community, Master Thesis, 2003.

21 Process II: Network Analysis, Structure Analysis. Understanding the influence of the network structure on individual entities in the network. Inspected measures: approximate diameter, $D = \dfrac{\log(N/z_1)}{\log(z_2/z_1)}$; approximate clustering coefficient, $C = \left[1 + \dfrac{(\mu_2 - \mu_1)(\nu_2 - \nu_1)^2}{\mu_1 \nu_1 (2\nu_1 - 3\nu_2 + \nu_3)}\right]^{-1}$; component distribution.
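As a small worked example of the approximate-diameter measure, the sketch below evaluates the expression read here as D = log(N/z1) / log(z2/z1). It assumes that z1 and z2 stand for the average numbers of first and second neighbours of a node, which the slide does not spell out; the function name `approximate_diameter` and the sample numbers are hypothetical.

```python
import math


def approximate_diameter(n: int, z1: float, z2: float) -> float:
    """Evaluate D = log(N / z1) / log(z2 / z1).

    n  -- number of nodes in the network
    z1 -- average number of first neighbours (assumed meaning)
    z2 -- average number of second neighbours (assumed meaning)
    """
    if n <= 0 or z1 <= 0 or z2 <= z1:
        raise ValueError("need n > 0, z1 > 0 and z2 > z1")
    return math.log(n / z1) / math.log(z2 / z1)


# Hypothetical network: 1,000 nodes, 10 first and 100 second neighbours on average.
d = approximate_diameter(1000, 10.0, 100.0)
print(round(d, 3))
```

The estimate grows only logarithmically with network size, which is why networks with well over one hundred thousand nodes can still have single-digit diameters.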

22 Process II: Network Analysis. Conversion among C-NET, P-NET and D-NET

23 Process II: Network Analysis Result I Approximate Diameters D-NET: between (5,7) while network size ranged from 151,803 to 195,744. P-NET: between (6,8) while network size ranged from 123,192 to 161,798. Approximate Clustering Coefficient D-NET: between (0.85, 0.95) P-NET: between (0.65, 0.75)

24 Process II: Network Analysis Result I

25 Process II: Network Analysis, Centrality Analysis. Understanding the importance of individual entities to the global network structure. Inspected measures: average degrees; degree distributions; betweenness, $B(v) = \sum_{s \neq v \neq t \in V} \dfrac{\sigma_{st}(v)}{\sigma_{st}}$; closeness, $C(v) = \dfrac{1}{\sum_{t \in V} d_G(v, t)}$.
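The two centrality measures can be illustrated with a brute-force sketch on a tiny graph. It assumes an undirected, unweighted graph stored as an adjacency dictionary, sums over ordered pairs (s, t), and for closeness sums only over reachable nodes; the helper names are hypothetical, and a network of the size studied here would need a faster algorithm than this.

```python
from collections import deque
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, List[Hashable]]


def bfs_counts(graph: Graph, source: Hashable) -> Tuple[dict, dict]:
    """Return (distance, number of shortest paths) from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:               # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:      # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def closeness(graph: Graph, v: Hashable) -> float:
    """C(v) = 1 / sum of shortest-path distances from v to reachable nodes."""
    dist, _ = bfs_counts(graph, v)
    total = sum(d for t, d in dist.items() if t != v)
    return 1.0 / total if total else 0.0


def betweenness(graph: Graph, v: Hashable) -> float:
    """B(v) = sum over ordered pairs (s, t), s != v != t, of sigma_st(v) / sigma_st."""
    info = {s: bfs_counts(graph, s) for s in graph}
    dist_v, sigma_v = info[v]
    score = 0.0
    for s in graph:
        if s == v:
            continue
        dist_s, sigma_s = info[s]
        if v not in dist_s:
            continue
        for t in graph:
            if t in (s, v) or t not in dist_s or t not in dist_v:
                continue
            if dist_s[v] + dist_v[t] == dist_s[t]:   # v lies on a shortest s-t path
                score += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


# Hypothetical four-node path graph a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(closeness(path, "b"), betweenness(path, "b"))
```

Because the sum runs over ordered pairs, each unordered pair contributes twice; halving the score gives the convention used by many graph libraries for undirected graphs.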

26 Process II: Network Analysis Result II Average Degrees Developer degree in C-NET: Project degree in C-NET: Developer degree in D-NET: Project degree in P-NET:

27 Process II: Network Analysis Result II (Degree distributions in C-NET)

28 Process II: Network Analysis Result II (Degree distributions in D-NET and P-NET)

29 Process II: Network Analysis, Result II. Average betweenness, P-NET: e-003. Average closeness, P-NET: e-005. These two measures normally yield very small values in large networks (N > 10,000).

30 Process II: Network Analysis Path Analysis Understanding the developing patterns of the network structure and individual entities in the network Inspected measures: Active Developer Percentage Average Degrees Diameters Clustering coefficients Betweenness Closeness

31 Process II: Network Analysis Result III (Active entities)

32 Process II: Network Analysis Result III (Average degrees in C-NET)

33 Process II: Network Analysis Result III (Average degrees in D-NET and P-NET)

34 Process II: Network Analysis Result III (Diameters in D-NET and P-NET)

35 Process II: Network Analysis Result III (Clustering coefficients for D-NET and P-NET)

36 Process II: Network Analysis Result III (Average betweenness and closeness for P-NET)

37 Process II: Network Analysis. Measures inspected for each network:
Measure                            | D-NET | P-NET | C-NET
Average Degree                     | Yes   | Yes   | Yes
Diameter                           | Yes   | Yes   | N/A
Clustering Coefficient             | Yes   | Yes   | N/A
Degree Distribution                | Yes   | Yes   | Yes
Component Distribution             | N/A   | Yes   | N/A
Major Component                    | N/A   | Yes   | N/A
Average Betweenness                | Yes   | Yes   | N/A
Average Closeness                  | Yes   | Yes   | N/A
Active Entity Size Development     | Yes   | Yes   | Yes
Average Degree Development         | Yes   | Yes   | Yes
Diameter Development               | Yes   | Yes   | N/A
Clustering Coefficient Development | Yes   | Yes   | N/A
Average Betweenness Development    | Yes   | Yes   | N/A
Average Closeness Development      | Yes   | Yes   | N/A

38 Process II: Network Analysis. Discoveries in Process II: the measures from structure analysis and centrality analysis all indicate very high connectivity of the network; the measures from path analysis reveal how these measures develop over time (life cycle behavior). Benefits of these discoveries: high connectivity is an important property for information propagation and failure tolerance, so understanding it can help improve practices in collaboration networks and communication networks; understanding the developing patterns of these network measures gives us a way to monitor network development and to improve the network if necessary.

39 Process III: Computer Simulation. Related research: P.J. Kiviat, Simulation, technology, and the decision process, ACM Transactions on Modeling and Computer Simulation, 1991; R. Albert and A.L. Barabási, Emergence of scaling in random networks, Science; J. Epstein, R. Axtell, R. Axelrod and M. Cohen, Aligning simulation models: A case study and results, Computational and Mathematical Organization Theory; Y. Gao, Topology and evolution of the open source software community, Master Thesis, 2003.

40 Process III: Computer Simulation. Iterative simulation method. [Figure: iterative cycle linking the empirical dataset, the model and the simulation; step labels: Empirical Data Collection, Characterization, Description, Model Generation, Model Adjustment, Simulation, Verification, Validation, More measures, More methods]

41 Process III: Computer Simulation. Previous iterated models (master thesis): adapted ER model; BA model; BA model with fitness; BA model with dynamic fitness. Iterated models in this study: improved Model Four (Model I); constant user energy (Model II); dynamic user energy (Model III).

42 Process III: Computer Simulation, Model I. Realistic stochastic procedures: new developers arrive at every time step according to a Poisson distribution; initial fitness follows a log-normal distribution. Updated procedure for the weighted project pool (for preferential selection of projects).
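The three stochastic ingredients named on this slide can be combined in a toy loop like the one below. It is a sketch, not the dissertation's model: the weighting rule (fitness times member count plus one), the `create_prob` parameter for founding new projects, and all default values are assumptions introduced for illustration.

```python
import math
import random
from typing import List, Tuple


def poisson_sample(rng: random.Random, lam: float) -> int:
    """Knuth's method: count uniform draws until their product drops below exp(-lam)."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def simulate_model_one(
    steps: int,
    arrival_rate: float = 2.0,   # mean new developers per time step (assumed value)
    mu: float = 0.0,             # log-normal parameters for project fitness (assumed)
    sigma: float = 1.0,
    create_prob: float = 0.1,    # chance a newcomer founds a project (assumption)
    seed: int = 0,
) -> Tuple[int, List[float], List[int]]:
    rng = random.Random(seed)
    fitness = [rng.lognormvariate(mu, sigma)]   # one initial project
    members = [0]
    developers = 0
    for _ in range(steps):
        for _ in range(poisson_sample(rng, arrival_rate)):
            developers += 1
            if rng.random() < create_prob:
                fitness.append(rng.lognormvariate(mu, sigma))
                members.append(1)               # the founder is the first member
            else:
                # Weighted project pool: fitter and larger projects attract more joins.
                weights = [f * (m + 1) for f, m in zip(fitness, members)]
                chosen = rng.choices(range(len(weights)), weights=weights, k=1)[0]
                members[chosen] += 1
    return developers, fitness, members


developers, fitness, members = simulate_model_one(steps=50)
print(developers, len(members), max(members))
```

Fixing the seed makes a run repeatable, which helps when simulated degree distributions are compared against the empirical ones.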

43 Process III: Computer Simulation Average degrees

44 Process III: Computer Simulation Diameter and CC

45 Process III: Computer Simulation Betweenness and Closeness

46 Process III: Computer Simulation Degree Distributions

47 Process III: Computer Simulation Deficit in the measures

48 Process III: Computer Simulation, Model II. New addition: user energy, the fitness parameter for the user. Every time a new user is created, an energy level is randomly generated for the user. The energy level is used to decide whether the user takes an action during each time step.

49 Process III: Computer Simulation Degree distributions for Model II

50 Process III: Computer Simulation Deficit in the measures

51 Process III: Computer Simulation Model III New addition: dynamic user energy. Dynamic user energy Decaying with respect to time Self-adjustable according to the roles the user is taking in various projects.

52 Process III: Computer Simulation Degree distributions (Model III)

53 Process III: Computer Simulation. Comparison of patterns in the data with simulated patterns:
Model                                    | Measure                | Patterns in Data       | Simulated Patterns
Model I (more realistic distributions)   | Developer Distribution | Power Law (large tail) | Power Law (small tail)
                                         | Project Distribution   | Power Law (small tail) | Power Law (large tail)
                                         | Average Degrees        | Increasing             | Increasing
                                         | Clustering Coefficient | Decreasing             | Decreasing
                                         | Diameter               | Decreasing             | Decreasing
                                         | Average Betweenness    | Decreasing             | Decreasing
                                         | Average Closeness      | Decreasing             | Decreasing
Model II (constant user energy)          | Developer Distribution | Power Law (large tail) | Power Law (large tail)
                                         | Project Distribution   | Power Law (small tail) | Power Law (reasonable tail)
                                         | Average Degrees        | Increasing             | Increasing
                                         | Clustering Coefficient | Decreasing             | Decreasing
                                         | Diameter               | Decreasing             | Decreasing
                                         | Average Betweenness    | Decreasing             | Decreasing
                                         | Average Closeness      | Decreasing             | Decreasing
Model III (dynamic user energy)          | Developer Distribution | Power Law (large tail) | Power Law (large tail)
                                         | Project Distribution   | Power Law (small tail) | Power Law (small tail)
                                         | Average Degrees        | Increasing             | Increasing
                                         | Clustering Coefficient | Decreasing             | Decreasing
                                         | Diameter               | Decreasing             | Decreasing
                                         | Average Betweenness    | Decreasing             | Decreasing
                                         | Average Closeness      | Decreasing             | Decreasing

54 Process III: Computer Simulation Discoveries in Process III Expanding the network models for modeling evolving complex networks (more parameters) Providing a validated model to simulate the community network at SourceForge.net Benefits of these discoveries Expanded network models can benefit other researchers in complex networks. Validated model for SourceForge.net can be used to study other OSS communities or similar collaboration networks.

55 Process IV: Research Collaboratory. Related research: G. Chin Jr. and C. Lansing, The biological sciences collaboratory, Mathematics and Engineering Techniques in Medicine and Biological Sciences; L. Koukianakis, A system for hybrid learning and hybrid psychology, Cybernetics and Information Technologies, Systems and Applications. Existing collaboratories: NCBI, FlyBase, Ensembl, VectorBase.

56 Process IV: Research Collaboratory. What is a collaboratory? An elaborate collection of data, information, analytical toolkits and communication technologies. A new networked organizational form that also includes social processes, collaboration techniques and agreements on norms, principles, values and rules.

57 Process IV: Research Collaboratory

58 Process IV: Research Collaboratory. Data tier, schema design: every schema is a database dump from SourceForge.net. [Figure: timeline of schemas SF0103, SF0205, SF0305, SF0405, SF0505, SF0605, SF0705, SF0805]

59 Process IV: Research Collaboratory. Data tier, connection pool. [Figure: the Logic Tier sends a Connection Request to the Connection Assigner, which hands out Persistent Links from the Connection Pool; a Timeline axis is shown]

60 Process IV: Research Collaboratory Presentation Tier Various access methods Documentation and references Community support Wiki interface

61 Process IV: Research Collaboratory, Logic Tier. Interactive web query system: authorized users can submit queries to the back-end repository through the web query interface; results are provided as files in various formats. Dynamic web schema browser: authorized users can access the dynamic schema of the repository through the schema browser.

62 Process IV: Research Collaboratory. Utilization reports, monthly statistics (June 2006): total queries submitted: 16,947; total data files retrieved: 13,343; total bytes of query data downloaded: 26,684,556,278. Programmable access method: should be provided for complicated access; web services are planned.

63 Process IV: Research Collaboratory. Results in Process IV: designing, implementing and maintaining a research collaboratory for OSS-related research. Benefits of these results: OSS researchers can access one of the most complete data sets on the development of an OSS community; by providing this community service to OSS researchers, the collaboratory can help in sparking, improving and promoting research ideas about OSS.

64 Contributions. Designed and demonstrated a computational discovery methodology to study evolving complex networks, using research on OSS as a representative problem domain. Understanding the OSS movement by applying the methods. Process I (data mining): identifying significant features to describe a project; using distribution-based clustering to generate a distribution-based predictor of project popularity. Process II (network analysis): introducing a more complete analysis of a more complete data set from SourceForge.net; discovering high connectivity and possible life cycle behaviors in both the network structure and individuals in the network. Process III (computer simulation): introducing more parameters in modeling evolving complex networks; generating a fit model to replicate the evolution of the SourceForge.net community. Process IV (research collaboratory): designing, implementing and maintaining a research collaboratory to host the SourceForge.net data set and provide community support for OSS-related research.

65 Publications to date: Y. Gao, G. Madey and V. Freeh, Modeling and simulation of the open source software community, ADSC, San Diego; Y. Gao and G. Madey, Project development analysis of the oss community using st mining, NAACSOS, Notre Dame; S. Christley, Y. Gao, J. Xu and G. Madey, Public goods theory of the open source software development community, Agent, Chicago; Y. Gao, Y. Huang and G. Madey, Data Mining Project History in Open Source Software Communities, NAACSOS, Pittsburgh; J. Xu, Y. Gao, J. Goett and G. Madey, A Multi-model Docking Experiment of Dynamic Social Network Simulations, Agent, Chicago; Y. Gao, V. Freeh and G. Madey, Analysis and Modeling of the Open Source Software Community, NAACSOS, Pittsburgh; Y. Gao, V. Freeh and G. Madey, Conceptual Framework for Agent-based Modeling and Simulation, NAACSOS, Pittsburgh; G. Madey, V. Freeh, R. Tynan and Y. Gao, Agent-based modeling and simulation of collaborative social networks, AMCIS, Tampa; Y. Gao, V. Freeh and G. Madey, Topology and evolution of the open source software community, SwarmFest, Notre Dame, 2003.

66 Publication Plan. Chapter III (data mining): Journal of Machine Learning Research; Journal of Systems and Software. Chapter IV (network analysis): Journal of Network and Systems Management; Journal of Social Structure. Chapter V (computer simulation): Spring Simulation Conference 2007 (under review); IEEE Computing in Science and Engineering. Chapter VI (research collaboratory): CITSA 2007; Journal of Computer Science and Applications.

67 Conclusion and Future Work. A cyclic computational discovery method for studying evolving complex networks. A study of Open Source Software by applying this method. Future work: maintaining and expanding the collaboratory; verifying the SourceForge.net discoveries against further accumulated database dumps from SourceForge.net; applying our simulation model to other software development communities; extending our methodology to other evolving complex networks such as the Internet, communication networks and various social networks.

68 Acknowledgement. My advisor: Dr. Madey. My committee members: Dr. Flynn, Dr. Striegel, Dr. Wood. My colleagues: Scott Christley, Yingping Huang, Tim Schoenharl, Matt Van Antwerp, Ryan Kennedy, Alec Pawling and Jin Xu. SourceForge.net managers: Jeff Bates, VP of OSTG Inc.; Jay Seirmarco, GM of SourceForge.net. US NSF CISE/IIS-Digital Society & Technology, under Grant No

69 Questions

70 Case Study II: OSS Developer Network (Part). Developers are nodes and projects are links: 24 developers, 5 projects, 2 hub developers, 1 cluster. [Figure: network diagram of developer nodes labelled dev[...] linked through Projects 6882, 7028, 7597 and 9859 and one further project]

71 Process I: Data Mining. Characteristics of the data set: massive; incomplete, noisy, redundant; complex structures, unstructured. Classic analysis tools are often inadequate and inefficient for analyzing these data, especially in exploratory research. What is DM (data mining)? Nontrivial extraction of implicit, previously unknown and potentially useful information from data.

72 Process I: Data Mining, Feature Selection. Given a non-negative n x m matrix V, find non-negative factors W (n x r) and H (r x m) such that $V \approx W H$. This is called the non-negative matrix factorization (NMF) of the matrix V. NMF can be used on multivariate data to reduce the dimension of the data set: by using NMF, we can reduce the dimension from m features to r features.
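To make the factorization concrete, here is a minimal pure-Python sketch using the multiplicative update rules of Lee and Seung for the squared-error objective. The slides do not say which NMF algorithm was used, so this choice, the function names and the small matrix `v` are all assumptions for illustration.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def frobenius_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v: Matrix, r: int, iterations: int = 200, seed: int = 0,
        eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Factor a non-negative n x m matrix V into W (n x r) and H (r x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(r)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        # H <- H * (W^T V) / (W^T W H), element-wise
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(r)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        # W <- W * (V H^T) / (W H H^T), element-wise
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)]
             for i in range(n)]
    return w, h


# Hypothetical data: 4 projects described by 3 features, reduced to r = 2.
v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [1.0, 1.0, 1.0]]
w, h = nmf(v, r=2)
err = frobenius_error(v, w, h)
print(round(err, 4))
```

Each row of `w` is the reduced, r-dimensional description of one project, while each row of `h` shows how strongly the original features load on one of the r new features.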

73 Why NMF? Feature extraction methods: linear methods are simpler and more completely understood; nonlinear methods are more general and more difficult to analyze. Linear methods include ICA (Independent Component Analysis) and matrix decompositions (PCA, SVD, NMF). In practice, NMF is popular and simple. Dimensionality reduction is effective if the loss of information due to mapping to a lower-dimensional space is less than the gain due to simplifying the problem.

74 Process I: Data Mining, Feature-based Clustering. Grouping data into K clusters based on features. The distance metric used is Euclidean distance. Hierarchical K-Means is used, so the result is a binary tree: the root is the whole data set, and the leaves are the fine-grained clusters, which are the resulting K clusters.

75 Process I: Data Mining, Case Study Result II. Unsupervised learning: the K-Means method was used to cluster the project history using the features we selected. We named the clusters using IDs, and the results are shown in the table. The result was not acceptable in evaluation. [Table: Cluster ID, Total Size]

76 Process I: Data Mining. [Figure: user categorization flowchart. Source tables (artifact, Forum, People_job, Project_task, Doc_data, User_group and other tables) are combined by UNION into the User_project_act table; Yes/No decisions on Admin_flags, Grantcvs, Assigned and Activities assign each user to one of: Administrator, Core developer, Co-developer, Active user, Lurker]

77 Process I: Data Mining

78 Clustering Result Evaluation. Evaluation test set generation: popular/unpopular projects; stratified sampling to select 500 projects. Feature sets used: popular feature set; activity feature set (Page 34, Table 3.2); network feature set (Page 35, Table 3.3). Generating rules for the test sets. Calculating the support and confidence values.
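Support and confidence for a rule of the form "antecedent implies consequent" can be computed as in the sketch below. The record layout, the rule ("balanced distribution implies popular") and the four sample projects are hypothetical and serve only to show the arithmetic.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Predicate = Callable[[Record], bool]


def support_confidence(
    records: List[Record], antecedent: Predicate, consequent: Predicate
) -> Tuple[float, float]:
    """Support = P(antecedent and consequent); confidence = P(consequent | antecedent)."""
    n_antecedent = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical test set: cluster label and popularity flag per project.
test_set = [
    {"cluster": "balanced", "popular": True},
    {"cluster": "balanced", "popular": True},
    {"cluster": "balanced", "popular": False},
    {"cluster": "unbalanced", "popular": False},
]
support, confidence = support_confidence(
    test_set,
    antecedent=lambda r: r["cluster"] == "balanced",
    consequent=lambda r: r["popular"] is True,
)
print(support, confidence)
```

Here two of the four projects satisfy both sides of the rule, and two of the three balanced projects are popular.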

79 Popularity Definition
Feature         | Description
Developers      | Number of core developers
Downloads       | Number of downloads
Site_views      | Number of views of the website
Subdomain_views | Number of views of the subdomain
Page_views      | Number of views of the pages

80 Why K-Means? The algorithm has remained extremely popular because it converges extremely quickly in practice; many have observed that the number of iterations is typically much less than the number of points. On large data sets (size > 1000, dimension > 2), K-Means is more successful than GA and evolutionary approaches. CLIQUE is sensitive to noise. CURE is not scalable: O(n^2 log n). CLARANS and BIRCH are not good for high-dimensional data. Reference: D. Arthur and S. Vassilvitskii (2006), "How Slow is the k-means Method?", Proceedings of the 2006 Symposium on Computational Geometry (SoCG).

81 K-Means. It minimizes intra-cluster variance (equivalently, maximizes inter-cluster variance), but does not guarantee that the result is a global minimum of variance, so multiple runs are needed. Elbow criterion.

82 Distribution Categories (table columns: Category, Feature). Features: File release; New message; Followup message; Artifact request; Todo request; Support request; Feature request; Patch request; Bug reports; Bug assigned; Patch assigned; Feature assigned; Support assigned; Todo assigned; Artifact assigned.

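83 Process III: Computer Simulation. Simulation model procedure. [Figure: flowchart from Start to Stop. New Users are added to the User List; each User Action is one of Idle, Drop, Create or Join; actions update the User_Project Links, the Project List and, through Project Pool Update, the Weighted Project Pool; the loop repeats until the decision "End of Simu?" is Yes]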

84 Process III: Computer Simulation. Poisson process: it expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and are independent of the time since the last event. Probability mass function: $F(k; \lambda) = \dfrac{\lambda^k e^{-\lambda}}{k!}$
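A direct evaluation of this probability, as a small sketch: the function name `poisson_pmf` and the rate of three arrivals per step are illustrative assumptions, not values from the slides.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average rate per period is lam."""
    if k < 0 or lam <= 0:
        raise ValueError("need k >= 0 and lam > 0")
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical rate: on average 3 new developers per time step.
rate = 3.0
probabilities = [poisson_pmf(k, rate) for k in range(60)]
print(round(probabilities[0], 4), round(sum(probabilities), 6))
```

Summing the probabilities over a long enough range of k comes out very close to one, which is a quick sanity check on the formula.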

85 Process III: Computer Simulation Log-normal distribution:

86 Process III: Computer Simulation. Kolmogorov-Smirnov test: used to determine whether two underlying one-dimensional distributions differ. The two one-sided K-S test statistics are given by $D_n^{+} = \max_x \left( F_n(x) - F(x) \right)$ and $D_n^{-} = \max_x \left( F(x) - F_n(x) \right)$.
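The two one-sided statistics can be computed from a sample and a reference distribution as sketched below. The sketch assumes that F_n is the empirical distribution function of the sample and F is a reference cumulative distribution function supplied by the caller; the function name and the three-point sample are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def ks_one_sided(
    sample: Sequence[float], cdf: Callable[[float], float]
) -> Tuple[float, float]:
    """Return (D+, D-) comparing the sample's empirical CDF F_n with a reference CDF F."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    # F_n jumps from i/n to (i+1)/n at the i-th sorted point (0-based),
    # so the extreme differences occur at the sample points themselves.
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return d_plus, d_minus


def uniform_cdf(x: float) -> float:
    """Reference distribution for the example: uniform on [0, 1]."""
    return min(1.0, max(0.0, x))


d_plus, d_minus = ks_one_sided([0.1, 0.4, 0.7], uniform_cdf)
print(round(d_plus, 4), round(d_minus, 4))
```

The usual two-sided statistic is the larger of the two values; comparing two empirical samples instead of a sample and a reference curve follows the same idea with a second empirical distribution function in place of F.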

87 Process III: Computer Simulation

88 Similar Publications. Chapter III (data mining): JMLR: G. Hamerly, E. Perelman, ..., Using machine learning to guide simulation (Feb. 2006); JSS: S. Kim, J. Yoon, ..., Shape-based retrieval in time-series database (Feb. 2006). Chapter IV (network analysis): JNSM: Special Issue on Self-Managing Systems and Networks; JoSS: The Journal of Social Structure (JoSS) is an electronic journal of the International Network for Social Network Analysis (INSNA). Chapter V (computer simulation): SSC 2007: simulation co; IEEE/CSE: E. Luijten, ..., Fluid simulation with monte carlo algorithm (2006, Vol. 8, Issue 2). Chapter VI (research collaboratory): CITSA 2007: L. Koukianakis, ..., A system for hybrid learning and hybrid psychology (2005); JCSA: S. Chen, K. Wen, ..., An Integrated System for Cancer-Related Genes Mining from Biomedical Literatures (2006).


More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

Virtual Landmarks for the Internet

Virtual Landmarks for the Internet Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network , pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and

More information

Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics

Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics Complex Network Visualization based on Voronoi Diagram and Smoothed-particle Hydrodynamics Zhao Wenbin 1, Zhao Zhengxu 2 1 School of Instrument Science and Engineering, Southeast University, Nanjing, Jiangsu

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Clustering UE 141 Spring 2013

Clustering UE 141 Spring 2013 Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

More information

A Review of Data Mining Techniques

A Review of Data Mining Techniques Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Using Networks to Visualize and Understand Participation on SourceForge.net

Using Networks to Visualize and Understand Participation on SourceForge.net Nathan Oostendorp; Mailbox #200 SI708 Networks Theory and Application Final Project Report Using Networks to Visualize and Understand Participation on SourceForge.net SourceForge.net is an online repository

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut. Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

Bisecting K-Means for Clustering Web Log data

Bisecting K-Means for Clustering Web Log data Bisecting K-Means for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining

More information

Complex Networks Analysis: Clustering Methods

Complex Networks Analysis: Clustering Methods Complex Networks Analysis: Clustering Methods Nikolai Nefedov Spring 2013 ISI ETH Zurich nefedov@isi.ee.ethz.ch 1 Outline Purpose to give an overview of modern graph-clustering methods and their applications

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

An Interest-Oriented Network Evolution Mechanism for Online Communities

An Interest-Oriented Network Evolution Mechanism for Online Communities An Interest-Oriented Network Evolution Mechanism for Online Communities Caihong Sun and Xiaoping Yang School of Information, Renmin University of China, Beijing 100872, P.R. China {chsun.vang> @ruc.edu.cn

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier Data Mining: Concepts and Techniques Jiawei Han Micheline Kamber Simon Fräser University К MORGAN KAUFMANN PUBLISHERS AN IMPRINT OF Elsevier Contents Foreword Preface xix vii Chapter I Introduction I I.

More information

IC05 Introduction on Networks &Visualization Nov. 2009. <mathieu.bastian@gmail.com>

IC05 Introduction on Networks &Visualization Nov. 2009. <mathieu.bastian@gmail.com> IC05 Introduction on Networks &Visualization Nov. 2009 Overview 1. Networks Introduction Networks across disciplines Properties Models 2. Visualization InfoVis Data exploration

More information

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of

More information

Supporting Knowledge Collaboration Using Social Networks in a Large-Scale Online Community of Software Development Projects

Supporting Knowledge Collaboration Using Social Networks in a Large-Scale Online Community of Software Development Projects Supporting Knowledge Collaboration Using Social Networks in a Large-Scale Online Community of Software Development Projects Masao Ohira Tetsuya Ohoka Takeshi Kakimoto Naoki Ohsugi Ken-ichi Matsumoto Graduate

More information

Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

More information

Operations Research and Knowledge Modeling in Data Mining

Operations Research and Knowledge Modeling in Data Mining Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Oman College of Management and Technology Course 803401 DSS Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization CS/MIS Department Information Sharing

More information

Standardization and Its Effects on K-Means Clustering Algorithm

Standardization and Its Effects on K-Means Clustering Algorithm Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

More information

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations Jurij Leskovec, CMU Jon Kleinberg, Cornell Christos Faloutsos, CMU 1 Introduction What can we do with graphs? What patterns

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Graph Mining Techniques for Social Media Analysis

Graph Mining Techniques for Social Media Analysis Graph Mining Techniques for Social Media Analysis Mary McGlohon Christos Faloutsos 1 1-1 What is graph mining? Extracting useful knowledge (patterns, outliers, etc.) from structured data that can be represented

More information

Rule based Classification of BSE Stock Data with Data Mining

Rule based Classification of BSE Stock Data with Data Mining International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification

More information

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: simmibagga12@gmail.com

More information

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

More information

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

How To Understand The Theory Of Probability

How To Understand The Theory Of Probability Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

Predictive Modeling Techniques in Insurance

Predictive Modeling Techniques in Insurance Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics

More information

Reinventing Business Intelligence through Big Data

Reinventing Business Intelligence through Big Data Reinventing Business Intelligence through Big Data Dr. Flavio Villanustre VP, Technology and lead of the Open Source HPCC Systems initiative LexisNexis Risk Solutions Reed Elsevier LEXISNEXIS From RISK

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

More information

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

More information

Keywords: Mobility Prediction, Location Prediction, Data Mining etc

Keywords: Mobility Prediction, Location Prediction, Data Mining etc Volume 4, Issue 4, April 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Data Mining Approach

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

A Hybrid Decision Tree Approach for Semiconductor. Manufacturing Data Mining and An Empirical Study

A Hybrid Decision Tree Approach for Semiconductor. Manufacturing Data Mining and An Empirical Study A Hybrid Decision Tree Approach for Semiconductor Manufacturing Data Mining and An Empirical Study 1 C. -F. Chien J. -C. Cheng Y. -S. Lin 1 Department of Industrial Engineering, National Tsing Hua University

More information

Application of Predictive Analytics for Better Alignment of Business and IT

Application of Predictive Analytics for Better Alignment of Business and IT Application of Predictive Analytics for Better Alignment of Business and IT Boris Zibitsker, PhD bzibitsker@beznext.com July 25, 2014 Big Data Summit - Riga, Latvia About the Presenter Boris Zibitsker

More information

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 sanjeevk@iasri.res.in 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

Clustering Data Streams

Clustering Data Streams Clustering Data Streams Mohamed Elasmar Prashant Thiruvengadachari Javier Salinas Martin gtg091e@mail.gatech.edu tprashant@gmail.com javisal1@gatech.edu Introduction: Data mining is the science of extracting

More information

Keywords : Data Warehouse, Data Warehouse Testing, Lifecycle based Testing

Keywords : Data Warehouse, Data Warehouse Testing, Lifecycle based Testing Volume 4, Issue 12, December 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Lifecycle

More information

Exploring Big Data in Social Networks

Exploring Big Data in Social Networks Exploring Big Data in Social Networks virgilio@dcc.ufmg.br (meira@dcc.ufmg.br) INWEB National Science and Technology Institute for Web Federal University of Minas Gerais - UFMG May 2013 Some thoughts about

More information

Building well-balanced CDN 1

Building well-balanced CDN 1 Proceedings of the Federated Conference on Computer Science and Information Systems pp. 679 683 ISBN 978-83-60810-51-4 Building well-balanced CDN 1 Piotr Stapp, Piotr Zgadzaj Warsaw University of Technology

More information

Performance Analysis of Book Recommendation System on Hadoop Platform

Performance Analysis of Book Recommendation System on Hadoop Platform Performance Analysis of Book Recommendation System on Hadoop Platform Sugandha Bhatia #1, Surbhi Sehgal #2, Seema Sharma #3 Department of Computer Science & Engineering, Amity School of Engineering & Technology,

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Clustering Methods in Data Mining with its Applications in High Education

Clustering Methods in Data Mining with its Applications in High Education 2012 International Conference on Education Technology and Computer (ICETC2012) IPCSIT vol.43 (2012) (2012) IACSIT Press, Singapore Clustering Methods in Data Mining with its Applications in High Education

More information

Bioinformatics: Network Analysis

Bioinformatics: Network Analysis Bioinformatics: Network Analysis Graph-theoretic Properties of Biological Networks COMP 572 (BIOS 572 / BIOE 564) - Fall 2013 Luay Nakhleh, Rice University 1 Outline Architectural features Motifs, modules,

More information

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 James Maltby, Ph.D 1 Outline of Presentation Semantic Graph Analytics Database Architectures In-memory Semantic Database Formulation

More information

GENERATING AN ASSORTATIVE NETWORK WITH A GIVEN DEGREE DISTRIBUTION

GENERATING AN ASSORTATIVE NETWORK WITH A GIVEN DEGREE DISTRIBUTION International Journal of Bifurcation and Chaos, Vol. 18, o. 11 (2008) 3495 3502 c World Scientific Publishing Company GEERATIG A ASSORTATIVE ETWORK WITH A GIVE DEGREE DISTRIBUTIO JI ZHOU, XIAOKE XU, JIE

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Eric Hsueh-Chan Lu Chi-Wei Huang Vincent S. Tseng Institute of Computer Science and Information Engineering

More information

Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu

Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu Building Data Cubes and Mining Them Jelena Jovanovic Email: jeljov@fon.bg.ac.yu KDD Process KDD is an overall process of discovering useful knowledge from data. Data mining is a particular step in the

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

Towards applying Data Mining Techniques for Talent Mangement

Towards applying Data Mining Techniques for Talent Mangement 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Towards applying Data Mining Techniques for Talent Mangement Hamidah Jantan 1,

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information