EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS
|
|
|
- Hilary Ryan
- 10 years ago
- Views:
Transcription
1 EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS Susan P. Imberman Ph.D. College of Staten Island, City University of New York Abstract - The KDD (Knowledge Discovery in bases) paradigm is a step by step process for finding interesting patterns in large amounts of data. mining is one step in the process. This paper defines the KDD process and discusses three data mining algorithms, neural networks, decision trees, and association/dependency rule algorithms. The algorithms' potential as good analytical tools for performance evaluation is shown by looking at results from a computer performance dataset. 1. Introduction Knowledge Discovery in bases (KDD) is the automated discovery of patterns and relationships in large databases. Large databases are not uncommon. Cheaper and larger computer storage capabilities have contributed to the proliferation of such databases in a wide range of fields. Scientific instruments can produce terabytes and petabytes of data at rates reaching gigabytes per hour. Point of sale information, government records, medical records and credit card data, are just a few other sources for this information explosion. Not only are there more large databases, but the databases themselves are getting larger. The number of fields in large databases can approach magnitudes of 10 2 to Record numbers in these databases approach magnitudes of [7] Computer performance data has also increased in size and dimensionality. It is much easier to store data than it is to make sense of it. Being able to find relationships in large amounts of stored data can lead to enhanced analysis strategies in fields such as marketing, computer performance analysis, and data analysis in general. The problem addressed by KDD is to find patterns in these massive datasets. Traditionally data has been analyzed manually, but there are human limits. Large databases offer too much data to analyze in the traditional manner. The focus of this paper is to first summarize exactly what the KDD process is. This paper attempts to identify parts of the process that are currently being practiced by performance analysts and those parts that are not practiced, but may be useful in analyzing performance data The paper is organized as follows: Section 2 defines the KDD process. Section 3 discusses three data mining algorithms that are not commonly used by performance analysts. The paper concludes with a discussion in section KDD defined KDD employs methods from various fields such as machine learning, artificial intelligence, pattern recognition, database management and design, statistics, expert systems, and data visualization. It is said to employ a broader model view than statistics and strives to automate the process of data analysis, including the art of hypothesis generation. KDD has been more formally defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. [7 ] The KDD Process is a highly iterative, user involved, multistep process, as can be seen in figure 1.
2 Selection Preprocessing Reduction Coding KDD Process Mining Visualization Target Clean Transformed Patterns Report Results Interpretation Organizational ITERATIVE Knowledge Figure 1 The KDD Process From figure 1 we see that initially, we have organizational data. This data is the operational data gathered either in one or several locations. All operational data is collected and brought to some central location. These central locations are sometimes called Warehouses or Marts. transformation may be performed on the raw data before it is placed in the data warehouse. transformation resolves inconsistencies between the data of one location from that of another. For example, inconsistencies may be differences in data type for the same information and different field names for the same data field. warehouses hold both detailed and summary data. Detailed data is used for pattern analysis, where summarized data may hold the results of previous analyses. warehouses also contain historical data whereas operational data is usually current. Efficient organization of the data warehouse is essential for efficient data mining. Once the data is organized, a selection process occurs where some subset of this data becomes the target data upon which further analysis is performed. It is important when creating this target data that the data analyst understand the domain, the end user s needs, and what the data mining task might be. Sometimes data is collected in an ad hoc manner. entry mistakes can occur and/or the data may have missing or unknown entries. During the data cleaning and preprocessing stage noise is removed from the data. Outliers and anomalies in the data can pose special problems for the data analyst during the data cleaning process. Since the goal is to find rare patterns in the data, the outliers and anomalies may be representations of these rare patterns. Care must be taken not to remove these types of outliers and anomalies. This step in the process can be the most time consuming. Reduction and Coding step employs transformation techniques that are used to reduce the number of variables in the data by finding useful features with which to represent the data. Mining constitutes one step in the KDD process. The transformed data is used in the data mining step. It is in this step that the actual search for patterns of interest is performed. The search for patterns is done within the context of the data mining task and the representational model under which the analysis is being performed. It is important at this stage to decide on the appropriate data mining algorithm (linear/logistic regression, neural networks, association rules, etc.) for the data mining task (classification, database segmentation, rule generation, etc). The data mining task itself can be a classification task, linear regression analysis, rule formation, or cluster analysis. Patterns generated in the data mining step may not be new or interesting. It is therefore necessary to remove redundant and irrelevant patterns from the set of useful patterns. Once a set of good patterns have been discovered, they then have to be reported to the end user. This can be done can be done textually, by
3 way of reports or using visualizations such as graphs, spreadsheets, diagrams, etc. The interpretation step takes the reported results and interprets this into knowledge. Interpretation may require that we resolve possible conflicts with previously discovered knowledge since new knowledge may even be in conflict with knowledge that was believed before the process began. When this is done to user's satisfaction, the knowledge is documented and reported to any interested parties. This again may involve visualization. It is important to stress that the KDD process is not linear. Results from one phase in the process may be fed back into that phase or into another phase. Current KDD systems have a highly interactive human component. Humans are involved with many if not each step in the KDD process. Hence, the KDD process is highly interactive and iterative. 3. Mining mining is one step in the KDD process. It is the most researched part of the process. mining algorithms find patterns in large amounts of data by fitting models that are not necessarily statistical models. This is done within computational limits such as time (answers should be found within a finite amount of time) and hardware. mining can be used to verify a hypothesis or use a discovery method to find new patterns in the data that can do predictions with new, unseen data. The term data mining is a somewhat recent one. In the past data mining has been known in the literature as knowledge extraction, information discovery, information harvesting, exploratory data analysis, data archeology, data pattern processing, and functional dependency analysis. Performance analysts employ many of the same methods that data miners use when analyzing data. These include statistical models such as linear and logarithmic regression, cluster analysis, graphical analysis, and visualizations. There are some algorithms employed by data miners that are not usually used by performance analysts. Decision tree algorithms, neural networks, and association/dependency rule algorithms can be useful analytical tools for the performance analyst. Rather than discuss techniques that are often used by performance analysts, the remainder of this paper will elaborate upon these three algorithms. A performance dataset will be used to describe these algorithms and their use in performance analysis. The dataset contains data collected over a period of several months. A summary dataset was created that aggregates the measurements collected every 15 minutes into daily summary points. Our simple dataset contains 6 metrics: Date, Total Number of transactions, CPU busy %, Disk busy %, Average Response Time, and Average Network Busy % The reader can observe from cursory data analysis (see the Figure 2 below) that the key relationships inherent in the data is as time moves forward, the number of transactions increases, and network busy percent decreases. Although this is a simplistic dataset in that the patterns in the dataset are quite obvious, because of this, the dataset is well suited for demonstration of the data mining techniques in the following sections. M e trics By Da te #trans cpu busy disk busy response network busy 10/1/93 10/15/93 10/29/93 11/12/93 11/26/93 12/10/93 12/24/93 1/7/94 1/21/94 2/4/94 2/18/94 Date Figure 2 - Relationships
4 3.1 Association/Dependency Rule Algorithms Dependency modeling uses methods that describe variable dependencies and associations. The results of these algorithm are what is known as association or dependency rules. [1] [2] [3] Association rule algorithms were developed to analyze market basket data. They identified groups of market items that customers tended to buy in association with each other. For example, customers tend to buy soap, shampoo, and hairspray at the same time. Because of their origins, much association rule terminology stems from this domain. Given a set of items I, (or a set of variables from each transaction or record) an association rule is a probabilistic implication X Y where X and Y are subsets of I and X Y = φ. Interesting association rules are defined by two metrics, support and confidence. Support is a measure of a rule's significance with respect to the database, and confidence is a measure or the rule's strength.[1] CPU busy Network Busy Response Time High Network Busy High Table 1 Sample set More formally, let N be the number of transactions (records) in a database. A set of items is said to satisfy a transaction if that transaction contains those items. For example, given N = 10 as in Table 1, if our item set were {CPU busy = 1, Network busy = 1}, then all records containing values that match the itemset for CPU busy and Network busy would satisfy that set. In Table 1, there are 5 transactions that satisfy the itemset {CPU busy = 1, Network busy = 1}. Given an association rule X Y, let S 1 be the number of transactions that contain or satisfy X. Let S 2 be the number of observations in T that contain or satisfy X Y. The support s of X Y is S 2 / N (or the s% of the transactions/records that contain X Y) and the confidence c of X Y is S 2 / S 1 (the c% of all transactions N that contain X which also contain Y). For example, if we were to look at Table 1, given the association rule CPU busy = 1 Network Busy = 1, then the support would be 5/10 or 50%. Since there are 8 transactions that contain CPU busy = 1, the confidence for this association rule would be 5/8 or 63%. Note if the association rule we were looking at was Network Busy = 1 CPU busy = 1, the support would still be 50% but the confidence would be 5/7 or 72% since there are seven transactions with Network Busy = 1. The support of a rule is the rule's statistical significance in the dataset. The confidence is a measure of the rule's strength relative to the dataset. A rule that has a support above a user defined threshold, minsup, and confidence above a user defined threshold minconf, is an interesting association rule. The problem of finding interesting association rules was first explored by Agrawal et. al. [1] [2] [3]. Agrawal et. al. decomposed the problem of finding association rules into two parts. The first focuses on finding large itemsets. A large itemset is a set of database items in that have support in the database greater than the user specified minimum (minsup). Once the set of large itemsets is found, the second part of the problem, uses these itemsets to generate rules that have confidence above a user defined minimum (minconf). Finding rules from large itemsets is straightforward. Finding the large itemsets themselves is an exponential problem, and hence computationally expensive to solve. Agrawal et. al. proposed the Apriori algorithm to find large itemsets and the association rules based on these large itemsets. [ 1] [ 2] [3]. In order to use Apriori with the computer performance data, we needed to "booleanize" the data. Association rule algorithms evaluate data that has either a zero or one value. This evolved from their origins which was the analysis of market basket data. A customer bought an item (value = 1) or didn't buy the item ( value = 0). Booleanizing numeric data involves finding a threshold, above which every data point becomes a one, and below which it becomes zero. Since mean has been shown to yield a good threshold, it was used to booleanize the computer performance data. [10] The results given by Apriori with minsup set to 50% and minconf set to 80% on the booleanized computer performance data yielded the following two rules: # transactions date (support = 51.0%, confidence = 98.7%) date # transactions (support = 50.3%, confidence = 100.0%) Note that association rule algorithms only find positive dependencies. In computer performance, negative dependencies are also significant. Silverstein, Brin, and Motwani proposed a modification of Apriori that finds negative dependencies. [14] Domanski, Imberman, and Orchard have proposed the Boolean Analyzer algorithm that also finds these negative dependencies. The resulting rules are ordered based on a probabilistic interestingness measure (PIM). [6] [10] [14] Results from Boolean Analyzer on the
5 booleanized computer performance data are shown in table2. Rule date = 0 # transactions = 0 date = 1 # transactions = 1 date = 0 network busy = 1 date = 1 network busy = 0 date = 0 # transactions = 0 && network busy = 1 date = 1 # transactions = 1 && network busy = 0 Interpretation As time decreases, Number of Transactions decreases As time increases, Number of Transactions increases As time decreases, network busy increases As time increases, network busy decreases As time decreases, Number of Transactions decreases and network busy increases As time increases, Number of Transactions increases and network busy decreases Table 2 - Results From Boolean Analyzer Here we not only see the positive dependencies but the negative dependency between the date and network busy. 3.2 Decision Trees The model representation for decision trees is a tree data structure. As in figure 3 each node in the tree represents an attribute. Node attribute labels are chosen by a function which can separate the attributes such that values within a node are most similar and between nodes are dissimilar. Different decision tree algorithms use different functions to label each node. The algorithm C4.5 uses information gain to decide which attribute labels a node.[11] CART uses the Gini index. [4] CHAID uses a chi square statistic to make this determination. The attribute that best classifies the training examples is the one that labels the node. Each attribute value points to a child node. Nodes become leaves when its highest measured attribute classifies all training examples at that node or all attributes have labeled nodes. In Figure 3, A i stands for an attribute or variable in the dataset, where v ij are the specific values the decision tree algorithm found as good splitting values for this attribute. The input to a decision tree is a set of training examples. Training examples are records from the dataset where one attribute, the target variable, is correctly classified. In Figure 3, training examples were classified with the values yes or no. The tree algorithm learns and generalizes this classification and therefore can classify unseen examples. One of the drawbacks of decision trees is that they can tend to overfit the data. Different algorithms employ different methods of pruning the trees to compensate for this. Each path in the tree represents a conjunction of attributes. The tree is a disjunction of each conjunctive path. There can be many decision trees that can describe a given training set. Usually decision tree algorithms generate multiple trees and recommend the "more accurate" tree as a classifier. Another method for "deciding" between trees is to use these multiple trees as a "committee of experts" for classification. A2 v11 Decision Trees A1 A3 YES v12 v21 v22 v41 v42 v43 C4.5 requires that the target variable be categorical. To use the performance data with C4.5, the target variable, number of transactions, was booleanized. A value of 1 meant a high number of transactions, and 0 described a low number of transactions. For the performance dataset, C4.5 yielded the following results: v13 A4 A5 A3 A2 A6 A7 YES YES Figure 3 - Decision Trees v ij = value j of Attribute i A i = Attribute i
6 Class specified by attribute `num transactions' Read 152 cases (6 attributes) from 430DATA2.data Decision tree: network busy <= 45: 1 (77) network busy > 45: 0 (75) Results from the Salford Systems' version of CART on the performance data with a booleanized target variable (number of transactions) is shown in figure 4. Figure 4 - CART Decision Tree CART chose date as the splitting attribute as opposed to the network busy attribute chosen by C4.5. Actually both attributes are almost identical in their ability to classify the data with only hundredths of a percent difference between the two. The choice differences displayed by the two methods can be attributed to the different functions used by each algorithm to distinguish between attributes. CART can also evaluate data using a numeric target variable, thus creating a regression tree. The regression tree for the performance data using the original numeric values for number of transactions is shown below. Figure 5 - CART Regression Tree
7 N eural Networks inputs outputs w1x1 w0 x0 hidden units w2x2 wnxn o = n i = 0 w i x i σ Figure 4 - Neural Networks Each path in the tree can be interpreted into a rule. For example the far left (first) path yielded the rule: if (DATE <= ) { terminalnode = -1; mean = ; //average number of transactions } The second path yielded the rule: if ( DATE > && DATE <= ) { terminalnode = -2; mean = 505.4; //average number of transactions } Decision trees are good classifiers with easily interpreted results. 3.3 Neural Networks Neural Networks are a graphical model based on the human nervous system. [11] It consists of a system of connected nodes arranged in multiple layers. In Figure 4 we see that there is an input layer, a hidden layer (there can be several hidden layers), and an output layer. Inputs to the neural network are the same types of training examples that decision trees take as their input. Each node is associated with an input value and a weight. The weights are multiplied by their respective input values and summed. The result of some function of this sum is the output value from the node that is forwarded to the next layer of nodes. [5] [11]. The outputs of a neural network is a set of weights that are the coefficients in a linear regression equation. One method for determining the network weights is back propagation. Back propagation looks at the error between the networks' output value and the observed value and then adjusts the weights on each node so that this error is reduced. This is repeated for each example fed into the network and repeated multiple times for the entire dataset. Thus it is a highly iterative process. This process is called "training" the neural network. The neural network itself represents a learned function that can be used to predict on new examples. The representational model of neural networks is difficult to interpret, as opposed to decision trees since we don t always known what is happening in hidden nodes. Neural nets result in a linear regression equation rather than a set of rules. Although neural networks can be slow to train, once trained and implemented, neural networks can classify inputs very quickly. Nnmodel, a neural net program, was used to evaluate the performance data. [12] For this analysis, 20 randomly selected records were withheld as a test set. The neural net was trained on the remaining 132 records. This is a common practice done when training classifiers. The withheld data is used to benchmark the classifiers ability to predict on unseen data. Figure 5 shows the results of the neural network's predications as compared to the actual values on the withheld test set. Notice that the two are very similar, thus showing that the neural networks did a good job of predicting the number of transactions.
8 Measured and Predicted (num transactions) N = 20 num transactions per day Figure 5 Neural Network Results vs. Observed Results 4. Discussion The KDD process is a good paradigm for data analysis. One may wonder what the differences between traditional statistical analyses and the KDD process are. Hand [ 8 ] discusses these differences. He maintains that methods employed by KDD use statistical approaches that have the same rigor one usually employs in statistics, but many KDD methods tend to be more experimental and less conservative than those used in statistics. Thus they are more suited for a "discovery approach" to data analysis. Of the data mining techniques discussed in this paper, there are situations that favor one over the other. In addition each method has their drawbacks. Neural networks and decision trees are similar in that they are both classifiers. The classifier model is built and is then used to predict the target class for future unseen data. The advantage that decision trees have over neural networks is that the classification rules are directly readable from the model. Many data analysts don't like the "black box" model built by neural networks. This is because the result of a neural network model is a set of linear regression equations, thus making the reasons for classification difficult to interpret. Decision tree models are faster to build. It can take a long time to train a neural net. Notwithstanding, once built, the execution of a neural net model can be faster than that of a decision tree. In terms of accuracy, studies have revealed that one method is not necessarily more accurate on the other. Accuracy tends to depend on the actual data involved. Association/Dependency rule algorithms show concurrence of variables. They tend to generate many more rules than what is seen in decision trees. The advantage is that these algorithms give us the ability to find rare and less obvious patterns in the data. There have been attempts to use the rules generated by association/dependency rule algorithms for classification but using association/dependency rules in this manner has met with mixed results. It is important that these data mining algorithms not be used in an ad hoc manner. This has come to be known as data dredging. The fear is that doing so, one might discover patterns with no meaning. In fact at one time KDD in the statistics community was considered a dirty word. It has been shown that if one looks hard enough in a sufficiently large database, even a randomly generated one, statistically significant patterns can be found. We also have to be careful as to how we interpret the patterns presented to us by data mining algorithms. One classic data mining legend centers around a major retailer finding the relationship, Men who buy diapers, buy beer!! The retailer, might assume that harried dads, going to the store to pick up some diapers, were picking up something for themselves as well! Notwithstanding, the KDD process, and specifically the data mining step in the process, offer
9 the performance analyst additional approaches for data analysis. 5. Conclusions In this paper, a very simplistic computer performance dataset was used to demonstrate three data mining algorithms. The reasons for using such a simplistic set was to give the reader a "taste" for the methods discussed. Usually these algorithms are used on highly dimensional, larger, complex datasets. It was my hope that in demonstrating the functional abilities of association/dependency rules, neural networks, and decision trees on a simple set, the reader would be encouraged to try these methods on their own data. There are many resources available that implement these algorithms. A nice implementation of Apriori was coded by Christian Borgelt and can be found at: Along with a discussion of the algorithm, there is also downloadable code. Borgelt's version of Apriori was implemented in SPSS's data mining suite, Clementine. Another good site is: This site contains Weka 3 which is a suite of machine learning and data mining algorithms written in Java with a decent GUI. is a site that has many links to software, both commercial and free, along with other interesting links relating to the field of data mining. For those interested in reading more on the topic of data mining, [7] is a collection of papers that give a nice overview of the field. [11] is a good text on machine learning. [6] Gives a good introduction to Neural Networks. A recent text by Jiawei Han and Micheline Kamber titled, Mining Concepts and Techniques, Academic Press 2001, gives a good introduction to the field in general. Acknowledgements: I would like to thank Bernie Domanski for his help on this paper and keeping me focused on the needs of the computer performance professional. 6. References: 1. Agrawal, R., T. Imielinsk, and A. Swami. "Mining Association Rules between Sets of Items in Large bases." Proceedings of the ACM SIGMOD International Conference on the Management of, , Agrawal, R. Mannila, H., Srikant, R. Toivonen, H. and Verkamo, A. I., Fast Discovery Of Association Rules. In Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramaswamy Uthurusamy, editors, Advances in Knowledge Discovery and Mining, pages AAAI/MIT Press, Agrawal, R. and Srikant, R. Fast Algorithms For Mining Association Rules. In Proceedings of 20 th Intl. Conf. on Very Large bases (VLDB'94), pages , Santiago, Chile, Breiman, L., Friedman, J. H., Olshen, R. A., Stone, Charles J. Classification and Regresion Trees, Chapman and Hall/CRC, Domanski, B., A Neural Net Primer, Journal of Computer Resource Management, Issue 100, Fall Domanski, B., Discovering the Relationships Between Metrics. The Proceedings of the 1996 Computer Measurement Group. December 1996, San Diego California, Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. From data mining to knowledge discovery: An overview. In Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramaswamy Uthurusamy, editors, Advances in Knowledge Discovery and Mining, pages AAAI/MIT Press, Hand D. J., Statistics and Mining: Interesting Disciplines, Association for Computers and Machinery SIGKDD Explorations, June 1999 Vol. 1 Issue1, pgs Imberman, S Comparative Statistical Analyses Of Automated Booleanization Methods For Mining Programs (Doctoral dissertation, City University of New York, 1999). UMI Microform,
10 10. Imberman, S.P., Domanski, B., Orchard, R., Using Booleanized To Discover Better Relationships Between Metrics., CMG99 Proceedings, Mitchell, T. M., Machine Learning, WCB McGraw-Hill Nnmodel, Orchard, R. A. On the Determination of Relationships Between Computer System State Variables. Bell Laboratories Technical Memorandum, January 15, Silverstein, C., Sergey Brin, and Rajeev Motwani. Beyond Market Baskets: Generalizing Association Rules to Dependence Rules. Mining and Knowledge Discovery, 2(1):39-68, 1998
Mining an Online Auctions Data Warehouse
Proceedings of MASPLAS'02 The Mid-Atlantic Student Workshop on Programming Languages and Systems Pace University, April 19, 2002 Mining an Online Auctions Data Warehouse David Ulmer Under the guidance
A Review of Data Mining Techniques
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
Dynamic Data in terms of Data Mining Streams
International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining
Static Data Mining Algorithm with Progressive Approach for Mining Knowledge
Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive
Data Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over
131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing
Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition
Information Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET
DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand
Introduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI
Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: [email protected] Data Mining a step in A KDD Process Data mining:
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
Healthcare Measurement Analysis Using Data mining Techniques
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
Database Marketing, Business Intelligence and Knowledge Discovery
Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski
Customer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
Introduction. A. Bellaachia Page: 1
Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data
Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction
Data Mining and Exploration Data Mining and Exploration: Introduction Amos Storkey, School of Informatics January 10, 2006 http://www.inf.ed.ac.uk/teaching/courses/dme/ Course Introduction Welcome Administration
DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.
DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,
The KDD Process for Extracting Useful Knowledge from Volumes of Data
Knowledge Discovery in bases creates the context for developing the tools needed to control the flood of data facing organizations that depend on ever-growing databases of business, manufacturing, scientific,
ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL
International Journal Of Advanced Technology In Engineering And Science Www.Ijates.Com Volume No 03, Special Issue No. 01, February 2015 ISSN (Online): 2348 7550 ASSOCIATION RULE MINING ON WEB LOGS FOR
DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE
DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE SK MD OBAIDULLAH Department of Computer Science & Engineering, Aliah University, Saltlake, Sector-V, Kol-900091, West Bengal, India [email protected]
APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION ANALYSIS. email [email protected]
Eighth International IBPSA Conference Eindhoven, Netherlands August -4, 2003 APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION Christoph Morbitzer, Paul Strachan 2 and
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
How To Use Data Mining For Loyalty Based Management
Data Mining for Loyalty Based Management Petra Hunziker, Andreas Maier, Alex Nippe, Markus Tresch, Douglas Weers, Peter Zemp Credit Suisse P.O. Box 100, CH - 8070 Zurich, Switzerland [email protected],
DATA MINING AND WAREHOUSING CONCEPTS
CHAPTER 1 DATA MINING AND WAREHOUSING CONCEPTS 1.1 INTRODUCTION The past couple of decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India [email protected]
NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE
www.arpapress.com/volumes/vol13issue3/ijrras_13_3_18.pdf NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE Hebah H. O. Nasereddin Middle East University, P.O. Box: 144378, Code 11814, Amman-Jordan
Why do statisticians "hate" us?
Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Comparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques Chapter 1 Introduction SURESH BABU M ASST PROF IT DEPT VJIT 1 Chapter 1. Introduction Motivation: Why data mining? What is data mining? Data Mining: On what kind of
Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing
www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University
A New Approach for Evaluation of Data Mining Techniques
181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty
A Data Mining Tutorial
A Data Mining Tutorial Presented at the Second IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 98) 14 December 1998 Graham Williams, Markus Hegland and Stephen
A Case Study in Knowledge Acquisition for Insurance Risk Assessment using a KDD Methodology
A Case Study in Knowledge Acquisition for Insurance Risk Assessment using a KDD Methodology Graham J. Williams and Zhexue Huang CSIRO Division of Information Technology GPO Box 664 Canberra ACT 2601 Australia
A Survey on Association Rule Mining in Market Basket Analysis
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 4, Number 4 (2014), pp. 409-414 International Research Publications House http://www. irphouse.com /ijict.htm A Survey
Mining Online GIS for Crime Rate and Models based on Frequent Pattern Analysis
, 23-25 October, 2013, San Francisco, USA Mining Online GIS for Crime Rate and Models based on Frequent Pattern Analysis John David Elijah Sandig, Ruby Mae Somoba, Ma. Beth Concepcion and Bobby D. Gerardo,
COURSE RECOMMENDER SYSTEM IN E-LEARNING
International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand
Principles of Dat Da a t Mining Pham Tho Hoan [email protected] [email protected]. n
Principles of Data Mining Pham Tho Hoan [email protected] References [1] David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT press, 2002 [2] Jiawei Han and Micheline Kamber,
Weather forecast prediction: a Data Mining application
Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,[email protected],8407974457 Abstract
Mining Association Rules: A Database Perspective
IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 69 Mining Association Rules: A Database Perspective Dr. Abdallah Alashqur Faculty of Information Technology
Pentaho Data Mining Last Modified on January 22, 2007
Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org
Knowledge Discovery Process and Data Mining - Final remarks
Knowledge Discovery Process and Data Mining - Final remarks Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 14 SE Master Course 2008/2009
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
How To Use Data Mining For Knowledge Management In Technology Enhanced Learning
Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 115 Data Mining for Knowledge Management in Technology Enhanced Learning
Data Mining System, Functionalities and Applications: A Radical Review
Data Mining System, Functionalities and Applications: A Radical Review Dr. Poonam Chaudhary System Programmer, Kurukshetra University, Kurukshetra Abstract: Data Mining is the process of locating potentially
ECLT 5810 E-Commerce Data Mining Techniques - Introduction. Prof. Wai Lam
ECLT 5810 E-Commerce Data Mining Techniques - Introduction Prof. Wai Lam Data Opportunities Business infrastructure have improved the ability to collect data Virtually every aspect of business is now open
Selection of Optimal Discount of Retail Assortments with Data Mining Approach
Available online at www.interscience.in Selection of Optimal Discount of Retail Assortments with Data Mining Approach Padmalatha Eddla, Ravinder Reddy, Mamatha Computer Science Department,CBIT, Gandipet,Hyderabad,A.P,India.
PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE
International Journal of Computer Science and Applications, Vol. 5, No. 4, pp 57-69, 2008 Technomathematics Research Foundation PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE
Building an Iris Plant Data Classifier Using Neural Network Associative Classification
Building an Iris Plant Data Classifier Using Neural Network Associative Classification Ms.Prachitee Shekhawat 1, Prof. Sheetal S. Dhande 2 1,2 Sipna s College of Engineering and Technology, Amravati, Maharashtra,
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.
Title Introduction to Data Mining Dr Arulsivanathan Naidoo Statistics South Africa OECD Conference Cape Town 8-10 December 2010 1 Outline Introduction Statistics vs Knowledge Discovery Predictive Modeling
Classification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
Data Mining + Business Intelligence. Integration, Design and Implementation
Data Mining + Business Intelligence Integration, Design and Implementation ABOUT ME Vijay Kotu Data, Business, Technology, Statistics BUSINESS INTELLIGENCE - Result Making data accessible Wider distribution
not possible or was possible at a high cost for collecting the data.
Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day
Data Mining to Recognize Fail Parts in Manufacturing Process
122 ECTI TRANSACTIONS ON ELECTRICAL ENG., ELECTRONICS, AND COMMUNICATIONS VOL.7, NO.2 August 2009 Data Mining to Recognize Fail Parts in Manufacturing Process Wanida Kanarkard 1, Danaipong Chetchotsak
Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining
Lluis Belanche + Alfredo Vellido Intelligent Data Analysis and Data Mining a.k.a. Data Mining II Office 319, Omega, BCN EET, office 107, TR 2, Terrassa [email protected] skype, gtalk: avellido Tels.:
International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 Over viewing issues of data mining with highlights of data warehousing Rushabh H. Baldaniya, Prof H.J.Baldaniya,
Index Contents Page No. Introduction . Data Mining & Knowledge Discovery
Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.
Data Mining Algorithms And Medical Sciences
Data Mining Algorithms And Medical Sciences Irshad Ullah [email protected] Abstract Extensive amounts of data stored in medical databases require the development of dedicated tools for accessing
Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms
Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms Y.Y. Yao, Y. Zhao, R.B. Maguire Department of Computer Science, University of Regina Regina,
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept
Statistics 215b 11/20/03 D.R. Brillinger Data mining A field in search of a definition a vague concept D. Hand, H. Mannila and P. Smyth (2001). Principles of Data Mining. MIT Press, Cambridge. Some definitions/descriptions
A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS
A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS Mrs. Jyoti Nawade 1, Dr. Balaji D 2, Mr. Pravin Nawade 3 1 Lecturer, JSPM S Bhivrabai Sawant Polytechnic, Pune (India) 2 Assistant
Advanced analytics at your hands
2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously
Data Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
Financial Trading System using Combination of Textual and Numerical Data
Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Association rules for improving website effectiveness: case analysis
Association rules for improving website effectiveness: case analysis Maja Dimitrijević, The Higher Technical School of Professional Studies, Novi Sad, Serbia, [email protected] Tanja Krunić, The
Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad Email: [email protected]
96 Business Intelligence Journal January PREDICTION OF CHURN BEHAVIOR OF BANK CUSTOMERS USING DATA MINING TOOLS Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad
Association Rule Mining: A Survey
Association Rule Mining: A Survey Qiankun Zhao Nanyang Technological University, Singapore and Sourav S. Bhowmick Nanyang Technological University, Singapore 1. DATA MINING OVERVIEW Data mining [Chen et
Chapter ML:XI. XI. Cluster Analysis
Chapter ML:XI XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster
Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization
Oman College of Management and Technology Course 803401 DSS Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization CS/MIS Department Information Sharing
FREQUENT PATTERN MINING FOR EFFICIENT LIBRARY MANAGEMENT
FREQUENT PATTERN MINING FOR EFFICIENT LIBRARY MANAGEMENT ANURADHA.T Assoc.prof, [email protected] SRI SAI KRISHNA.A [email protected] SATYATEJ.K [email protected] NAGA ANIL KUMAR.G
Chapter 5. Warehousing, Data Acquisition, Data. Visualization
Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization 5-1 Learning Objectives
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI
Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI Yudho Giri Sucahyo, Ph.D, CISA ([email protected]) Faculty of Computer Science, University of Indonesia Objectives
Sanjeev Kumar. contribute
RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a
Foundations of Artificial Intelligence. Introduction to Data Mining
Foundations of Artificial Intelligence Introduction to Data Mining Objectives Data Mining Introduce a range of data mining techniques used in AI systems including : Neural networks Decision trees Present
Rhodes University COMPUTER SCIENCE HONOURS PROJECT Literature Review. Data Mining with Oracle 10g using Clustering and Classification Algorithms
Rhodes University COMPUTER SCIENCE HONOURS PROJECT Literature Review Data Mining with Oracle 10g using Clustering and Classification Algorithms By: Nhamo Mdzingwa Supervisor: John Ebden Date: 30 May 2005
Data Mining and KDD: A Shifting Mosaic. Joseph M. Firestone, Ph.D. White Paper No. Two. March 12, 1997
1 of 11 5/24/02 3:50 PM Data Mining and KDD: A Shifting Mosaic By Joseph M. Firestone, Ph.D. White Paper No. Two March 12, 1997 The Idea of Data Mining Data Mining is an idea based on a simple analogy.
Three Perspectives of Data Mining
Three Perspectives of Data Mining Zhi-Hua Zhou * National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China Abstract This paper reviews three recent books on data mining
