Data Mining and Knowledge Discovery
Generating knowledge from data
White Paper

Organizations collect a vast amount of data in the process of carrying out their day-to-day business. The opening up of information systems through the World Wide Web has helped organizations amass huge amounts of data - ranging from interactions to transactions - that previously could not be collected at all, or only at great cost. Organizations have perfected various techniques to churn this vast amount of data into reports that provide facts and figures. This has created an "information overload" rather than helping these organizations glean any knowledge from the data.

Data mining is the process of discovering and extracting patterns from data. Data mining applies a set of algorithms for the pattern extraction. When these patterns are analyzed with the help of prior knowledge and proper interpretation, the process is called Knowledge Discovery. The field of Knowledge Discovery in Databases (KDD) deals with data mining and the interpretation of data mining results to create knowledge from databases. In this paper we discuss the KDD process, data mining algorithms, and the benefits that practicing KDD brings to businesses.

Data mining involves determining patterns from, or fitting models to, observed data. A typical data mining system may perform one or more of the following tasks:

Association: Association is the discovery of correlations between a set of items. The output is often expressed in the form of a rule showing attribute-value conditions that occur frequently together. This type of analysis is widely used in analyzing data for direct marketing campaigns, sales catalog design and many other business decision-making processes.

Example - An association model might discover that, of all electronics customers under study, the 20-29 age group (10% of the set) with an income of 40-50K buys DVD players with 80% probability.

Classification: Classification analyzes a set of training data and constructs a model for each class based on the features of the data. A decision tree or a set of classification rules is generated, which can then be used to better understand each class and to classify future data. Rich classification methods are inherited from the fields of machine learning, neural networks, statistics and others. Classification has been quite useful in customer segmentation and credit analysis.

Example - A customer can be evaluated as a good or bad risk depending upon the income range, the number of years in the current job and the amount of debt he or she is carrying.
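To make the classification task concrete, the following is a minimal sketch of how such a credit-risk classifier could be trained, assuming scikit-learn as the toolkit; the feature values, class labels and the new applicant are invented purely for illustration and are not drawn from any actual engagement.

    # Classification sketch: score credit risk from income, years in the current
    # job and outstanding debt. The training data is invented for illustration.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Each row: [annual income (K), years in current job, outstanding debt (K)]
    X = [
        [30, 1, 25],
        [45, 3, 10],
        [60, 7, 5],
        [25, 0, 30],
        [80, 10, 15],
        [50, 4, 40],
    ]
    # Labels assigned by credit analysts: 1 = good risk, 0 = bad risk
    y = [0, 1, 1, 0, 1, 0]

    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(X, y)

    # The induced tree doubles as a set of human-readable classification rules.
    print(export_text(model, feature_names=["income", "years_in_job", "debt"]))

    # Classify a new applicant: income 55K, 5 years in the job, 12K of debt.
    print(model.predict([[55, 5, 12]]))  # [1] -> classified as a good risk

The same fitted model supports the prediction task described next: once the tree or rule set exists, the class label of any new data object follows directly from it.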
Prediction: Classification can be used to predict the class label of data objects. Prediction can also be used to estimate missing data values. The classification step produces the appropriate business rules or a decision tree from which the prediction can be made. Genetic algorithms, regression analysis and neural networks are the techniques commonly used for this purpose.

Example - A customer's potential expenditure using a credit card can be predicted based on the expenditure distribution of similar customers using that credit card.

Clustering: A cluster is a collection of objects that are similar to one another. Clustering analysis refers to identifying the clusters embedded in the data. A good clustering method produces high-quality clusters, meaning that intra-cluster similarity is high and inter-cluster similarity is low. Clustering is very commonly used for customer segmentation and for deriving marketing strategies.

[Figure: customer records plotted in three groups (Group 1, Group 2, Group 3) around their cluster centers]

Example - The customer base may be clustered around certain sets of attributes that uniquely determine cluster membership, for example the location, income group and age group of the customers.
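As a minimal sketch of clustering-based customer segmentation, assuming scikit-learn and a small, invented set of customer attributes (age and annual income):

    # Clustering sketch: segment customers by age and annual income.
    # The records are invented; attribute choice and the number of clusters
    # would normally come out of the data exploration phase.
    from sklearn.cluster import KMeans

    # Each row: [age, annual income (K)]
    customers = [
        [23, 42], [27, 48], [25, 45],    # younger, mid income
        [45, 110], [50, 120], [48, 95],  # older, high income
        [35, 30], [38, 28], [33, 35],    # mid age, lower income
    ]

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(customers)

    print(labels)                   # cluster membership for each customer
    print(kmeans.cluster_centers_)  # one center per segment

Each resulting segment can then be profiled - for example by location, income group and age group - to drive targeted marketing strategies.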
Time-Series Analysis: This analysis is used to find regularities and interesting characteristics in data that varies over time. It looks for sequential patterns, periodicity, trends and deviations.

Example - Time-series analysis may be used to predict sales quantities for different SKUs, based on demand patterns, market conditions and competitors' performance.

Example Applications

Retail/Marketing
- Identify buying patterns from customers
- Find associations among customer demographic characteristics
- Predict response to mailing campaigns

Banking
- Detect patterns of fraudulent credit card use
- Classify customers for target marketing
- Predict customers likely to change their credit card affiliation
- Determine credit card spending by customer groups

Insurance
- Claims analysis
- Predict which customers will buy new policies
- Identify behavior patterns of risky customers
- Identify fraudulent behavior

Telecommunication
- Call behavior analysis
- Churn analysis
- Fraud detection
- Call center performance

e-commerce
- Recommendation systems
- Website access profiling
- Personalization
- Clickstream analysis

Process of Data Mining

A systematic approach is essential to successful data mining. TranSys has effectively used the process model described in this section. It should be noted that the data mining process is not linear: the loops in the process model indicate that one or more previous steps may be revisited, depending on the results at the current step. For example, the results of the data exploration phase may require new data to be added to the database. Usually a number of initial models are built before arriving at a satisfactory one. The following is a brief description of the data mining process phases adopted by TranSys in providing data mining solutions.

1. Business Definition Phase

A prerequisite to knowledge discovery is a clear understanding of the business environment. This is required in order to recognize opportunities for improvement, to prepare the data for mining and to correctly interpret the results. A clear statement of business objectives makes the best use of the data mining effort. TranSys will work with clients to clearly define the business objective. This definition stage includes a way of measuring the results of the data mining project as well as its cost justification.

2. Data Building Phase

In this phase the data to be mined is collected in a database. Depending on the amount and complexity of the data, a flat file or a spreadsheet may often be adequate. The required components of data may be sourced from a data warehouse, which helps ensure the required cleanliness of the data. Data from external sources may also have to be integrated. TranSys will perform the following tasks in order to achieve the objectives of this phase (a brief sketch follows this list):
- Collect the required data
- Select the subset of data to be mined
- Assess data quality and, if required, cleanse the data
- Consolidate and integrate the data
- Load the data mining database

The Business Definition Phase and the Results Deployment Phase govern the effectiveness of the entire process.
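The following is a minimal sketch of these data building tasks, assuming pandas and two hypothetical source extracts (a warehouse export and an external demographics file); the file names, columns and filter are illustrative placeholders only.

    # Data building sketch: collect, select, cleanse, consolidate and load the
    # data to be mined. File names and column names are hypothetical.
    import pandas as pd

    # Collect the required data from the warehouse extract and an external source.
    transactions = pd.read_csv("warehouse_transactions.csv")
    demographics = pd.read_csv("external_demographics.csv")

    # Select the subset of data to be mined (for example, recent activity only).
    transactions = transactions[transactions["year"] >= transactions["year"].max() - 1]

    # Assess data quality and cleanse: drop duplicates and rows missing key fields.
    transactions = transactions.drop_duplicates()
    transactions = transactions.dropna(subset=["customer_id", "amount"])

    # Consolidate and integrate the two sources on the customer identifier.
    mining_data = transactions.merge(demographics, on="customer_id", how="left")

    # Load the data mining database (here simply persisted as a single table).
    mining_data.to_csv("mining_database.csv", index=False)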
[Figure: The Data Mining Process - Business Definition Phase, Data Building Phase, Data Exploration Phase, Data Preparation Phase, Model Building Phase, Model Evaluation Phase and Results Deployment Phase, with feedback loops between phases]

3. Data Exploration Phase

Understanding the data is very important, and graphing and visualization tools are a vital aid in understanding and preparing it. Data visualization most often provides the "Aha!" that leads to new insights and success. Some of the most common and useful graphical displays of data are histograms and box plots, which show the distributions of values. TranSys will work closely with the functional team members to identify the attributes and fields most important in predicting an outcome and to determine which derived values may be useful. They will use visualization, link analysis and other means of exploring the data.

4. Data Preparation Phase

This is the final phase before building models. It is often a good idea to sample the data when the database is large; if done carefully, this yields no loss of information. Data that is clearly extraneous needs to be identified and discarded. It is often necessary to construct new variables derived from the raw data. For example, forecasting credit risk using a debt-to-income ratio, rather than debt and income as separate predictor variables, may yield more accurate results that are also easier to understand. Data may also need to be discretized; for example, decision trees used for classification require continuous data such as income to be grouped into ranges or bins - High, Medium and Low, or given numeric ranges. The cut-off points chosen for the bins may change the outcome of a model.

TranSys will perform the following tasks in this phase (a brief sketch follows this list):
- Selection of variables
- Selection of rows
- Construction of new variables
- Transformation of variables
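A minimal sketch of the derived-variable and binning ideas from this phase, again assuming pandas; the column names and cut-off points are illustrative only.

    # Data preparation sketch: construct a derived variable and discretize income.
    # Column names and bin cut-offs are hypothetical.
    import pandas as pd

    customers = pd.DataFrame({
        "income": [28000, 45000, 62000, 95000],
        "debt":   [12000, 30000, 10000, 20000],
    })

    # Construction of a new variable: debt-to-income ratio as a credit-risk predictor.
    customers["debt_to_income"] = customers["debt"] / customers["income"]

    # Transformation of a variable: bin continuous income into Low / Medium / High.
    # Moving these cut-off points can change the outcome of the model.
    customers["income_band"] = pd.cut(
        customers["income"],
        bins=[0, 40000, 70000, float("inf")],
        labels=["Low", "Medium", "High"],
    )

    print(customers)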
5. Model Building Phase

Model building is an iterative process. One needs to explore alternative models to find the one most useful in solving the business problem. Once the type of prediction to be made has been decided, a model type must be chosen for making it. This could be a decision tree, a neural network, a proprietary method, or logistic regression. Based on the results of the initial models, another model may be built using the same technique but different parameters. No tool or technique is perfect for all data, and it is difficult, if not impossible, to be sure before starting which technique will work best. TranSys therefore builds a number of models before settling on a satisfactory one that provides sufficiently accurate results for the purpose.

6. Model Evaluation Phase

After building a model, its results must be evaluated, and it is important to test the model against the real world. There is no guarantee that an accurate model reflects the real world; a valid model is not necessarily a correct model. In addition, the data used to build the model may fail to match the real world in some unknown way, leading to an incorrect model. For example, if a model is used to select a subset of a mailing list, a test mailing should be done to verify the model. If a model is used to predict credit risk, it should be tried on a small set of applicants before full deployment. TranSys first analyzes the risk associated with an incorrect model: the higher that risk, the more importance is given to constructing an experiment that checks the model results.

7. Results Deployment Phase

Once a data mining model has been built and validated, its results can be deployed to applications within the enterprise. For example, the clusters identified by the model can be used to extract the rules that define the model and to make recommendations for new observations. TranSys may aid in developing the application in which the model is embedded. For example, the business rules component derived from the model can be integrated with a loan application system to facilitate the evaluation of applicants.
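As a minimal sketch of the model building and evaluation phases, assuming scikit-learn, a synthetic stand-in for the prepared data set, and a simple hold-out experiment comparing two candidate techniques; all names and figures are illustrative.

    # Model building and evaluation sketch: fit two candidate models and compare
    # them on held-out data before deployment. The data set is synthetic.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Stand-in for the prepared mining table (e.g. applicant features and a risk label).
    X, y = make_classification(n_samples=500, n_features=6, random_state=0)

    # Hold out part of the data to play the role of "the real world".
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )

    candidates = {
        "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
        "logistic regression": LogisticRegression(max_iter=1000),
    }

    # Model building is iterative: fit each candidate and evaluate it on the
    # hold-out set; the better-performing model is considered for deployment.
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: hold-out accuracy = {accuracy:.2f}")

In a high-risk setting such as credit scoring, the chosen model would additionally be piloted on a small set of real applicants before full deployment, as described in the Model Evaluation Phase above.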
Knowledge Discovery in Databases (KDD)

The KDD process extends data mining by consolidating the discovered knowledge and then incorporating it into the operational systems. This knowledge integration has been achieved to a fairly large extent in the e-business domain compared to other domains. For example, discovered knowledge about the behavior of customers (essentially from clickstream data) has been used effectively to improve sites, personalize pages, improve promotional and other features, and enhance the buying experience.

The value of the discovered knowledge lies in its appropriate use. The focus should be on utilizing data and knowledge strategically, in ways that can provide a competitive edge. TranSys, with its domain experience and practice excellence, can help clients minimize the risks of running KDD processes.
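To make the knowledge-integration point concrete, the following is a minimal sketch of how a discovered association rule (such as the DVD player rule from the association example earlier) might be embedded in an operational e-commerce system to drive recommendations; the rule, threshold and attribute names are hypothetical.

    # Knowledge integration sketch: apply a discovered association rule inside an
    # operational recommendation step. The rule and the customers are hypothetical.

    # A rule mined earlier: customers aged 20-29 with income 40-50K buy DVD
    # players with 80% confidence.
    RULE = {
        "if": {"age_group": "20-29", "income_band": "40-50K"},
        "then": "DVD Player",
        "confidence": 0.80,
    }

    def recommend(customer, rule=RULE, threshold=0.7):
        """Return the rule's consequent if the customer matches its conditions
        and the rule is confident enough; otherwise return None."""
        matches = all(customer.get(k) == v for k, v in rule["if"].items())
        if matches and rule["confidence"] >= threshold:
            return rule["then"]
        return None

    print(recommend({"age_group": "20-29", "income_band": "40-50K"}))  # DVD Player
    print(recommend({"age_group": "30-39", "income_band": "40-50K"}))  # None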