MASTER'S THESIS 2009:097 Mining Changes in Customer Purchasing Behavior - a Data Mining Approach Samira Madani Luleå University of Technology Master Thesis, Continuation Courses Marketing and e-commerce Department of Business Administration and Social Sciences Division of Industrial marketing and e-commerce 2009:097 - ISSN: 1653-0187 - ISRN: LTU-PB-EX--09/097--SE
Abstract: The world around us is changing all the time. For businesses, knowing what is changing and how it has changed is crucial. One of the most important aspects of surviving in a dynamic market is to recognize and adapt to changes in customer behavior. For Fast Moving Consumer Goods (FMCG) distribution companies, this issue is even more important. Because of the variety of FMCG products, distribution companies and their different strategies, the purchasing behavior of customers may change many times during a period, and competition becomes tougher. The purpose of this study is to help Kalleh Company, a manufacturer and distributor of food products in the Iranian market, to mine changes in its customers' behavior. Mining changes involves several steps: data collection, data preprocessing, customer segmentation, mining customer behavior patterns and change mining. For customer segmentation, we use the Customer Value Matrix. For mining behavior patterns, we use the Apriori algorithm and maximal frequent itemsets. Based on the literature, there are different kinds of changes: added/perished rules, emerging patterns and unexpected changes. There are also two measures, similarity and unexpectedness, for quantifying change. In this study, we first calculate changes using these measures as defined in the literature. We then modify the measures to account for the difference between values of ordinal attributes, so that this information is included in the calculation of changes. Our contribution is this modification of the change measures, which brings more information and higher accuracy to change mining. The results are presented in Chapter 4. Marketing managers can use the detected changes to respond accurately and in a timely manner to changes in the market. In addition, they can use them to evaluate different marketing campaigns, build stronger relationships with their customers and understand the market better.
There are many implications of mining changes for the macro and micro aspects of businesses, as well as for marketing campaigns and manufacturing.
Abstract
Chapter 1: Introduction
1.1 Background of the study
1.2 Problem definition
1.3 Purpose of this study
1.4 Research question
1.5 Research motivation
1.6 Demarcation
1.7 Research outline
Chapter 2: Literature Review
2.1 Mining Customer Behavior
2.2 Review of Data Mining
2.2.1 Data mining: in brief
2.2.2 Data mining Functions
2.2.3 Classification in brief
2.2.4 Clustering in brief
2.2.5 Association Rules in Brief
2.3 Association Rule Mining Review
2.3.1 Association Rule mining problem
2.3.2 Apriori Algorithm
2.3.3 Association Rule Mining Approaches: Apriori Approach
2.4 Mining Changes Literature Review
2.5 Customer segmentation review
2.5.1 Clustering Analysis
2.5.2 Customer Segmentation Model
2.5.3 RFM Model
2.5.4 RFM Scoring
2.5.5 Customer Value Matrix Model
Chapter 3: Research Methodology
3.1 Research Methodology
3.2 Research Design
3.3 Research Purpose
3.4 Research Approaches
3.5 Research Strategy
3.6 Research process
3.7 Data Collection and Description
3.8 Data Pre-Processing
3.9 Customer Segmentation
3.9.1 Customer Value Matrix
3.9.2 An effective analytical tool
3.9.3 Customer Value Matrix Methodology
3.10 Mining Customer Behavior
3.10.1 Association Rule Mining
3.10.2 Apriori algorithm
3.11 Change Mining
3.11.1 Change Mining
Chapter 4: Results & Analysis
4.1 Data preprocessing result
4.1.1 Data Cleaning
4.1.2 Data Transformation result
4.2 Customer segmentation (in SQL Server 2000)
4.2.1 Customer Value Matrix Result
4.3 Customer Behavior Mining
4.3.1 Discretization Result
4.3.2 Association Rule Mining Results
4.4 Change Mining
4.4.1 Some examples of change pattern
4.4.2 Association rules and changes based (Chen et al, 2005)
4.4.3 Rules with discrete variables in RHS
4.4.4 Change mining with Manhattan distance
Chapter 5: Conclusion, further research
5.1 Conclusion
5.2 Our contribution
5.3 Limitation
5.4 Managerial Implication
5.5 Future works
References

List of tables
Table 2.1: Factors for classification of ARM
Table 2.2: Mining in a changing environment timetable
Table 3.1: Data collected from Kalleh Company
Table 3.2: Calculating variables for customer value matrix
Table 4.1: RFM table fields
Table 4.2: Calculating variables for customer value matrix
Table 4.3: Calculating variables for customer value matrix
Table 4.4: Segment information for period 1
Table 4.5: Segment information for period 2
Table 4.6: R quantile
Table 4.7: M quantile
Table 4.8: F quantile
Table 4.9: Area quantile
Table 4.10: Generated rule summary
Table 4.11: Generated Rules for period 1 Cluster 1
Table 4.12: Generated Rules for period 2 Cluster 1
Table 4.13: Generated Rules for period 1 Cluster 2
Table 4.14: Generated Rules for period 2 Cluster 2
Table 4.15: Generated Rules for period 1 Cluster 3
Table 4.16: Generated Rules for period 2 Cluster 3
Table 4.17: Generated Rules for period 1 Cluster 4
Table 4.18: Generated Rules for period 2 Cluster 4
Table 4.19: Cat1 quantile
Table 4.20: Cat2 quantile
Table 4.21: Cat3 quantile
Table 4.22: Cat5 quantile
Table 4.23: Cat11 quantile
Table 4.24: Cat13 quantile
Table 4.25: Generated Rules for period 1 Cluster 1, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.26: Generated Rules for period 2 Cluster 1, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.27: Generated Rules for period 1 Cluster 2, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.28: Generated Rules for period 2 Cluster 2, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.29: Generated Rules for period 1 Cluster 3, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.30: Generated Rules for period 2 Cluster 3, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.31: Generated Rules for period 1 Cluster 4, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.32: Generated Rules for period 2 Cluster 4, Change mining by (Chen et al, 2005) measures & Manhattan distance

List of figures
Figure 2.1: Knowledge Discovery in Database Processes
Figure 2.2: The major steps in data mining process
Figure 2.3: Classification of DM techniques
Figure 2.4: Classic Problem of association rule mining
Figure 2.5: Mining in a changing environment review
Figure 2.6: Customer Value Matrix
Figure 3.1: Research design of this study
Figure 3.2: Change mining process perspective
Figure 3.3: Change mining process
Figure 3.4: Change mining process in detail
Figure 3.5: Product categories of Kalleh company
Figure 3.6: Customer value matrix
Figure 4.1: Generalized product category
Figure 4.2: The Customer Value Matrix
Figure 4.3: R histogram
Figure 4.4: M histogram
Figure 4.5: F histogram
Figure 4.6: Area histogram
Figure 4.7: Cat1 histogram
Figure 4.8: Cat2 histogram
Figure 4.9: Cat3 histogram
Figure 4.10: Cat5 histogram
Figure 4.11: Cat11 histogram
Figure 4.12: Cat13 histogram
Chapter 1: Introduction
Background of the study
Problem definition
Purpose of this study
Research question
Research motivation
Research demarcation
Research outline
1.1 Background of the study: The world around us changes continuously. Knowing about changes and adapting to them is an important aspect of our lives. For businesses, knowing what is changing and how it has changed is also essential (Liu et al, 2000). One of the most important aspects of surviving in a dynamic market is to recognize and adapt to changes in customer behavior. Moreover, in recent years there has been explosive growth in the amount of information (Min, S., H., Han, I., 2005). In general, fast moving consumer goods (FMCG) distribution companies collect huge amounts of data about their customers and their purchasing transactions. This data contains interesting hidden information about the customers and their behavior. The traditional approach to marketing decisions about promotions, campaigns and market research in FMCG distribution companies relies mainly on the opinions of internal experts. These experts include the marketing managers and the sales managers, who are in constant touch with the salespeople and merchandisers who bring them market information. However, this kind of decision-making process ignores the customer data and the behavior it records. Furthermore, in today's world, where the market is highly competitive and the number of products is overwhelming, customers face various products and various providers with different marketing strategies (Hossein Javaheri, S., 2008). In such a dynamic market, customer behavior changes all the time (Chen et al, 2005). When marketing managers become aware of changes in the market through the sales team, they often do not know how and where to start understanding these changes and their causes. The result is broad, time-consuming and costly market research whose findings may not reach the marketing department in time to react to the changes.
Also, in such a market there are so many promotion campaigns, run by the company itself and by its competitors, that it is difficult to analyze their effectiveness. In this competitive environment, there is therefore a need to mine customer data and transactions to find changes in customer purchasing behavior, which is an effective and efficient way to respond to customer needs in a timely and accurate manner. As a result, many FMCG distribution companies in Iran are trying to move away from the traditional way of planning their marketing campaigns, promotions and market research, and toward understanding the changes happening in their customers' purchasing behavior. Change mining helps managers to devise better marketing strategies.

1.2 Problem definition: Kalleh Company is a private manufacturer and distributor of food products in Iran. It produces different categories of food products, from dairy products to ice cream, meats and sauces. It has more than 10 product categories and about 800 products. The company now faces the challenge of increasing competition, for several reasons. First, because of its wide product range, it has to compete in different food markets such as dairy, ice cream and meat. As a result, it competes with many competitors that have different product categories and different marketing strategies. There are also some powerful government-owned companies that make competition tougher for Kalleh. In such a market, customer behavior may change in response to companies' strategies and also because customers' own needs change. To respond to changes in customer purchasing behavior in time, and not to fall behind customer needs and the competition, Kalleh Company needs to mine changes in customer purchasing behavior. The goal of Kalleh Company is to mine changes in the purchasing behavior of customers in different segments so that it can respond to these changes in a timely and accurate manner and increase its return on investment (ROI).

1.3 Purpose of this study: The purpose of this study is to mine changes in customer purchasing behavior. To reach this goal, we need to build customer purchasing patterns from the customer, product and transaction data collected in databases. Data mining techniques can help us to do so. According to (Song et al, 2001), data mining is the process of exploring and analyzing large quantities of data in order to discover meaningful patterns and rules.
Many data mining studies have focused on developing techniques to build precise models that predict customer behavior and support marketing strategies and customization. According to (Nemati & Barko, 2001; cited by Nemati, H.R., Barko, C. D., 2003), most data mining applications (72%) are centered on predicting customer behavior. Comparatively little attention has been paid to discovering changes in databases collected over time (Liu et al., 2000). From the literature review, it is clear that too much time is spent worrying about absolute numbers, such as Lifetime Value. What should really be observed is how relative numbers change over time. From a marketing viewpoint, the customers with the highest potential ROI are those who are in the process of changing their behavior, either accelerating or ending their relationship with the company (Novo, j., 2008). In many applications, mining changes can be more important than producing precise prediction models, which are the focus of existing data mining research. However accurate a model is, it is passive in itself because it can only predict based on patterns mined from old data. Acting on the model should not lead to actions that change the environment, because otherwise the model will cease to be accurate (Liu et al., 2000). Building prediction models is more appropriate in areas where the environment is relatively stable. However, in many business situations, constant human intervention in the environment is a fact. Businesses simply cannot let nature take its course. They constantly need to take action to provide better services and products, by finding both the interesting changes and the stable patterns in customer behavior. Even in a relatively stable environment, changes are unavoidable due to internal and external factors (Liu et al., 2000). From these viewpoints, the question "Which patterns exist?", as answered by state-of-the-art data mining technology, is replaced by the question "How do patterns change?" (Böttcher, M., et al, 2006).
In fact, the discovery of interesting and previously unknown changes in customer, product and transaction data not only lets the user monitor the impact of past business decisions but also prepares today's business for tomorrow's needs (Böttcher, M., et al, 2006). Major changes often need immediate attention and action to modify existing practices and/or to change the domain situation (Liu et al, 2000). By using a change mining methodology, Kalleh Company can detect different kinds of changes in customer purchasing behavior and build stronger relationships with its customers. Understanding changes in customer behavior can also help managers to set up effective and efficient promotion campaigns. (Liu et al, 2000) mention two main goals for mining changes in a business environment:

"To follow the trends": The key feature of this kind of application is the word "follow". Companies want to know where the trend is going so as not to be left behind. They need to analyze customers' changing behavior in order to provide products and services that suit the changing needs of the customers.

"To stop or to delay undesirable changes": In this kind of application, the keyword is "stop". Companies want to learn about undesirable changes as soon as possible and to plan corrective measures to stop or delay such changes.

The overall procedure consists of several steps. The literature offers some methods for change mining in dynamic situations. According to (Song et al, 2001), the majority of data mining techniques, such as association rules and neural networks, cannot be used alone because they do not handle dynamic situations well. (Song et al, 2001) and (Chen et al, 2005) developed a methodology for mining changes. They used association rules, introduced by (Agrawal et al., 1993), to detect interesting association relationships among a large set of data items. The methodology detects all kinds of changes. According to (Chen et al, 2005), change mining has several steps: data preprocessing, customer segmentation, association rule mining and change mining. First, customers are segmented based on their behavioral variables: recency, frequency and monetary (RFM). Then, by building association rules from the customer behavioral variables (RFM), customer data and transaction data, we describe customer purchasing behavior at two different time snapshots. Finally, we compare the rules generated for each segment to mine changes in customer purchasing behavior. Mining changes requires various algorithms and techniques, and implementing them requires extensive programming. In this study, we combined all of the algorithms to build a change mining package.
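The first and last steps of the procedure described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the package developed in this thesis: the function names `rfm` and `compare_rules`, the tuple layout of the transactions and the sample data are all hypothetical, and the rule comparison is a plain set difference rather than the similarity and unexpectedness measures of (Chen et al, 2005).

```python
from collections import defaultdict
from datetime import date


def rfm(transactions, snapshot):
    """Return {customer: (recency_days, frequency, monetary)}.

    `transactions` is assumed to be an iterable of
    (customer_id, purchase_date, amount) tuples (hypothetical layout).
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in transactions:
        # Keep the most recent purchase date seen for this customer.
        last_purchase[customer] = max(last_purchase.get(customer, day), day)
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        c: ((snapshot - last_purchase[c]).days, frequency[c], monetary[c])
        for c in last_purchase
    }


def compare_rules(rules_period_1, rules_period_2):
    """Split two rule sets into rules seen in one period only or in both.

    Simplifying assumption: rules are hashable and compared for exact
    equality; no similarity measure is applied.
    """
    p1, p2 = set(rules_period_1), set(rules_period_2)
    return {
        "only_in_period_1": p1 - p2,
        "only_in_period_2": p2 - p1,
        "in_both": p1 & p2,
    }


# Hypothetical sample data for one customer segment.
sample_transactions = [
    ("c1", date(2008, 1, 5), 100.0),
    ("c1", date(2008, 2, 1), 50.0),
    ("c2", date(2008, 1, 20), 30.0),
]
print(rfm(sample_transactions, snapshot=date(2008, 3, 1)))

rules_1 = {(("R=high",), "Cat1"), (("F=low",), "Cat2")}
rules_2 = {(("R=high",), "Cat1"), (("M=high",), "Cat3")}
print(compare_rules(rules_1, rules_2))
```

In this sketch each rule is a (left-hand side, right-hand side) tuple; in a fuller treatment the comparison would be run separately for every customer segment.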
1.4 Research question: Based on the problem discussion above, the purpose of this study is to mine changes in customer purchasing behavior. To reach this purpose, the research questions are as follows: How can businesses be responsive to changes in customer behavior in a dynamic market? In addition, how can businesses detect and access the changes that have happened in customer behavior patterns, so that they can respond accurately and in a timely manner?
1.5 Research motivation: Recently, we have witnessed an explosion of data produced and collected by individuals and organizations. This fast growth in data and databases has created the problem of data overload (Li, X. B., 2005). More recently, increased computing power has led to greater flexibility in the models one can use and in the amount of data that can be stored and processed (Bolton, R. J., 2004), and as a result data mining techniques have emerged and flourished in the past several years to meet this demand (Li, X. B., 2005). Organizations are starting to understand the importance of data mining for their marketing strategies. At the same time, businesses face the challenge of a constantly evolving market where customer needs are changing all the time (Chen et al, 2005). In such an environment, recognizing changes and responding to them rapidly and correctly is highly important. If businesses cannot meet customer needs as they change over time, they will lose their customers, who are the source of their ROI. Some work on change mining has been done in retailing. One business that change mining can help to improve is FMCG distribution, which faces a dynamic market with a huge variety of products and competitors. The purpose of change mining is to follow the trends in customer purchasing patterns, detect the changes and respond to them in time, in order to better satisfy customers and meet their needs.

1.6 Demarcation: This study focuses on mining changes in customer purchasing behavior based on the customer and purchasing transaction data stored in a database. Change mining was performed on data gathered from the database of an FMCG distribution company in Iran. Most of the literature reviewed concerns mining changes in customer purchasing behavior. Our work focuses on building customer behavior patterns by association rule mining and on comparing the resulting rules.
These patterns are based only on customers' previous transactions.

1.7 Research outline: This thesis consists of five chapters. Chapter 1 is the introduction, which gives a brief background to the subject followed by the research question, objectives and motivation. Chapter 2 is the literature review, covering data mining, association rules, change mining and customer segmentation. Chapter 3 describes our research methodology, including data preprocessing, market segmentation, mining customer behavior and change mining. Chapter 4 presents the results and analysis. Chapter 5, the last chapter, contains the conclusion, limitations and further research.
Chapter 2: Literature Review
Review of Mining Customer Behavior
Review of Data Mining
Review of Association Rule Mining
Review of Change Mining
Review of Customer Segmentation
2.1 Mining Customer Behavior: Different methods for describing customer behavior exist in the literature. Among them are various types of conjunctive rules for building customer behavior patterns, including association rules and classification rules (Agrawal R. et al, 1996 & Breiman L., et al., 1984 cited in Adomavicius, G., Tuzhilin, A., 2001). Using rules to describe customer behavior has certain advantages. Besides being a descriptive way to portray behavior, the conjunctive rule is a well-studied concept that is used widely in data mining, expert systems and many other areas. In addition, researchers have proposed many rule discovery algorithms in the literature, especially for association rules (Adomavicius, G., Tuzhilin, A., 2001). To discover rules that describe the behavior of customers, we can use various data mining algorithms, such as Apriori for association rule mining. Association rules were initially applied in market basket analysis to find relationships between the product items purchased by customers at retail stores (Agrawal, Imielinski, & Swami, 1993; Srikant, Vu, & Agrawal, 1997 cited by Chen et al, 2005). In research on customer behavior, we can apply association rules to find correlations between customer demographic variables, purchased products and product databases (Song et al, 2001). In this chapter, we first review data mining and then association rules. The next topic is change mining of customer behavior in the literature. Finally, we briefly review customer segmentation.

2.2 Review of Data Mining

2.2.1 Data mining in brief: Today, databases can be very large. Hidden strategic information can be found within this data, but with such a huge amount of data, drawing meaningful conclusions is not easy. The answer is data mining, which is used both to increase revenues and to reduce costs.
Many people use data mining as a synonym for another popular term, Knowledge Discovery in Databases (KDD). Others define data mining as the core process of KDD. The KDD processes are shown in Figure 2.1 (Han, J., & Kamber, M., 2006). KDD usually has three processes. The first is preprocessing, which is executed before the data mining techniques so that they are applied to the right data. Preprocessing includes data cleaning, integration, selection and transformation. The main process of KDD is the data mining process, in which different algorithms are applied to produce hidden knowledge. The last process is post-processing, which evaluates the mining result according to users' requirements and domain knowledge. If the evaluation result is satisfactory, the knowledge can be presented; otherwise, some or all of these processes have to be run again until a satisfactory result is obtained (Han, J., & Kamber, M., 2006).

Figure 2.1: Knowledge Discovery in Database Processes

(Song et al, 2001) define data mining as a process of exploring and analyzing large quantities of data to discover meaningful patterns and rules. (Feelders et al, 2000) define the process of data mining as follows:
Source: (Feelders et al, 2000)
Figure 2.2: The major steps in the data mining process

The potential returns of data mining are immense. Innovative organizations worldwide are already using data mining to attract higher-value customers, to reconfigure their product offerings to increase sales, and to minimize losses due to mistakes or fraud.

2.2.2 Data mining functions: (Dunham, 2002) divides data mining into two categories, descriptive and predictive (Figure 2.3).

Source: (Dunham, 2002)
Figure 2.3: Classification of DM techniques

The first and simplest analytical step in data mining is to describe the data: summarize its statistical attributes (such as means), review it visually using charts and graphs, and look at correlations among variables. The most important steps are selecting the right data, gathering it and exploring it. Sometimes data description alone cannot provide an action plan. You must build a predictive model based on patterns determined from known results, and then test that model on new sample data. A good model should never be confused with reality, but it can be a useful guide to understanding your business. Finally, the model should be verified empirically (Twocrows.com, 2005). In the next sections, we briefly explain three important data mining techniques.

2.2.3 Classification in brief: According to (Han and Kamber, 2006), classification is automatic model building that classifies a set of objects so as to predict the class or missing attribute value of future objects whose class may not be known. The process has two steps. In the first step, a model is built to describe the characteristics of a set of data classes or concepts, based on a training data set. Because the data classes or concepts are predefined, this step is also known as supervised learning. In the second step, the model is used to predict the classes of future data or objects. There are several techniques for classification (Han and Kamber, 2006). Much research has been done on classification by decision tree and plenty of algorithms have been designed; Murthy carried out an extensive survey of decision tree induction (Murthy, 1998; cited by Han, J., & Kamber, M., 2006). Bayesian classification is another technique, which can be found in (Duda and Hart, 1973 cited by Han, J., & Kamber, M., 2006). Nearest neighbor methods are also discussed in many statistical texts on classification, such as (Duda and Hart, 1973, cited by Han, J., & Kamber, M., 2006) and (James, 1985, cited by Han, J., & Kamber, M., 2006).
In addition, many other machine learning and neural network techniques are used to help build classification models.

2.2.4 Clustering in brief: As mentioned before, classification is a supervised learning process. Clustering is another mining technique similar to classification, but it is an unsupervised learning process. "Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects" (Han, J., & Kamber, M., 2006), so that objects within the same cluster are similar to some extent and dissimilar to objects in other clusters. In classification each record belongs to a predefined class, while in clustering there are no predefined classes; objects are grouped together based on their similarities (Han, J., & Kamber, M., 2006). Similarities between objects are expressed by similarity functions; usually they are defined quantitatively, as a distance or another measure, by the relevant domain experts (Han, J., & Kamber, M., 2006). Clustering is most commonly applied in market segmentation. By clustering their customers into different groups, business organizations can provide different personalized services to different market segments (Han, J., & Kamber, M., 2006). An extensive survey of current clustering techniques and algorithms is available in (Berkhin, 2002; cited by Han, J., & Kamber, M., 2006).

2.2.5 Association rules in brief: Association rule mining is one of the most important techniques of data mining. It was first introduced by (Agrawal et al, 1993). The goal of this technique is to extract interesting correlations, frequent patterns and associations among sets of items in transaction databases or other data repositories (Agrawal et al, 1993). Association rules are used extensively in various areas. In this study, we use association rules to mine customer behavior patterns and find behavioral changes. The next section reviews association rule mining.

2.3 Association Rule Mining Review:

2.3.1 Association rule mining problem: In this section, we introduce the association rule mining problem in detail. A typical association rule is an implication of the form A → B, where A is an itemset and B is an itemset that contains only a single atomic condition (Berry & Linoff, 2004). There are two measures for evaluating an association rule. The support of an association rule is the percentage of records containing both A and B, and the confidence of a rule is the percentage of records containing itemset A that also contain itemset B. Support shows the usefulness of a discovered rule, and confidence shows its certainty (Berry & Linoff, 2004). We can calculate another measure called lift.
It compares the confidence of a rule with the expected value of that confidence. (Berry & Linoff, 2004) define lift (also called improvement) as a measure of how much better a rule is at predicting the result than simply assuming the result in the first place. Lift is the ratio of the density of the target after application of the left-hand side to the density of the target in the population (Berry & Linoff, 2004). Put another way, lift is the ratio of the number of records that support the entire rule to the number that would be expected if there were no relationship between the products (the exact formula is given later in the chapter) (Berry & Linoff, 2004).
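The three measures defined above can be computed directly from a list of transactions. The following Python sketch is only an illustration: the function names and the five sample baskets are hypothetical, and lift is computed in its common form, confidence(A → B) divided by support(B), which matches the ratio described in the text.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never applies, so treat confidence as 0
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


def lift(transactions, lhs, rhs):
    """Confidence of the rule divided by the overall frequency of `rhs`."""
    rhs_support = support(transactions, rhs)
    if rhs_support == 0:
        return 0.0
    return confidence(transactions, lhs, rhs) / rhs_support


# Hypothetical sample baskets.
baskets = [
    {"milk", "yogurt"},
    {"milk", "cheese"},
    {"milk", "yogurt", "cheese"},
    {"cheese"},
    {"yogurt"},
]

# Rule: milk -> yogurt
print(round(support(baskets, {"milk", "yogurt"}), 3))       # 0.4
print(round(confidence(baskets, {"milk"}, {"yogurt"}), 3))  # 0.667
print(round(lift(baskets, {"milk"}, {"yogurt"}), 3))        # 1.111
```

A lift above 1, as in this example, means the right-hand side occurs more often among transactions containing the left-hand side than in the data as a whole.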
2.3.2 Apriori Algorithm

Association rule mining means discovering, from a database, the association rules that satisfy a predefined minimum support and confidence (Agrawal, R., & Srikant, R., 1994). According to (Agrawal, R., & Srikant, R., 1994), this problem is usually decomposed into two sub-problems. The first is to find those itemsets whose occurrences in the database exceed a predefined threshold; these itemsets are called frequent or large itemsets. This sub-problem can be further divided into two: the candidate large itemset generation process and the frequent itemset generation process. Large or frequent itemsets are those whose support exceeds the support threshold, and candidate itemsets are those that are expected, or hoped, to be large or frequent. The second sub-problem is to produce association rules from the large itemsets under the constraint of minimum confidence. The whole process of the standard association rule mining problem is shown in Figure 2.4.

Source: (Agrawal et al, 1993)
Figure 2.4: Classic Problem of association rule mining

The overall performance of mining association rules is determined mainly by the first step (Agrawal, R., & Srikant, R.). After the large itemsets are found, the corresponding association rules can be derived in a straightforward manner. The focus of most mining algorithms is therefore the efficient counting of large itemsets, and many efficient solutions have been designed for this purpose (Kantardzic.M, 2003).
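The two sub-problems above can be sketched in Python as follows. This is a simplified, unoptimized illustration of the level-wise idea behind Apriori, assuming transactions are given as sets of item names; the function names `apriori` and `generate_rules` and the sample baskets are hypothetical and are not taken from (Agrawal, R., & Srikant, R., 1994).

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Sub-problem 1: return {frequent itemset: support}."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Candidate 1-itemsets: every distinct item.
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those meeting the threshold.
        level = {}
        for c in candidates:
            count = sum(1 for t in transactions if c <= t)
            if count / n >= min_support:
                level[c] = count / n
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def generate_rules(frequent, min_confidence):
    """Sub-problem 2: derive rules (lhs, rhs, support, confidence)."""
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(itemset, size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already stored in `frequent`.
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, supp, conf))
    return rules


# Hypothetical sample baskets.
baskets = [
    {"milk", "yogurt"},
    {"milk", "cheese"},
    {"milk", "yogurt", "cheese"},
    {"cheese"},
    {"yogurt"},
]
frequent_itemsets = apriori(baskets, min_support=0.4)
for lhs, rhs, supp, conf in generate_rules(frequent_itemsets, min_confidence=0.6):
    print(sorted(lhs), "->", sorted(rhs), round(supp, 2), round(conf, 2))
```

In this sample the candidate {milk, yogurt, cheese} is discarded in the prune step without being counted, because its subset {yogurt, cheese} is not frequent. This is the kind of saving that makes the first sub-problem tractable.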
Different kinds of produced AR: One attraction of association rules is the clarity and utility of the results, which take the form of rules about groups of products. An association rule has an intuitive appeal because it shows how tangible products and services group together (Berry & Linoff, 2004). While association rules are easy to understand, they are not always useful (Berry & Linoff, 2004). There are three types of generated association rules: actionable rules, trivial rules and inexplicable rules. Actionable rules are the useful ones, containing high-quality, actionable information. Once the pattern is found, it is often not hard to justify, and thinking about the rule in its real environment can lead to insights and actions. Because the rule is easily understood, it suggests plausible causes and possible interventions (Berry & Linoff, 2004). Another type of association rule is the trivial rule. Trivial results are already known to anyone familiar with the business. Although such a rule is valid and well supported in the data, it is still of no practical use. A simple example is that customers who purchase hamburgers also buy hamburger buns. A subtler problem falls into the same category: an apparently interesting result may be the product of past marketing programs and product bundles. Other data mining techniques also have this problem, but market basket analysis is especially vulnerable to reproducing the success of prior marketing campaigns because of its dependence on un-summarized point-of-sale data, exactly the same data that defines the success of the campaign. Trivial rules do have one advantage: when a rule should hold 100 percent of the time, the few cases where it does not hold supply a lot of information about data quality. The exceptions to trivial rules point to areas where business operations, data collection and processing may need to be further refined (Berry & Linoff, 2004). Inexplicable results seem to have no explanation and do not suggest a course of action.
A word of caution applies to market basket analysis: many of the results are often either trivial or inexplicable. Trivial rules reproduce common knowledge about the business, which wastes the effort spent on applying complex analysis techniques, and inexplicable rules are flukes in the data and are not actionable (Berry & Linoff, 2004).

ARM Approaches Classification:

Association rule mining is a well-studied research area; in this section, we review only some basic and classic approaches to association rule mining. As
mentioned before, the second sub-problem of ARM is straightforward, so most of those approaches focus on the first sub-problem. As mentioned, the first sub-problem can be further divided into two steps: the candidate large itemset generation process and the frequent itemset generation process. Most of the surveyed algorithms for mining association rules are quite similar; they differ in the extent to which specific improvements have been made. According to (Zhao, Q., Bhowmick, S.S., 2003), there are three milestones in the classic ARM problem: the Apriori approach, tree structure approaches and special issues in ARM. Besides these, there is another approach from (Zaki et al, 1999), the class-based algorithms approach.

The literature offers several basic features that can be used to classify ARM algorithms and that try to best differentiate them. The features we have found in the literature are described below and summarized in Table 2.1.

Target: Basic association rule algorithms find all rules that meet the support and confidence thresholds. However, more efficient algorithms can be used. One way to achieve this is to add constraints on the rules that are produced. Algorithms can be categorized as complete (all association rules satisfying the support and confidence thresholds are found), constrained (some subset of all the rules is found, based on a technique for limiting them), and qualitative (a subset of the rules is produced based on additional measures, beyond support and confidence, that need to be satisfied) (Dunham M.H., et al, 2001).

Type: This feature indicates the type of association rules that are produced (for example regular (Boolean), spatial, temporal, generalized, qualitative, etc.) (Dunham M.H., et al, 2001).
Data type: Besides data stored in a database, other types of data are also important. Association rules found in plain text may carry very important information. For example, the words data, mining, and decision may be highly dependent in a paper on knowledge discovery (Dunham M.H., et al, 2001).

Data source: In addition to market basket data, association rules for data absent from the database may play an important role in a company's decisions (Dunham M.H., et al, 2001).
Technique: All approaches to date are based on first finding the large itemsets. There could, of course, be other techniques that do not require large itemsets to be found first. Although we are not aware of any such technique to date, the possibility certainly exists, with the potential for improved performance. However, (Agrawal et al, 1998) cited in (Dunham M.H., et al, 2001) proposed strongly collective itemsets to evaluate and find itemsets. The notions of support and confidence used there differ from those of the large itemset approach. An itemset I is said to be strongly collective at level K if the collective strength C(I) of I, as well as of any subset of I, is at least K (Dunham M.H., et al, 2001).

Itemset Strategy: Different algorithms generate itemsets differently. This feature describes how the algorithm treats transactions and when the itemsets are produced. One technique, Complete, produces and counts all potential itemsets. The most common approach is the one introduced by Apriori. With this strategy, the set of itemsets to count is produced prior to scanning the transactions, and this set remains constant during the process. A dynamic strategy produces the itemsets during the scanning of the database itself. A hybrid technique generates some itemsets prior to the database scan, but also adds new itemsets to this counting set during the scan (Dunham M.H., et al, 2001).

Transaction Strategy: Different algorithms treat the set of transactions differently. This feature describes how the algorithm scans the set of transactions. The complete strategy checks all transactions in the database. With the sample approach, some subset of the database (a sample) is checked prior to processing the complete database. The partition techniques divide the database into partitions, and scanning the database requires that the partitions be checked individually and in order (Dunham M.H., et al, 2001).
Itemset Data Structure: As itemsets are produced, different data structures can be used to keep track of them. The most common approach seems to be a hash tree. Alternatively, a trie or lattice may be used. At least one technique suggests a virtual trie structure in which only a portion of the complete trie is actually materialized (Dunham M.H., et al, 2001).

Transaction Data Structure: "Each algorithm assumes that the transactions are stored in some basic structure, usually a flat file or a TID list" (Dunham M.H., et al, 2001).

Optimization: Many algorithms have been introduced that improve on earlier
algorithms by applying an optimization strategy. Various strategies have considered optimization based on available main memory, whether or not the data is skewed, and pruning of the itemsets to be counted (Dunham M.H., et al, 2001).

Architecture: Some algorithms are designed to run sequentially on a centralized single-processor architecture. Alternatively, algorithms have been designed to work in parallel on a multiprocessor or distributed architecture (Dunham M.H., et al, 2001).

Parallelism Strategy: Parallel algorithms can be further described as task parallel or data parallel (Dunham M.H., et al, 2001).

The literature contains some other features by which association rule mining methods can be categorized; they are considered below.

Counting Strategy: This refers to the method used to count the occurrences of candidate itemsets. Horizontal counting and vertical intersection are the two main approaches. Horizontal counting determines the support of a candidate itemset by scanning the transactions one by one and increasing the counter of the itemset whenever it is a subset of the transaction. This approach works well for a rarely occurring candidate because only those transactions containing that itemset need to be checked. The candidate lookup operation, however, is very expensive for candidates of large size (Su, J. H., Lin, W. Y., 2004). Vertical intersection, on the other hand, is applied when the database is in a vertical format, in which every record is associated with an item and stores the identifiers of the transactions containing that item, called a tidlist. Although the vertical intersection scheme avoids the I/O cost of database scans, it has the following shortcoming: when a candidate itemset has a support count much lower than the number of transactions, a large number of unnecessary intersections are performed (Su, J. H., Lin, W. Y., 2004).

Search direction: According to (Su, J.
H., Lin, W. Y., 2004), there are two main methods for search direction: bottom-up traversal and top-down traversal. Today, most Apriori-like approaches apply a bottom-up traversal of the search space, which starts from all frequent 1-itemsets and moves upward to the longest frequent itemsets. The most important advantage of this model is that it can effectively prune the search space by exploiting the downward closure property: once an itemset is recognized as infrequent, all of its supersets are also infrequent. However, this benefit fades when most of the maximal frequent itemsets are located near the largest itemset of the search
lattice, due to a comparatively small support threshold. In this situation, there are very few itemsets to be pruned (Su, J. H., Lin, W. Y., 2004). The other itemset traversal method is top-down traversal, which proceeds in the opposite direction, i.e. starting from the longest itemsets and moving downward to the frequent 1-itemsets (Su, J. H., Lin, W. Y., 2004). This strategy is traditionally applied for discovering maximal frequent itemsets (Tseng, M.C. & Lin, W.Y., 2001; cited by Su, J. H., Lin, W. Y., 2004). However, although all of the frequent itemsets can be derived from the maximal ones, additional counting is needed to obtain their exact supports for computing the confidences of association rules. Moreover, if there is a huge number of items and/or the support threshold is very low, many infrequent itemsets have to be visited before the maximal frequent itemsets are identified. This is why most work on frequent itemset mining adopts the bottom-up paradigm instead (Su, J. H., Lin, W. Y., 2004).

Search strategy: While the search direction determines the way in which the search space is explored, the search strategy determines the order in which itemsets are visited (Su, J. H., Lin, W. Y., 2004). One of these strategies is breadth-first search (BFS). Most Apriori-like algorithms apply BFS because it facilitates the pruning of candidates with downward closure. This strategy, however, needs more memory to keep the frequent subsets of the pruned candidates (Su, J. H., Lin, W. Y., 2004). The other strategy is depth-first search (DFS), which recursively visits the descendants of an itemset. In the literature, this strategy is usually combined with the vertical intersection counting strategy because it is enough to keep in memory the tidlists corresponding to the itemsets on the path from the root down to the itemset currently being inspected (Su, J. H., Lin, W.
Y., 2004).

Table 2.1: Factors for classification of ARM

DIMENSION                     VALUES
Target                        Complete, Constrained, Qualitative
Type                          Regular (Boolean), Generalized, Quantitative, etc.
Data type                     Database Data, Text
Data source                   Market Basket, Beyond Basket
Technique                     Large Itemset, Strongly Collective Itemset
Itemset Strategy              Complete, Apriori, Dynamic, Hybrid
Transaction Strategy          Complete, Sample, Partitioned
Itemset Data Structure        Hash Tree, Trie, Virtual Trie, Lattice
Transaction Data Structure    Flat File, TID
Optimization                  Memory, Skewed, Pruning
Architecture                  Sequential, Parallel
Parallel Strategy             None, Data, Task
Pattern Kind                  Sequential Pattern, Frequent Itemset, Structured Pattern
Rule Kind                     Association Rule, Strong gradient relationship, correlation (han book)
Counting Strategy             Horizontal, Vertical
Search Direction              Bottom-Up Traversal, Top-Down Traversal, Hybrid
Search Strategy               BFS, DFS
Candidate generation          Complete, Heuristic

2.3.3 Association Rule Mining Approaches: Apriori Approach

AIS Algorithm: The AIS (Agrawal, Imielinski, Swami) algorithm was the first algorithm proposed for mining association rules (Agrawal et al, 1993). It concentrates on improving the quality of databases together with the functionality needed to process decision support queries. According to (Zhao, Q., Bhowmick, S.S., 2003), this algorithm produces only association rules with a one-item consequent. That is, the consequent of each rule contains only one item; for example, we only
produce rules like ABC → D but not rules like AB → CD.

Disadvantage: The main disadvantage of the AIS algorithm is that it produces too many candidate itemsets that eventually turn out to be small, which requires more space and wastes effort. In addition, this algorithm needs too many passes over the whole database (Zhao, Q., Bhowmick, S.S., 2003).

3.3.3.3 Apriori Algorithm:

Apriori, first introduced in (Agrawal, R., & Srikant, R., 1994), is a great improvement in the history of association rule mining. AIS is a straightforward approach that needs many passes over the database, produces many candidate itemsets and stores a counter for each candidate, although most of them turn out not to be frequent. Apriori is more efficient during the candidate generation process for two reasons: it applies a different candidate generation method and a new pruning technique (Zhao, Q., Bhowmick, S.S., 2003).

a) Problem & limitation of Apriori: One bottleneck is the complex candidate generation process, which consumes most of the time, space, and memory. Another bottleneck is the multiple scans of the database. Many new algorithms were designed as modifications or improvements of the Apriori algorithm. Broadly, there were two approaches. The first tries to reduce the number of passes over the whole database, or to replace the whole database with only part of it based on the current frequent itemsets. The other explores different pruning techniques to make the number of candidate itemsets much smaller. Apriori-TID and Apriori-Hybrid (Agrawal, R., & Srikant, R., 1994), DHP (Park et al, 1995; cited by Zhao, Q., Bhowmick, S.S., 2003) and SON (Savasere et al, 1995) are modifications of the Apriori algorithm (Zhao, Q., Bhowmick, S.S., 2003).

3.3.3.4 Optimized Apriori algorithms:

In response to the problems of Apriori mentioned in the previous section, several new approaches have been introduced.
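As a baseline for the optimizations discussed in this subsection, the level-wise candidate generation and pruning of Apriori can be sketched as follows. This is a simplified sketch under stated assumptions: the join step unions any two frequent (k-1)-itemsets that yield a k-itemset rather than joining on a shared prefix as in the original paper, transactions are assumed to be Python sets, and the function names and the absolute threshold `min_count` are chosen only for this illustration.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.
    Join: union two frequent (k-1)-itemsets that differ in one item.
    Prune: drop a candidate if any of its (k-1)-subsets is not frequent."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue
            if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

def apriori(transactions, min_count):
    """Level-wise search with one pass over the transactions per itemset length."""
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        candidates = apriori_gen(frequent.keys(), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one database pass for level k
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

# Hypothetical baskets, used only to exercise the sketch.
baskets = [
    {"milk", "cheese"},
    {"milk", "cheese", "yogurt"},
    {"cheese", "yogurt"},
    {"milk", "yogurt"},
]
for itemset, count in apriori(baskets, min_count=2).items():
    print(sorted(itemset), count)
```

The two bottlenecks named above are both visible in the loop: `apriori_gen` is called once per level, and the inner scan over `transactions` is repeated for every level, which is exactly what the optimized variants try to reduce.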
They are presented in the following subsections: transaction and item pruning, and reduction of the number of database passes.

a) Transaction and Item Pruning: This is one of the main optimizations of the Apriori algorithm. There is no
need to inspect the whole database each time the occurrences of candidate itemsets have to be counted. This optimization drastically reduces the time needed to count the support of the candidate sets and enhances performance. Transaction pruning is used in three algorithms: AprioriTID, Apriori Hybrid and DHP.

AprioriTID, Apriori Hybrid: AprioriTID was introduced in the same paper as Apriori. Although the paper does not state it explicitly, AprioriTID uses transaction pruning to improve on the performance of Apriori. The main difference is that it does not use the whole database to count the support of candidate sets but an alternative set that represents it (Ayad, A. M., 2000). The main disadvantage of this algorithm is that the alternative set representing the database may exceed the size of the actual database in the early stages, so that the algorithm loses its edge over Apriori. Because of this disadvantage, another algorithm, Apriori Hybrid, was introduced. It uses Apriori in the first stages and then switches to AprioriTID when transaction pruning becomes more effective (Ayad, A. M., 2000).

DHP: DHP (Direct Hashing and Pruning) is another algorithm, introduced by (Park et al, 1995). It uses probabilistic counting to reduce the number of candidate itemsets counted during each round of Apriori execution. This reduction is achieved by subjecting each candidate k-itemset to a hash-based filtering step in addition to the pruning step (Ayad, A. M., 2000). During candidate counting in round k-1, the algorithm builds a hash table. Each entry in the hash table is a counter that holds the sum of the supports of the k-itemsets that hash to that particular entry. The algorithm uses this information in round k to prune the set of candidate k-itemsets. After subset pruning as in Apriori, the algorithm can remove a candidate itemset if the count in its hash table entry is smaller than the minimum support threshold.
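The hash-based filtering step can be sketched for the case k = 2, where the hash table is filled while the single items are counted. This is a minimal sketch and not the implementation of (Park et al, 1995): the CRC-based hash function, the number of buckets, the assumption that items are strings, and all function names are choices made only for this illustration.

```python
import zlib
from itertools import combinations

def bucket_of(itemset, n_buckets):
    """Deterministic hash of an itemset (items assumed to be strings) into a bucket index."""
    key = "|".join(sorted(itemset)).encode()
    return zlib.crc32(key) % n_buckets

def dhp_frequent_pairs(transactions, min_count, n_buckets=8):
    item_counts = {}
    buckets = [0] * n_buckets
    # Round 1: count single items and, in the same pass, hash every
    # 2-subset of each transaction into the bucket table.
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair, n_buckets)] += 1
    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    # Round 2: a candidate pair must pass subset pruning (both items frequent)
    # and the hash filter (its bucket count reaches min_count).
    candidates = [
        frozenset(p)
        for p in combinations(frequent_items, 2)
        if buckets[bucket_of(p, n_buckets)] >= min_count
    ]
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    frequent_pairs = {c: n for c, n in counts.items() if n >= min_count}
    return candidates, frequent_pairs

# Hypothetical baskets, used only to exercise the sketch.
baskets = [
    {"milk", "cheese"},
    {"milk", "cheese", "yogurt"},
    {"cheese", "yogurt"},
    {"milk", "yogurt"},
    {"milk", "butter"},
]
candidates, pairs = dhp_frequent_pairs(baskets, min_count=2)
print(len(candidates), {tuple(sorted(p)): n for p, n in pairs.items()})
```

Because several pairs can share a bucket, a bucket count is an upper bound on the support of each pair hashed to it. The filter can therefore let infrequent pairs through, but it never discards a frequent one, which is why the exact counting pass is still required.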
DHP has two advantages. First, the algorithm is also based on the monotone Apriori property: a hash table is built to reduce the candidate space by pre-computing the approximate support of the (k+1)-itemsets while counting the k-itemsets. Second, DHP applies transaction trimming, which removes the transactions that do not contain any frequent items. However, this trimming and the pruning properties caused some problems that made it impractical in many cases (Ayad, A. M., 2000).
b) Reducing the number of database passes: As mentioned before, the main disadvantage of classical Apriori is the number of passes it has to make over the database, which is equal to the length of the longest frequent itemset (pattern) present in the database (Zhao, Q., Bhowmick, S.S., 2003). Many optimization efforts focused on reducing the number of database passes. They differed, however, in how the number of passes was decreased. This is the focus of this section.

Database Partitioning: (Savasere et al, 1995) developed Partition, an algorithm that requires only two scans of the transaction database. The database is divided into disjoint partitions, each small enough to fit in memory. In the first scan, the algorithm reads each partition and computes the locally frequent itemsets of each partition using Apriori. In the second scan, the algorithm counts the support of all locally frequent itemsets against the complete database. If an itemset is frequent with respect to the complete database, it must be frequent in at least one partition; therefore, the itemsets counted in the second scan form a superset of all frequent itemsets. The main achievement of Partition is the reduction of database activity. It was shown that this reduction was not obtained at the expense of more CPU utilization. It was also shown, however, that the number of partitions greatly affects the performance of the algorithm, because it affects the number of locally frequent itemsets that turn out to be globally infrequent. The algorithm was also shown to be vulnerable to data skew (Ayad, A. M., 2000).

Dynamic itemset counting: (Brin et al, 1997) proposed the Dynamic Itemset Counting (DIC) algorithm. DIC partitions the database into several blocks marked by start points and repeatedly scans the database. In contrast to Apriori, DIC can add new candidate itemsets at any start point, instead of only at the beginning of a new database scan.
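The two-scan Partition scheme of (Savasere et al, 1995) described earlier in this subsection can be sketched as follows. This is a minimal sketch under stated assumptions: the local mining step is a brute-force stand-in for Apriori, the support threshold is given as a fraction, the transaction list is assumed to be non-empty, and the function names are chosen only for this illustration.

```python
from itertools import combinations

def local_frequent(partition, min_fraction):
    """Itemsets whose support inside one partition reaches min_fraction.
    Brute-force stand-in for running Apriori on the partition."""
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for combo in combinations(sorted(t), k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    threshold = min_fraction * len(partition)
    return {s for s, c in counts.items() if c >= threshold}

def partition_mine(transactions, min_fraction, n_partitions):
    size = -(-len(transactions) // n_partitions)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: the union of the locally frequent itemsets is the global candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_fraction)
    # Scan 2: count every candidate against the complete database.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    threshold = min_fraction * len(transactions)
    return {c: n for c, n in counts.items() if n >= threshold}

# Hypothetical baskets, used only to exercise the sketch.
baskets = [
    {"milk", "cheese"},
    {"milk", "cheese", "yogurt"},
    {"cheese", "yogurt"},
    {"milk", "yogurt"},
]
for itemset, count in partition_mine(baskets, min_fraction=0.5, n_partitions=2).items():
    print(sorted(itemset), count)
```

In this example the itemset containing all three products is locally frequent in the first partition but globally infrequent, so it is counted in the second scan and then discarded. This is the effect by which the number of partitions influences the amount of wasted counting.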
At each start point, DIC estimates the support of all itemsets that are currently being counted and adds a new itemset to the set of candidate itemsets if all of its subsets are estimated to be frequent (Brin et al, 1997). If DIC adds all frequent itemsets and their negative border to the set of candidate itemsets during the first scan, it will have counted each itemset's exact support at some point during the second scan; thus DIC will complete in two scans (Ayad, A. M., 2000). The Dynamic Itemset Counting (DIC)
reduces the number of I/O passes by counting candidates of multiple lengths in the same pass. DIC performs well on homogeneous data, while in other cases it might scan the database more often than the Apriori algorithm.

c) Both

Sampling: (Toivonen, H., 1996) proposed a sampling-based algorithm that typically requires two scans of the database. The algorithm first takes a sample from the database and generates a set of candidate itemsets that are highly likely to be frequent in the complete database. In a subsequent scan over the database, the algorithm counts the exact supports of these itemsets and the supports of their negative border. If no itemset in the negative border is frequent, the algorithm has discovered all frequent itemsets. Otherwise, some superset of an itemset in the negative border could be frequent, but its support has not yet been counted. The sampling algorithm generates and counts all such potentially frequent itemsets in a further database scan (Toivonen, H., 1996). The algorithm was shown to perform well compared with other level-wise algorithms and with the Partition algorithm. Database activity is effectively reduced to one pass. The drawback of the algorithm, however, is that it has to test many spurious candidates, because it lowers the support threshold in order to guarantee a superset of the actual frequent itemsets (Ayad, A. M., 2000).

Conclusion of Apriori Approach: Most of the algorithms introduced above are based on the Apriori algorithm and try to improve its efficiency through modifications such as reducing the number of passes over the database, reducing the size of the database to be scanned in every pass, pruning the candidates by different techniques, and using sampling (Zhao, Q. & Bhowmick, S.S., 2003).
However, the Apriori algorithm has two bottlenecks: the first is the complex candidate generation process, which uses most of the time, space, and memory, and the second is the multiple scans of the database. Nevertheless, Apriori is used in many applications for discovering patterns in large databases.

2.4 Mining Changes Literature Review:

As mentioned in the chapter on customer behavior analysis, mining changes plays a very important role in business strategy and marketing. In this chapter, we briefly review the literature related to mining changes in customer buying behavior patterns and determine the position of my work within this research. In the past, researchers generally
applied statistical surveys to study customer behavior. Recently, however, data mining techniques have been adopted to describe and predict customer behavior (Giudici & Passerone, 2002; Song, Kim, & Kim, 2001 cited by Song et al, 2001). There is some related work on mining in a dynamic environment, although not as much as has been done on customer behavior modeling or prediction. Liu et al. (2000) devised a method of change mining in the context of decision trees for predicting changes in customer behavior. Since the decision tree is a classification-based approach, it cannot detect complete sets of changes (Song et al, 2001). Association rule extraction has been widely used to analyze the correlation between product items purchased by customers and to support sales promotion and market segmentation (Changchien & Lu, 2001; Changchien, Lee, & Hsu, 2004; cited by Song et al, 2001). (Song et al, 2001) employed an approach based on association rules to identify changes in customer behavior. (Chen et al, 2005) employed another approach to recognize changes in customer behavior using association rule mining methods.

Existing work has been carried out on learning in a changing environment (Freund and Mansour, 1997; cited by Song et al, 2001, Helmbold & Long, 1994; cited by Song et al, 2001; Widmer, 1996; cited by Song et al, 2001) and on mining in a changing environment (Bay and Pazzani, 1999; Ganti, Gehrke, Ramakrishnan, 1999; Han, Kamber, 2001; Liu et al, 2000; Nakhaeizadeh, Taylor, Lanquillon, 1998; cited by Song et al, 2001). For example, (Freund and Mansour, 1997 cited in Song et al, 2001) present a model of learning in a changing distribution. All the following related works focus on dynamic aspects or on the comparison between two different datasets or rule sets. They are grouped into six categories in this chapter.
According to (Song et al, 2001), there are six groups of work in the area of data mining in a changing environment. These are as follows.

The first field that studies mining in a changing environment is rule maintenance (Cheung, Han, Ng & Wong, 1996a; Cheung, Ng, Tam, 1996b; Feldman, Aumann, Amir & Mannila, 1997; cited by Song et al, 2001, Thomas, Bodagala, Alsabti & Ranka, 1997). The purpose of these studies is to improve accuracy in a changing environment. An example is the study by (Thomas et al, 1997), which proposed an incremental updating technique, based on Negative
Borders, for the maintenance of association rules when new transaction data is added to or removed from the transaction database. An important aspect of this technique is that it requires a full scan of the database if the database update causes the negative border of the set of large itemsets to expand. However, these techniques do not report any changes to the user; they only maintain existing knowledge.

The second research stream related to our work is the discovery of emerging patterns (Agrawal, R. & Psaila, G., 1995; Dong, G., & Li, J., 1999; Li et al, 2000). These studies try to find emerging patterns (EPs), which are defined as itemsets whose supports increase significantly from one dataset to another. (Agrawal, R. & Psaila, G., 1995) established the active data mining paradigm, in which data is continuously mined at a desired frequency. As rules are discovered, they are added to a rulebase, and if they already exist, the history of the statistical parameters associated with the rules is updated. When the history begins exhibiting certain trends, specified as shape queries in the user-specified triggers, the triggers are fired and appropriate actions are initiated. (Dong, G., & Li, J., 1999) introduced the data mining problem of emerging patterns (EPs). (Li et al, 2000) proposed the use of jumping emerging patterns (JEPs) as the basis for a new classifier called the JEP-Classifier. Each JEP can capture some crucial difference between a pair of datasets, and aggregating all JEPs with large supports can produce more potent classification power. They use two algorithms for constructing the JEP-Classifier, both of which are scalable and efficient. These algorithms make use of the border representation to store and manipulate JEPs efficiently. EPs can capture emerging trends in timestamped databases, or useful contrasts between data classes, but they do not consider the structural changes in the rules (Song et al, 2001).
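The idea of an emerging pattern can be made concrete with a small sketch. The text above does not give a formula, so the growth rate used here (support in the newer dataset divided by support in the older one, with an infinite value when the itemset is absent from the older dataset) is an assumption based on a commonly used formulation. The function names, the example data and the threshold are likewise chosen only for this illustration.

```python
import math

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def growth_rate(itemset, old, new):
    """Ratio of the support in `new` to the support in `old` (assumed definition)."""
    s_old = support(itemset, old)
    s_new = support(itemset, new)
    if s_old == 0:
        # Absent before: no growth if still absent, otherwise a "jumping" pattern.
        return 0.0 if s_new == 0 else math.inf
    return s_new / s_old

def emerging_patterns(itemsets, old, new, min_growth):
    """Keep the itemsets whose growth rate from `old` to `new` reaches min_growth."""
    rates = {i: growth_rate(i, old, new) for i in itemsets}
    return {i: r for i, r in rates.items() if r >= min_growth}

# Hypothetical transactions from two time periods.
period_1 = [{"milk"}, {"milk", "cheese"}, {"cheese"}, {"milk"}]
period_2 = [{"milk", "cheese"}, {"milk", "cheese"}, {"cheese", "yogurt"}, {"milk"}]
watched = [frozenset({"milk"}), frozenset({"milk", "cheese"}), frozenset({"yogurt"})]
for itemset, rate in emerging_patterns(watched, period_1, period_2, min_growth=2).items():
    print(sorted(itemset), rate)
```

In this example the pair of milk and cheese doubles its support and the yogurt itemset appears only in the second period, which corresponds to the jumping case. As noted above, such a comparison looks only at supports and says nothing about structural changes in the rules.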
Another related research stream is subjective interestingness in data mining (Liu & Hsu, 1996; Liu et al, 1997; Liu, et al, 1999; Padmanabhan & Tuzhilin, 1999, Silberschatz & Tuzhilin, 1996; Suzuki, 1997 cited by Song et al, 2001). These studies provide a number of techniques for finding rules that are unexpected with respect to the user's existing knowledge. For example, (Liu & Hsu, 1996) try to bridge the gap between the user and the rules created by an induction system. A fuzzy matching technique is proposed for rule comparison in the context of classification rules. It allows the user to compare the produced rules with his/her hypotheses or existing knowledge in order to find out what is right and what is wrong about that knowledge, and to tell what has changed since the last learning. This technique is also helpful in data mining for solving the interestingness problem. (Liu et al, 1997 cited in Song et al, 2001) study the problem of analyzing discovered rules against a
particular form of existing concepts, namely general impressions (GIs). A specification scheme for representing GIs is proposed and two matching algorithms for analyzing discovered rules are presented. This technique is useful for solving the interestingness problem. (Padmanabhan & Tuzhilin, 1999 cited in Song et al, 2001) proposed a new definition of the unexpectedness of a rule with respect to a belief and presented an algorithm that finds unexpected association rules in data using this measure. (Silberschatz & Tuzhilin, 1996 cited in Song et al, 2001) noted that measures of interestingness of patterns in data mining applications can be categorized into objective and subjective measures. They classified subjective measures into unexpectedness and actionability and argued, at the intuitive level, that these two measures of interestingness are independent of each other. Although actionability appears to be the major concept, they consider it a difficult notion to capture formally. Since they hold that most unexpected patterns are actionable and most actionable patterns are unexpected, they proposed to capture actionability via unexpectedness. Consequently, they studied "unexpectedness" as a measure of interestingness and described the interestingness of a pattern in terms of how strongly it "shakes" the existing system of beliefs. By this definition, unexpected patterns are also more interesting than expected ones. All of the above works study subjective measures of interestingness, but these techniques cannot be applied to detecting changes, because their analysis only compares each newly generated rule with each existing rule to discover degrees of difference; it does not find which aspects have changed, what kinds of changes have taken place and how much change has happened.

The fourth research stream is mining from time-series data.
There is interest in discovering regularities in time-series data (Das et al, 1997; Das et al, 1998; Han, Dong & Yin, 1999 cited in Song et al, 2001). (Das et al, 1998 cited in Song et al, 2001) consider the problem of finding rules relating patterns in a time series to other patterns in that series, or patterns in one series to patterns in another series. Their emphasis is on the discovery of local patterns in multivariate time series, in contrast to traditional time series analysis, which mainly focuses on global models. (Han et al., 1999 cited in Song et al, 2001) present several algorithms for the efficient mining of partial periodic patterns by exploring some interesting properties related to partial periodicity. (Das et al, 1997 cited in Song et al, 2001) also present an intuitive model for measuring the similarity between two time series; this model takes into account outliers, different scaling functions and variable sampling rates. These studies, however, are rather different from my research
which centers on the detection of irregularity rather than regularity from data.

The fifth research field is mining class comparisons to differentiate between classes (Bay, D.S. & Pazzani, M., J., 1999; Ganti et al, 1999 cited in Song et al, 2001; Han, J., & Kamber, M., 2006). (Ganti et al, 1999 cited in Song et al, 2001) present a general framework for measuring changes between two models. They develop the FOCUS framework for calculating an interpretable, suitable deviation measure between two datasets in order to quantify the differences between the interesting characteristics of each dataset. Fundamentally, the difference between the two models is quantified as the amount of work needed to change one model into the other. Their framework covers a wide variety of models, including frequent itemsets, decision tree classifiers, and clusters, and captures standard measures of deviation such as the misclassification rate and the chi-square metric as special cases. It offers deviation measures between two mining models and focused regions, but it cannot be used directly to detect customer behavior changes because it does not show which aspects have changed and which kinds of changes have occurred. (Bay, D.S. & Pazzani, M., J., 1999); (Han, J., & Kamber, M., 2006) also provide techniques for understanding the differences between several contrasting groups, but these techniques can only identify changes in rules with the same structure.

Finally, (Liu et al, 2000) present a technique for change mining by overlapping two decision trees produced from different time snapshots, but this technique cannot identify complete sets of changes. Since decision tree techniques operate on a specified target class, only changes in the designated consequent attributes can be detected. This approach can be applied only in cases that have a precise research question. In addition, this technique does not offer any information about the degree of change.
(Song et al, 2001) studied how to understand and adapt to changes in customer behavior for an internet-based company. The aim of that research is to develop a methodology that automatically discovers changes in customer behavior from customer profiles and sales data at different time snapshots. They defined three types of changes (emerging patterns, unexpected changes and added/perished rules) and then similarity and difference measures for rule matching to detect all types of changes. Finally, the degree of change is assessed to detect significantly changed rules and rank them. Their proposed methodology can evaluate the degree of change and automatically find all kinds of changes from data at different time snapshots. Another related study was carried out by (Chen et al, 2005), who integrate customer behavioral variables, demographic variables, and a transaction database to establish a method of
mining changes in customer behavior. The behavioral variables, RFM, coupled with a growth matrix of customer value, are used to estimate the value that individual customers bring to the business. Association rules are used to identify the associations between customer profiles and the product items purchased. For mining change patterns, two extended measures of similarity and unexpectedness are designed to assess the degree of similarity between patterns at different time periods. Finally, an online query system gives marketing managers a tool for fast information search and valuable information based on timely feedback. (Cho et al, 2005) studied changes in customer buying behavior for recommendation systems. They argue that customer needs change over time, so changes in customer preferences should be taken into account to improve the accuracy of the recommendations made. They suggest a new methodology for improving the quality of Collaborative Filtering (CF) recommendation that uses customer purchase sequences. The proposed methodology is applied to a large department store in Korea and compared to existing CF techniques. Experiments using real-world data show that the proposed methodology provides higher quality recommendations than classic CF techniques, with better performance, particularly with regard to heavy users. (Au & Chan, 2005 cited in Song et al) present another technique for finding changes in association rules. They define the problem of mining changes in association rules over time. The proposed approach permits different fuzzy data-mining techniques to be used for tackling this problem. Given a set of database partitions, each of which contains a set of transactions gathered in a specific time period, a set of association rules is found in each database partition. 
They suggest performing data mining on the discovered association rules so as to expose the regularities governing how the rules change across time periods. They propose using linguistic variables and linguistic terms to represent the changes in the discovered association rules. In particular, fuzzy decision trees are built to discover these changes. The fuzzy decision trees are then converted into a set of fuzzy rules, called fuzzy meta-rules because they are rules about rules. By doing so, the changes hidden in the data can be exposed and presented to human users in a comprehensible form. In addition, the discovered changes can also be used to forecast changes in the future.
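To make the setting above concrete, the following minimal sketch measures the same association rule in each time partition, so that its support and confidence can be compared across periods. It uses only the standard definitions of support and confidence and does not reproduce the fuzzy decision trees or meta-rules of the cited approach. The function name `rule_stats`, the partition names, and the toy transactions are hypothetical and chosen only for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of transactions containing antecedent and consequent
    confidence = share of antecedent transactions that also contain consequent
    """
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: one list of transactions per time partition.
partitions = {
    "period_1": [{"milk", "cheese"}, {"milk"}, {"milk", "cheese", "yogurt"}, {"yogurt"}],
    "period_2": [{"milk", "cheese"}, {"cheese"}, {"milk", "yogurt"}, {"milk"}],
}

# Measure the rule {milk} -> {cheese} separately in every partition.
rule_over_time = {
    period: rule_stats(transactions, {"milk"}, {"cheese"})
    for period, transactions in partitions.items()
}

for period, (support, confidence) in rule_over_time.items():
    print(period, round(support, 3), round(confidence, 3))
```

A drop in support or confidence from one partition to the next is the kind of raw signal that a change-mining method would then have to interpret and rank.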
Figure 2.5: Mining in a changing environment review (diagram grouping the reviewed literature under Learning in a Changing Environment, Mining in a Changing Environment, Mining Changes, Rule Maintenance, Patterns, Subjective Interestingness, Mining from Time Series Data, and Mining Class Comparisons)
Table 2.2: Mining in a changing environment timetable (subjects: Mining Changes, Class Comparison, Time Series, Subjective Interestingness, Rule Maintenance, Pattern; years: 1995 to 2005)

2.5 Customer segmentation review:

The mass marketing approach can no longer satisfy the needs of today's varied customers. This variety should be addressed through segmentation, which splits markets into customer clusters with similar needs and/or features that are likely to show similar purchasing behaviors (Dibb & Simkin, 1996 cited by Tsai, C., Y., Chiu, C., C., 2004). Segmentation theory suggests that groups of customers with similar needs and purchasing behaviors are likely to respond more homogeneously to marketing programs that target specific consumer groups (Tsai, C., Y., Chiu, C., C., 2004). Market segmentation has accordingly been regarded as one of the most vital elements in achieving successful modern marketing and customer relationship management (CRM) (Berson, Smith, & Thearling, 2000 cited by Tsai, C.,Y., Chiu, C.,C., 2004). Segmentation variable selection is a critical concern for successful market segmentation. "Segmentation variables can be classified into general variables and product specific variables" (Wedel & Kamakura, 1997 cited by Tsai, C., Y., Chiu, C., C., 2004). The general variables consist of customer demographics and
lifestyles. The product specific variables comprise customer purchasing behaviors and intentions. Many studies have used general variables to segment customers because these variables are intuitive and easy to operate (Beane & Ennis, 1987; Hammond et al., 1996 cited by Tsai, C., Y., Chiu, C., C., 2004). Market segmentation based on general variables is more intuitive and easier to conduct than segmentation based on product specific variables. However, the assumption that customers with similar demographics and lifestyles will show similar purchasing behavior is uncertain (Tsai, C., Y., Chiu, C., C., 2004). Here, we briefly review some segmentation methods from the literature.

2.5.1 Clustering Analysis

Data mining is an analytic method for summarizing useful knowledge and discovering useful patterns in large volumes of data (Wu, J., Lin, Z., 2005). In market research, clustering is an effective and commonly used method for market segmentation and for identifying target markets and customer segments. Clustering can be used as an independent tool to show the data distribution, monitor each cluster's characteristics and carry out additional analysis of specific clusters if required (Wu, J., Lin, Z., 2005).

2.5.2 Customer Segmentation Model

The customer segmentation concept was introduced by the American marketing expert Wendell R. Smith in the mid-1950s. "Customer segmentation refers to classifying customers by their value, demands, preference and other factors in the circumstances of clear organization strategies, business model and targeted market". Customers in one group share definite similarities, whereas different customer segments have distinct characteristics. A customer segmentation model is built by classifying customers according to set standards on selected segmentation variables. There are two types of consumption-based customer segmentation models (Wu, J., Lin, Z., 2005).

2.5.3 RFM Model

The RFM segmentation model distinguishes important customers by three variables: the customer's consumption interval, frequency and amount of money spent. R stands for recency, the interval between the customer's latest purchase and the present. The shorter the interval, the larger R is. F stands for frequency, the number of purchases in a period of time. M stands for monetary, the amount of money spent in a period of time. Research shows that the larger the R and F values are, the more likely the customer is to make a new deal with the enterprise.
Furthermore, the larger M is, the more likely the customer is to respond to the enterprise's products and services again (Wu, J., Lin, Z., 2005). The RFM method is very successful for customer segmentation. We can sort customers by the date of their last purchase, placing the most recent customer first, and then classify them into groups. F and M are then standardized and sorted in the same way. At this point, each customer is placed in a three-dimensional space with coordinates (R, F, M). By calculating R*F*M, the RFM value of each customer can be obtained (Wu, J., Lin, Z., 2005). With these RFM values sorted, customers can be classified into groups according to a chosen proportion. For example, for a commercial enterprise, customers whose RFM values are in the top 20 percent can be considered its most valuable customers (Wu, J., Lin, Z., 2005). It is essential to quantify customer behavior so that we can analyze the short and long term outcomes of our segmentation formulae. The purpose of RFM is to give a simple framework for customer behavior analysis. Once customers are assigned RFM behavior scores, they can be grouped into segments and the resulting effectiveness analyzed. This effectiveness analysis then forms the basis for future decisions on customer contact frequency (Miglautsch, J.R., 2001). Several methods for RFM scoring appear in the literature, as follows.

2.5.4 RFM Scoring

The purpose of RFM scoring is to project future behavior (driving better segmentation decisions). To allow projection, it is critical to translate customer behavior into numbers that can be used over time (Miglautsch, J.R, 2001). Too often, direct marketers use static customer selections. When initially building their segmentation system, they define some factors with some thresholds. If these thresholds stay fixed, the results will become poorer and poorer over time. 
This is called the bracket creep problem (Miglautsch, J.R, 2001). Some common scoring methods are used to avoid this problem.

a. Customer Quintiles

The most common scoring method is to sort customers in descending order (best to worst). Customers are then divided into five equal groups, or quintiles. The best group receives a score of 5, the worst a score of 1. For Recency, customers are sorted by days since last purchase: the lower the number of days, the better the score. For Frequency, customers are sorted by number of purchases: the higher the number of
purchases, the better the score. For Monetary, customers are sorted by the amount of money spent: the higher the amount, the higher the score. Each time customers are scored, a new relative segmentation scheme is built. This has the benefit of quantifying customer behavior in a way that can be projected into the future (Miglautsch, J.R, 2001). The relatively best customers always fall into the 5, 5, 5 category. It is essential to recognize where the cutoff points fall, since they change automatically with each customer scoring. The customer quintile method has the benefit of yielding equal numbers of customers in each segment. There are five equal groups for each of R, F and M, generating 125 equal-sized segments overall. An initial analysis would be to contact all customers, look at the performance of each individual cell and understand how different customer segments perform (Miglautsch, J.R, 2001). The customer quintile method does run into some scoring challenges in the area of Frequency. In most direct marketing customer files, a high percentage of the customers have ordered only once. This percentage is frequently as high as 30% to 60%. If more than 20% of the customers have only one purchase, then the lowest Frequency group will have a purchase count of 1. Since that group cannot hold all the customers with only one purchase, some of them will be sorted into the 2 score group. Their behavior is identical to that of the customers in the 1 score group; they simply overflowed. If 40% of the customers had only one purchase, then both the 1 and 2 score groups would have identical behavior. If the percentage ran as high as 60% (which is not that unusual), then three of the five quintiles would have identical behavior. Given the purpose of RFM, this would be a less than satisfying result. A second concern with the quintile method is its relative sensitivity. At the high end of our Frequency model customers average 7.4 purchases. 
That is significantly more than the 1.0 purchases at the bottom and approximately twice as great as the 3.4 purchases in the 4 score group. However, the Pareto Principle (commonly called the 80/20 rule) still applies within the 5 score group. This means that a small number of very large customers and a larger number of relatively smaller customers make up that 7.4 average (Miglautsch, J.R, 2001). As long as our segmentation method is built primarily for mailing purposes, this difference is of little consequence. Certainly the 5 and 4 groups would be mailed. However, if our RFM model is being used to support telemarketing or field sales contact, extra sub-segments would be vital to identify the super customers. The customer quintile scoring method produces some unsatisfying results at both the top and bottom of the scale. It tends to group together customers who have hugely different
buying behavior (at the top) and arbitrarily break apart customers who have the same behavior (at the bottom) (Miglautsch, J.R, 2001).

b. Behavior Quintile Scoring

An alternative scoring method was developed by John Wirth. It also sorts customers by behavior but, instead of building arbitrary cutoffs at a fixed percentage of the customers, it produces cutoffs on a percentage of behavior. This method appears to overcome the sensitivity problems mentioned above. Five groups are still produced, but the monetary score, for example, would yield equal amounts of sales in each quintile. Behavior scoring has the benefit of grouping customers by similar behavior. Since segmentation decisions are based on past customer behavior, this permits better segmentation (Miglautsch, J.R, 2001).

i. Frequency

The Behavior method does suffer from similar problems when scoring Frequency. If we start at the top of the Frequency sort and deduct each customer's frequency from the total Frequency, the customers who purchased only once may not account for 20% of the entire Frequency. In such a situation, some of the customers who have bought twice will be included in the 1 score group. It is also cumbersome to sort customers from top to bottom in a computer-generated scoring system. A particular sort file must be created and each scoring process must be run separately. The Mean scoring method, an additional improvement on the John Wirth method, was developed by Ted Miglautsch (Miglautsch, J.R, 2001). When scoring Frequency, single-purchase customers are given a score of 1. The system then averages the remaining Frequencies to find the mean. If a customer's total falls below the mean, the customer receives a score of 2. This process is repeated two more times, giving quintiles of behavior that are sensitive at both ends of the scale and allowing many variables to be scored at the same time (Miglautsch, J.R, 2001).

ii. Recency

Because previous behavior is the best predictor of future behavior, Recency is normally considered the most influential of the three variables. Recency plays an important role in direct marketing decision making. Recent customers are considered viable for a certain length of time. Unlike Frequency and Monetary, which only accumulate, customers reset their Recency with each new purchase. At the center of Recency is the fact that most of the
customers fall into two groups: hot and dead. Although Recency can be scored by sorting customers by days since last purchase, industry list conventions suggest a more calendar-based method. Hotline names normally represent purchasers within the last three months, or 90 days. Business-to-business direct marketers often lengthen these time frames, since their customers can stay viable even though individuals change (Miglautsch, J.R, 2001).

c. Weighting

With relational, database-driven marketing databases becoming more common, most marketers can select R, F and M scores separately. Others are not as lucky and need a single field to do the work of all three variables. The benefit of a single variable is that customers can be segmented with a single query on one field (Miglautsch, J.R, 2001). Donald R. Libey, in his book "Libey on RFM", proposes that Monetary, Frequency and Recency values can be added together (Miglautsch, J.R, 2001). Scoring is not explicitly discussed, but he presents a formula for creating a single RFM value. His method involves adding average order and Frequency per year. To build a composite score instead, marketers can multiply R by 3, F by 2 and M by 1. This gives the best customers a composite score of (5x3)+(5x2)+(5x1) = 30. This not only gives more influence to the most recent names, it also gives a bit of a boost to Frequency. The logic behind this is that if two customers have the same Recency and spent the same amount, but one purchased several times and the other only once, the more frequent buyer is much more likely to respond. One further enhancement is often employed in generating a composite score. Instead of multiplying by 3, 2 and 1, substitute 9.9, 6.6 and 3.3. This produces composite scores ranging from 19.8 to 99. It preserves the approximately 3x weighting of R while producing something closer to a 100 point scale (Miglautsch, J.R, 2001).

d. Life-to-Date

Generally, RFM scoring is based on life-to-date totals. 
It is often asked whether shortening the time frame would improve RFM scoring. The idea is that if Recency is so influential, maybe we should consider only the recent behavior of the past few years. This is an appealing proposal, but one fraught with danger. The basic idea, again, is to quantify behavior for the purpose of customer segmentation. It is clear that high RFM customers are easily recognized. The real challenge is to recognize viable customers beyond the 12 month window in areas such as direct marketing. Should any of them be mailed and marketed? Certainly some
should. Gaining this wider viewpoint requires that all available customer history be examined (Miglautsch, J.R, 2001).

2.5.5 Customer Value Matrix Model

The Customer Value Matrix grew out of a desire to apply RFM to the small-business retail environment. After some experiments with applying RFM in small businesses, it became clear that RFM was too difficult and time-consuming for them. The problem was that, while RFM was comparatively easy conceptually, the resulting segmentation was often complex to understand and even more difficult to use. With three values per RFM variable, RFM analysis produces 27 customer segments. For RFM analysis to be useful, the marketer must know which groups can be combined for a particular strategy or tactic (Marcus, C., 1998). Closer examination of the RFM analysis highlighted the collinearity of the Purchase Frequency and total Monetary Value variables: an extra purchase by a customer results in an increase in the total monetary value of that customer. Given this result, Charles Edmundson recommended using the Average Purchase Amount instead of the total Monetary Value of a customer. This eliminates the collinearity between the two variables. In addition, for greater clarity, the Purchase Frequency variable was renamed Number of Purchases. These changes were refinements over the usual RFM analysis; however, they did not resolve the problem of ending up with too many segments to interpret and work with (Marcus, C., 1998). To solve this issue, a simplified, more actionable version of RFM was needed. As a first step, the analysis centered on the two variables that best express the value of a customer: Number of Purchases and Average Purchase Amount. The third variable, Recency, gives additional information that can be combined with the two key variables. Other important variables, such as Type of Purchase or Length of Relationship, can also be used. 
Using just Purchase Frequency and Average Purchase Amount was part of the answer; in addition, the segmentation needed to be simplified to a 2 * 2 matrix. Matrices have been used effectively to help in understanding information for decision-making purposes. Perhaps the best known matrix is the Boston Consulting Group's (BCG) Growth-Share Matrix, which centers on the allocation of resources given the market share position and growth potential of a given set of business opportunities (Henderson, 1967; Porter, 1980, cited by Marcus, C., 1998). The BCG Growth-Share Matrix can be applied to market segments, products or even countries. It segments business opportunities into clearly defined groups (Cash cows, Stars, Dogs and Question marks). The use of a comparatively straightforward method and easy-to-understand quadrant identifiers
has made the BCG Matrix an effective analytical tool. The BCG Matrix adds further value by indicating which managerial strategies and tactics should be pursued for each business segment. Businesses that have high relative market share in low-growth markets (Cash cows) can be used to support other developing businesses, whereas low-relative-share businesses in low-growth markets are likely to be cash traps (Dogs). Simplifying the RFM analysis to center on the customer-value-based variables, Number of Purchases and Average Purchase Amount, and applying a 2 * 2 matrix to represent the resulting segmentation proved effective in arriving at a practical yet meaningful approach to customer segmentation (Marcus, C., 1998). The customer value matrix model is an improved model based on the traditional RFM model. In this model, the customer value matrix consists of the number of purchases (denoted by F) and the average purchase amount (denoted by A) (Wu, J., Lin, Z., 2005). The average purchase amount replaces the two variables in the RFM model between which there is multicollinearity, which removes the effect of their linear relationship on the RFM model. In the customer value matrix, the dividing value on each of the F and A axes is the average value of the corresponding variable. Once the division of the axes is determined, each customer is located in one of the quadrants of the customer value matrix. According to their values of A and F, customers are categorized into four groups in the matrix: customers who like to consume (shown by I), customers who are important to the enterprise (shown by II), customers who consume frequently (shown by III), and customers whose behavior is uncertain to the enterprise (shown by IV). The result is shown in Figure 1 (Wu, J., Lin, Z., 2005).
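The quadrant assignment described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: customers are split on the mean of F (number of purchases) and the mean of A (average purchase amount). Two points are assumptions made only for this sketch: which high/low combination corresponds to each of the groups I to IV, and the choice to treat a value exactly equal to the mean as "high". The function name `customer_value_matrix` and the sample figures are hypothetical.

```python
from statistics import mean


def customer_value_matrix(customers):
    """Assign each customer to a quadrant of the customer value matrix.

    customers: dict mapping customer id -> (F, A), where F is the number of
    purchases and A is the average purchase amount.
    The axis thresholds are the mean of F and the mean of A.
    """
    f_mean = mean(f for f, _ in customers.values())
    a_mean = mean(a for _, a in customers.values())

    segments = {}
    for customer_id, (f, a) in customers.items():
        high_f = f >= f_mean  # assumption: ties count as "high"
        high_a = a >= a_mean
        # Assumed mapping of quadrants to the four groups named in the text.
        if high_f and high_a:
            segments[customer_id] = "II: important to the enterprise"
        elif high_a:
            segments[customer_id] = "I: likes to consume"
        elif high_f:
            segments[customer_id] = "III: consumes frequently"
        else:
            segments[customer_id] = "IV: uncertain"
    return segments


# Hypothetical sample: (number of purchases, average purchase amount).
sample_customers = {
    "c1": (10, 200.0),
    "c2": (2, 250.0),
    "c3": (12, 40.0),
    "c4": (1, 30.0),
}
sample_segments = customer_value_matrix(sample_customers)
```

Because the thresholds are recomputed from the data each time, the quadrant boundaries move with the customer base, which is one reason the matrix can be rebuilt separately for each time snapshot.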
Chapter 3: Research Methodology
Background of the study (Problem definition); Research question; Research objectives; Research motivation; Research outline
3.1 Research Methodology:

3.2 Research Design:

A research design is a roadmap for conducting the marketing research project. It gives the details of each step in the project. Carrying out the research design should yield all the information needed to structure or solve the management-decision problem (Malhotra, K.N., 1996). Many designs may be suitable for a given marketing research problem. A good research design ensures that the information gathered will be relevant and useful to management and that all of the necessary information will be obtained. A good design should also help to ensure that the marketing research project is conducted effectively and efficiently (Malhotra, K.N., 1996). The research design of this study is illustrated in figure 3.1. Detailed descriptions are given below.

Figure 3.1: Research design of this study

3.3 Research Purpose:

According to (Malhotra, K.N., 1996), basic
research designs can be categorized in terms of the research objectives. There are two broad types of research: exploratory and conclusive. These types are explained below. Exploratory research is research performed to explore the problem situation in order to gain ideas and insight into the problem facing the management or the researcher (Malhotra, K.N., 1996). Conclusive research is designed to help the decision maker in determining, assessing, and choosing the best course of action for a given situation (Malhotra, K.N., 1996). Conclusive research is of two types: causal and descriptive. Causal research is a kind of conclusive research whose major goal is to gain evidence concerning cause-and-effect (causal) relationships (Malhotra, K.N., 1996). Descriptive research is a kind of research whose main goal is the description of something, usually market features or functions. Descriptive research assumes that the researcher has prior knowledge about the problem situation; this is one of the main differences between descriptive and exploratory research (Malhotra, K.N., 1996). Among the main kinds of descriptive studies are internally or externally focused sales studies, consumer perception and behavior studies, and market characteristics studies. Additionally, descriptive research uses a variety of data collection methods, such as surveys or secondary data analyzed quantitatively (Malhotra, K.N., 1996). The approach of our study is data mining. According to the definition by (Han, J., & Kamber, M., 2006), data mining refers to mining "knowledge from large amounts of data". When approaching a data-mining problem, a data-mining analyst may already have some a priori hypotheses that he or she would like to test concerning the relationships between the variables (Larose D.T., 2005). However, analysts do not always have a priori notions of the expected relationships among the variables. 
Particularly when faced with large unknown databases, analysts often prefer to use exploratory data analysis (EDA) or graphical data analysis (Larose D.T., 2005). EDA lets the analyst explore the data set, examine the interrelationships among the attributes, identify interesting subsets of the observations, and develop an initial idea of possible associations between the attributes and the target variable, if any. Data mining approaches are of two kinds: descriptive and predictive (Han, J., & Kamber, M., 2006). Predictive mining tasks perform inference on the current data
so as to make predictions (Han, J., & Kamber, M., 2006). Descriptive mining tasks characterize the general properties of the data in the database. The focus of this study is data mining, which is an approach that combines exploratory and confirmatory analysis. The purpose of this research is therefore exploratory, as we try to understand customer behavior by building patterns with data mining tools. According to the definition of data mining approaches, the focus of our data-mining task is descriptive.

3.4 Research Approaches:

There are two kinds of approaches for research design: quantitative and qualitative (Malhotra, K.N., 1996). "Qualitative research is an unstructured, exploratory research methodology based on small samples that provides insights and understanding of the problem setting" (Malhotra, K.N., 1996). In contrast, quantitative research is a methodology that seeks to quantify the data and usually applies some form of statistical analysis. The findings of this kind of research can be treated as conclusive and used to recommend a final course of action (Malhotra, K.N., 1996). Descriptive research is frequently quantitative (Malhotra, K.N., 1996). Data mining allows the decision maker to be supported by quantitative descriptive research. The focus of this study is on data mining, so the research approach of this study is quantitative.

3.5 Research Strategy:

A research strategy is a general plan of how the researcher will answer the research questions. It is a particular way of gathering data (Saunders et al, 2000). Based on the research question, a researcher should choose among survey, secondary data, case study, experiment or history (Yin, R.K, 1994). Two kinds of data are generally used in research: primary data and secondary data. Primary data are produced by the researcher specifically to address the research problem (Malhotra, K.N., 1996). Secondary data are data collected for some purpose other than the problem at hand (Malhotra, K.N., 1996). 
It consists of information made available by business and government sources and computerized databases. Secondary data are an economical and fast source of background information (Malhotra, K.N., 1996). Two major categories are defined for secondary data: internal secondary data and external secondary data (Malhotra, K.N., 1996). External secondary data are generated by sources outside the organization (Malhotra, K.N., 1996). Internal secondary data are data available within the organization for which
the research is being performed (Malhotra, K.N., 1996). While internal secondary data may be available in a directly usable form, it is more common that considerable processing effort is required before such data can be applied (Malhotra, K.N., 1996). The focus of this study is data mining and the data were collected from the database of Kalleh Company, so the suitable strategy for this research is secondary data, specifically internal secondary data. In summary, as the focus of this study is on data mining, the purpose of this research is exploratory. The approach of this study is quantitative, the research strategy is internal secondary data, and the data mining approach is descriptive.

3.6 Research process:

The purpose of this research is to understand the changes in customer buying behavior over time. Figure 3.2 shows a general overview of the change mining flowchart.

Figure 3.2: Change mining process perspective (Input Data: RFM, Demographic, Product; Change Miner; Change patterns)

As shown, the input of this flowchart is RFM data, which describe customer purchasing behavior, together with some demographic variables and product data. These data are fed into the Change Miner. The change mining procedure consists of several steps, each implemented with different data mining techniques and algorithms. In chapter 2, different studies related to change mining were reviewed. Based on the literature, change mining has several steps, including describing customer behavior by mining association rules and mining change patterns. Most of the previous work has addressed retail marketing. The focus of this research is applying change mining to customer behavior for an FMCG manufacturer and distributor company. It analyzes customer behavior in two time snapshots, and the output is the change patterns that occurred between the time periods. The research process followed in this thesis is based on the previous methodology on change mining (Chen et al, 2005). 
In this study, following previous work, the different steps of change mining were studied and, with some changes, integrated into a single process. The whole process of change mining is shown in figure 3.3. The process
consists of several steps: Data Collection, Data Pre-Processing, Customer Segmentation, Mining Customer Behavior, and Change Mining.

Figure 3.3: Change mining process (Data Collection; Data Pre-Processing; Customer Segmentation; Mining Customer Behavior; Change Mining)

Each step itself consists of several tasks. Figure 3.4 illustrates the details of each step.

Figure 3.4: Change mining process in detail
Data Collection
Data Pre-Processing: Data Cleaning; Data Transformation (RFM variables)
Customer Segmentation: Building Customer Value Matrix; Segment Customers
Mining Customer Behavior: Mining Association Rules in each time snapshot
Change Mining: Rule-Matching (computing similarity & difference measures); Mining change patterns, including emerging patterns, added/perished rules and unexpected patterns

As can be seen in figure 3.4, there are various steps in mining changes in customer behavior. Implementing this method requires programming. In this study, we use SQL Server 2000 for data preprocessing tasks such as building the RFM variables and market segmentation. For building customer behavior patterns, we use the open source programming language R (R Software, 2007). In the next section, each step and the methods used are explained in detail.

3.7 Data Collection and Description:

After defining the research question and determining the appropriate research strategy, we should determine the data that relate to the research question (Yin, R.K, 1994). As mentioned under research strategy, empirical data are usually of two types: primary data and secondary data. Primary data are produced by the researcher specifically to deal with the research problem (Malhotra, K.N., 1996).
Secondary data are data gathered for some reason other than the current problem (Malhotra, K.N., 1996). They consist of information prepared by business and government sources and computerized databases. Secondary data can be classified into two types, internal and external (Malhotra, K.N., 1996). Data mining mostly relies on secondary data, and change mining studies mostly work with secondary data collected in business databases. Hence, this study is based on secondary data gathered from Kalleh Company, which is a manufacturer and distributor of food products in the Iranian market. The data are purchasing transactions of Kalleh customers such as fast-food outlets, restaurants and coffee shops, which buy different categories of products from this company. We saved the data in SQL Server 2000 (SQL Server, 2000). According to (Chen et al, 2005), data for change mining fall into 3 categories:

Customer data: market segmentation and mining of customer behavior need demographic variables. In this study, based on the collected data, we have one demographic variable, which shows the geographic area of each customer. There was another variable, customer type, but because of its missing values it could not provide any value for us, so we removed it.

Product data: these describe the different products provided to customers. In our data we have about 800 products and 13 product categories, which are shown in figure 3.5.

Purchasing transaction data of customers: usually, some valuable variables are hidden in a large quantity of raw data and can be obtained by data integration and transformation. Customer behavioral variables such as Recency, Frequency and Monetary are not stored explicitly in the customer and transaction databases, but they can be derived from these data (Chen et al, 2005). In this study, RFM variables are used to analyze customer purchasing behavior during two time snapshots.
The data cover 2 years of purchasing transactions. The number of Kalleh customers in the restaurant and fast-food market of Tehran during this period is about 2457. Table 3.1 shows the data gathered from the Kalleh company database. The variables were selected based on the literature and expert opinions.
Table 3.1: Data collected from Kalleh Company
Customer data: Geographic area of each customer
Product data: Product code; Product category
Transaction data: Date of purchase; Purchase amount (price); Product purchased; Number of orders for each product

Figure 3.5: Product categories of Kalleh Company
Pizza Cheese
Cooking Cheese
Processed Cheese
Tehran meat products (not frozen)
Amol meat products (not frozen)
Frozen meat products
Yogurt-Milk-drinking Yogurt
Other dairy products
Ice creams
Sauces
Dishes
Other complementary products

For this study, 3 types of data were needed: purchasing transaction data for extracting RFM, customer data, and product data. The RFM variables, which are the input of the change mining process, are extracted from the purchasing transaction data to analyze customer behavior. In addition, customer geographic data and product data were extracted from the Kalleh database.
3.8 Data Pre-Processing:
Much of the raw data stored in databases is unpreprocessed, incomplete, and noisy. For instance, databases may include fields that are obsolete or unnecessary, missing values, outliers, data in a form not appropriate for data mining models, and values that are inconsistent with policy or common sense. Databases therefore need to undergo preprocessing, in the form of data cleaning and data transformation, to be useful for data mining (Larose, D.T., 2005).

Data cleaning and integration: Prior to analysis, data accuracy and consistency must be ensured in order to obtain correct results (Chen et al, 2005). Real-world data are mostly incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data (Han, J., & Kamber, M., 2006).

Noisy data: The data stored in a database may contain noise, exceptional cases, or incomplete data objects. When mining data regularities, these objects may confuse the process, and the accuracy of the discovered patterns can consequently be poor. Noise and exceptional cases therefore have to be handled (Han, J., & Kamber, M., 2006). In our database, in both the customer base and the purchasing transactions, there are some customers that belong to Kalleh itself. These records are noise, and we removed all of them from the database by their IDs.

Missing values: A record in which one or more attributes have no value is said to have missing values.
There are several methods for handling missing values: ignoring the record; filling in the missing value manually, which is time-consuming and may not be feasible for a large data set with many missing values; replacing all missing attribute values by the same global constant, such as the label "Unknown"; using the attribute mean to fill in the missing value; or other methods that exist in the literature (Han, J., & Kamber, M., 2006). It is important to note that, in some cases, a missing value may not involve an
error in the data. Software routines may also be applied to expose other null values. Therefore, although we can try our best to clean the data after they are gathered, good design of databases and of data entry procedures should help reduce the number of missing values or errors in the first place (Han, J., & Kamber, M., 2006). In this study, with the exception of one variable, the sales transaction database does not accept null values at data entry, which helps to minimize missing values. The one variable with missing values is customer type: the database does not enforce it at data entry, and we faced a huge number of null values. Since this variable could not provide any value for us, we removed it from our work.

Data Transformation: Data transformation is the process of transforming or consolidating the data into forms appropriate for mining. It can include the following tasks (Han, J., & Kamber, M., 2006).

Aggregation applies summary or aggregation operations to the data. For instance, daily sales data may be aggregated to calculate monthly and yearly total amounts. This step is normally used in building a data cube for analysis of the data at multiple granularities (Han, J., & Kamber, M., 2006). In this study, invoice sales data are aggregated to compute the average sales in each period and the average frequency of purchases. In addition, we use aggregation to calculate the average purchase and the total number of purchases for each customer.

Another data transformation task is data generalization, where low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. In this study, we have about 800 products in 13 categories. For mining purposes, we replace products by their categories, and categories by super-categories defined according to the experts' opinion.
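The aggregation and generalization tasks described above can be illustrated with a small sketch. It is written in Python for illustration only (the study itself used SQL Server 2000 for preprocessing); the record layout, the product codes and the product-to-category mapping are hypothetical, and one purchase is assumed to mean one distinct invoice date.

```python
from collections import defaultdict

def aggregate_by_customer(invoice_lines):
    """Aggregate invoice lines into per-customer summaries.

    invoice_lines: iterable of (customer_id, invoice_date, product_code, amount).
    Returns {customer_id: (number_of_purchases, average_purchase_amount)},
    where one purchase is assumed to be one distinct invoice date.
    """
    per_date = defaultdict(lambda: defaultdict(float))
    for customer, day, _product, amount in invoice_lines:
        per_date[customer][day] += amount
    summary = {}
    for customer, totals in per_date.items():
        n_purchases = len(totals)
        summary[customer] = (n_purchases, sum(totals.values()) / n_purchases)
    return summary

def generalize(invoice_lines, product_to_category):
    """Replace each product by its category (one level of a concept hierarchy)."""
    categories = defaultdict(set)
    for customer, _day, product, _amount in invoice_lines:
        categories[customer].add(product_to_category[product])
    return dict(categories)

# Hypothetical data, for illustration only.
lines = [
    ("C1", "2007-01-05", "P100", 120.0),
    ("C1", "2007-01-05", "P200", 80.0),
    ("C1", "2007-02-10", "P100", 100.0),
    ("C2", "2007-01-20", "P300", 50.0),
]
hierarchy = {"P100": "Pizza Cheese", "P200": "Sauces", "P300": "Ice Creams"}
summary = aggregate_by_customer(lines)
bought = generalize(lines, hierarchy)
```

The same mapping step could be applied a second time to move from categories to super-categories.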
Attribute construction (or feature construction) builds new attributes from the given set of attributes and adds them to assist the mining process (Han, J., & Kamber, M., 2006). Generally, some useful variables are hidden in a large quantity of raw data and can be obtained through data integration and transformation (Chen et al, 2005). In this study, we use customer behavioral variables (Recency, Frequency and Monetary) for customer segmentation. These variables are hidden in the customer and transaction databases, and can be extracted through data integration and
transformation (Chen et al, 2005). (Stone, 1995 cited by Chen et al, 2005) mentioned that "recency is the interval between the most recent transaction time of individual customers and evaluation time". In this study, we take the evaluation time to be the day after the end of each period. Monetary shows the average expenditure of a customer during a period (Chen et al, 2005). Finally, frequency shows the number of purchases in each period for each customer. In this study, for each customer and each period, we calculate the number of purchases as frequency, the average sales as monetary, and the interval between the last purchase and the day after the last date of the period as recency.

Data Discretization: Since the data needed for mining association rules must be discrete, continuous variables should be converted to discrete ones. Discrete values have a significant role in data mining and knowledge discovery. They represent intervals of numbers, which are more concise to represent and specify, and easier to use and comprehend, as they are closer to a knowledge-level representation than continuous values. Many studies show that induction tasks can benefit from discretization, because rules with discrete values are usually shorter and more understandable, and discretization can lead to improved predictive accuracy. In addition, mining a reduced data set needs fewer input/output operations and is more efficient than mining a larger, ungeneralized data set. Because of these benefits, discretization techniques are applied before data mining as a preprocessing task. Discretization techniques can be classified by how the discretization is done, such as whether it uses class information or in which direction it proceeds. If the discretization process uses class information, it is called supervised discretization; otherwise, it is unsupervised.
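As a minimal sketch of the RFM construction and of an unsupervised discretization, the following Python fragment computes recency, frequency and monetary for one customer in one period, and then assigns values to bins holding roughly equal counts (equal-frequency binning). The function names and the data layout are assumptions made for illustration. Monetary is assumed here to be total sales divided by the number of purchases, and a purchase is assumed to be one distinct purchase date.

```python
from datetime import date, timedelta

def rfm(purchases, period_end):
    """Compute (recency_days, frequency, monetary) for one customer in one period.

    purchases: list of (purchase_date, amount) that fall inside the period.
    The evaluation time is the day after the end of the period.
    """
    evaluation_day = period_end + timedelta(days=1)
    last_purchase = max(day for day, _ in purchases)
    recency = (evaluation_day - last_purchase).days
    frequency = len({day for day, _ in purchases})   # distinct purchase dates
    monetary = sum(amount for _, amount in purchases) / frequency
    return recency, frequency, monetary

def equal_frequency_bins(values, n_bins):
    """Label each value with a bin 1..n_bins so that bins hold roughly equal counts.

    Values are ranked and the ranks are cut into n_bins groups. Tied values
    may land in different bins; a production version would need a tie rule.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = rank * n_bins // len(values) + 1
    return labels

# Hypothetical customer with three invoice lines on two dates.
example = rfm(
    [(date(2007, 1, 5), 100.0), (date(2007, 1, 5), 50.0), (date(2007, 3, 1), 150.0)],
    period_end=date(2007, 3, 31),
)
bins = equal_frequency_bins([10, 200, 30, 45, 7, 90], n_bins=3)
```

With three bins, the labels can be read as low, medium and high levels of the variable, which keeps the ordering information that an ordinal attribute carries.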
If the process begins by first finding one or a few points (called cut points) to split the whole attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting. This contrasts with bottom-up discretization or merging, which starts by considering all of the continuous values as potential split points, removes some by merging neighboring values to form intervals, and then recursively applies this process to the resulting intervals (Han, J., & Kamber, M., 2006). Discretization can be performed recursively on an attribute to give a hierarchical or multi-resolution partitioning of the attribute values, known as a concept hierarchy. Concept hierarchies are helpful for mining at multiple levels of abstraction. Although detail is lost by such generalization, that is, by data reduction through collecting and replacing
low-level concepts by high-level concepts, the generalized data may be more meaningful and easier to interpret (Han, J., & Kamber, M., 2006). There are numerous discretization methods in the literature, based on the different definitions mentioned above. In this study, the discretization method used is binning. Binning is the simplest method of discretizing a continuous-valued attribute, by producing a specified number of bins. The bins can be produced by equal-width or equal-frequency partitioning (Liu et al, 2002). In equal-width binning, the continuous range of a feature is evenly divided into intervals that have an equal width, and each interval represents a bin. In equal-frequency binning, an equal number of continuous values is placed in each bin (Liu, 2002). Both methods depend on a given number of bins. In this study, based on domain experts' opinions, we discretize each variable by the equal-frequency method.

3.9 Customer Segmentation:
Many analytical methods are applied for market segmentation. One of the most traditional approaches is demographic segmentation. Other methods also use buyer attitudes, motivations and patterns of usage. Companies that capture customer and purchase information apply such information to analyze customer behavior for their marketing efforts (Marcus, C., 1998). While the availability of customer purchase information has allowed marketers to develop richer, more complicated customer segmentation schemes, simplicity has also proven its place. For many years, RFM (recency, frequency and monetary value) has been applied to segment customers in order to help marketers optimize their marketing efforts. RFM has often been challenged by innovative conceptual approaches made possible by new technologies such as neural networks.
Yet, many marketing tasks, particularly in direct marketing, continue to rely on RFM variables, because the lift gained by using alternative methods does not normally justify the costs of implementing those methods. There are costs linked with increased technical complexity, particularly that of taking the analysis away from marketers and putting it into the hands of programmers and statisticians. In addition, the costs of explanation and communication are important, as marketers need to develop actionable strategic and tactical decisions from the research findings. The Customer Value Matrix is a simple yet powerful customer segmentation technique that overcomes the above limitations. Its effectiveness lies not only in that it recognizes key customer segments, but also
in that it emphasizes appropriate marketing strategies and tactics in a manner that can be readily communicated and easily executed (Marcus, C., 1998).

3.9.1 Customer Value Matrix
The Customer Value Matrix was developed from a desire to apply RFM to the small-business retail environment, where it became clear that RFM was too complex and time-consuming for marketers. There were two problems. First, while RFM was comparatively simple conceptually, it produced too many segments, so the resulting segmentation was often difficult to understand and even more difficult to apply. Second, closer examination of the RFM analysis highlighted the collinearity of the Frequency of Purchase and the total Monetary Value variables. (Charles Edmundson cited by Marcus, C., 1998) recommended using the Average Purchase Amount instead of the total Monetary Value of a customer to eliminate the collinearity between the two variables. In addition, for greater precision, the variable Frequency of Purchase was transformed to Number of Purchases. These changes were refinements over the usual RFM analysis; however, they did not solve the problem of ending up with too many segments to understand and to work with (Marcus, C., 1998). What was required was a simplified, more practical version of RFM. The first simplification was to focus on the two variables that best explained the value of a customer, Number of Purchases and Average Purchase Amount; the second was to simplify the segmentation to a 2*2 matrix (Marcus, C., 1998).

3.9.2 An effective analytical tool
Matrices have been effectively applied to help in understanding information for decision-making. Perhaps the most widely known matrix is the Boston Consulting Group's (BCG) Growth-Share Matrix, which centers on the allocation of resources given the market share position and growth potential of a given set of business opportunities (Henderson, 1967; Porter, 1980 cited by Marcus, C., 1998).
The BCG Growth-Share Matrix can be used for segmenting markets and products. It segments business opportunities into four clearly described groups (Cash cows, Stars, Dogs and Question marks). The BCG Matrix adds further value by indicating which managerial strategies and tactics are needed for each business segment. The application of a comparatively simple scheme and easy-to-understand quadrant identifiers has made the BCG Matrix an effective analytical tool (Marcus, C., 1998).
Simplifying the RFM analysis to center on the customer-value-based variables, Number of Purchases and Average Purchase Amount, and using a 2*2 matrix to represent the resulting segmentation proved to be effective in arriving at a realistic and meaningful approach to customer segmentation (Marcus, C., 1998). In this study, according to the literature and based on expert opinions, we have chosen the Customer Value Matrix to segment customers.

3.9.3 Customer Value Matrix Methodology
Building the Customer Value Matrix has several steps. In the first step, some basic customer and purchase information is required. In the second step, the segmentation process is executed to place each customer in the Customer Value Matrix. Finally, we should obtain four segments with key differences among the resulting customer segments (Marcus, C., 1998).

Data: The data required to develop the Customer Value Matrix are the customer identification (ID) number, the purchase date and the total purchase amount. The customer ID number is used to identify the purchases of each customer. The total Number of Purchases is simply a count of the unique dates of a given customer's invoices. The total amount of each purchase is used to calculate the Average Purchase Amount (Marcus, C., 1998). In this study, the data that we have for building the Customer Value Matrix are the customer identification number, the date of each purchase and the total amount of each purchase.

Segmentation: The segmentation process using the Customer Value Matrix needs the average values of the Number of Purchases and of the Average Amount Spent. The average value for the x-axis, or Average Number of Purchases, is calculated by taking the total number of purchases for the customer base and dividing it by the total number of customers in the database.
The average value for the y-axis, or Average Purchase Amount, is obtained by taking the total revenue and dividing it by the total number of purchases. The axis averages then serve to separate the high and low values on each scale. In this study, the gathered data did not allow us to calculate revenue, so we used total sales instead. Table 3.2 shows these variables and their calculation for this study. The results are presented in chapter 4.
[Table 3.2: Calculating variables for the customer value matrix]

Then, we compare each customer's Number of Purchases and Average Purchase Amount to the averages obtained for the whole customer base. Each customer is thus allocated exclusively to one of the four segments, based on whether they are above or below the axis averages. The output of this step is a matrix as illustrated in figure 3.6.

[Figure 3.6: Customer value matrix. x-axis: Frequency, split at Avg. Frequency; y-axis: Monetary, split at Avg. Monetary; quadrants: Spender, Best, Uncertain, Frequent]

The result of this step is presented in the next chapter.
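The segmentation procedure described above can be sketched as follows. This is an illustrative Python sketch, not the implementation used in the study. The input layout is hypothetical, total sales stands in for revenue as in this study, and the quadrant names and their positions (Best: both high; Spender: few purchases, high amount; Frequent: many purchases, low amount; Uncertain: both low) follow the usual reading of Marcus's matrix and should be treated as an assumption. A value exactly equal to an axis average is treated as low here.

```python
def customer_value_matrix(customers):
    """Assign each customer to one quadrant of the Customer Value Matrix.

    customers: {customer_id: (number_of_purchases, total_sales)}.
    Returns ({customer_id: segment}, avg_number_of_purchases, avg_purchase_amount).
    """
    total_purchases = sum(n for n, _ in customers.values())
    total_sales = sum(sales for _, sales in customers.values())
    # x-axis average: total number of purchases / number of customers
    avg_number = total_purchases / len(customers)
    # y-axis average: total sales / total number of purchases
    avg_amount = total_sales / total_purchases

    segments = {}
    for customer, (n, sales) in customers.items():
        many = n > avg_number
        large = (sales / n) > avg_amount
        if many and large:
            segments[customer] = "Best"
        elif large:
            segments[customer] = "Spender"
        elif many:
            segments[customer] = "Frequent"
        else:
            segments[customer] = "Uncertain"
    return segments, avg_number, avg_amount

# Hypothetical customer base, for illustration only.
base = {"A": (10, 1000.0), "B": (2, 400.0), "C": (8, 240.0), "D": (1, 20.0)}
segments, avg_number, avg_amount = customer_value_matrix(base)
```

Because each customer is compared against the same two averages, every customer falls into exactly one of the four segments.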
The Customer Value Matrix centers on the Number of Purchases and the Average Purchase Amount as the primary variables, since they best represent customer value. Using the Customer Value Matrix as the foundation, any number of further variables (such as geographic or demographic variables, purchase recency or the length of the customer relationship) may be overlaid on the segmentation to obtain more detail from the customer data and their transactions. The methodology for developing the Customer Value Matrix shows that a relatively simple yet effective customer segmentation is indeed possible. In this study, according to the literature review and expert opinion, we segmented our customers with the Customer Value Matrix of (Marcus, C., 1998). This method gives four segments that can be clearly differentiated. The result can be found in chapter 4.

3.10 Mining Customer Behavior:
Different methods for describing customer behavior exist in the literature. Among them, there are various types of conjunctive rules for building customer behavior patterns, including association rules and classification rules (Agrawal et al. cited by Adomavicius, G. Tuzhilin, A., 2001). Using rules to describe customer behavior has certain advantages. Besides being an intuitive and descriptive way to represent behavior, a conjunctive rule is a well-studied concept used extensively in data mining, expert systems, logic programming, and many other areas. In addition, researchers have proposed many rule discovery algorithms in the literature, especially for association rules (Adomavicius, G. Tuzhilin, A., 2001). To discover rules that describe the behavior of customers, we can use various data mining algorithms, such as Apriori for association rule mining.
Association rules were originally used to analyze the relationships between product items bought by customers at retail stores (Agrawal, Imielinski, & Swami, 1993; Srikant, Vu, & Agrawal, 1997 cited by Chen et al, 2005). In customer behavior research, association rules can be used to find the correlations between customer profiles, described by demographic variables, and purchased products, by exploring customer and product databases (Song et al, 2001). In this research, based on the literature, we mine customer purchasing behavior with association rules.

3.10.1 Association Rule Mining:
A classic association rule has
an implication of the form A->B, where A is an itemset and B is an itemset that includes only a single atomic condition (Song et al, 2001). A and B are statements regarding the values of attributes of an example in a database (Song et al, 2001). "A is termed the left-hand-side (LHS), and is the conditional part of an association rule. Meanwhile, B called the right-hand-side (RHS), and is the consequent part". An itemset is frequent if its relative support satisfies a pre-specified minimum support threshold (Chen et al, 2005). The support of an association rule is the percentage of records containing itemsets A and B at the same time. The confidence of an association rule is the percentage of records containing itemset A that also contain itemset B. The support shows the usefulness of the discovered rule, and the confidence shows its certainty (Song et al, 2001). The most common use of association rules is market basket analysis, in which the market basket contains the set of items (namely, an itemset) purchased by a customer during a single store visit (Chen et al, 2005). Association rule mining discovers all collections of items in a database whose confidence and support meet or exceed pre-specified threshold values (Song et al, 2001). In this research we use the Apriori algorithm introduced by (Agrawal et al, 1993) to build profile association rules. In the next section, we explain the Apriori algorithm and the way it works.
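The definitions of support and confidence given above can be made concrete with a short sketch. It is written in Python for illustration; the item labels (discretized customer attributes such as "R=high" mixed with product categories) and the four example records are hypothetical.

```python
def support(records, itemset):
    """Fraction of records that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= record for record in records) / len(records)

def confidence(records, lhs, rhs):
    """Fraction of the records containing lhs that also contain rhs."""
    return support(records, set(lhs) | set(rhs)) / support(records, lhs)

# Hypothetical records: customer attributes and a purchased category.
records = [
    {"R=high", "F=high", "Cheese"},
    {"R=high", "Cheese"},
    {"R=high", "Sauce"},
    {"R=low", "Cheese"},
]
rule_support = support(records, {"R=high", "Cheese"})         # rule R=high -> Cheese
rule_confidence = confidence(records, {"R=high"}, {"Cheese"})
```

In this example, two of the four records contain both sides of the rule, and three records contain the conditional part, so the rule holds for two out of those three.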
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space. It states that all nonempty subsets of a frequent itemset must also be frequent. By definition, if an itemset I does not satisfy the minimum support threshold, then I is
not frequent, that is, $P(I) < min\_sup$. If an item $A$ is added to the itemset $I$, then the resulting itemset (i.e., $I \cup A$) cannot occur more frequently than $I$. Therefore, $I \cup A$ is not frequent either, that is, $P(I \cup A) < min\_sup$. The Apriori property is applied in the algorithm through two steps, join and prune, which make it more efficient.

A major challenge in mining frequent itemsets from a large data set is that such mining often produces a huge number of itemsets satisfying the minimum support (min_sup) threshold, especially when min_sup is set low (Han, J., & Kamber, M., 2006). To overcome this difficulty, the two concepts of closed frequent itemset and maximal frequent itemset have been introduced. An itemset $X$ is closed in a data set $S$ if there exists no proper super-itemset $Y$ such that $Y$ has the same support count as $X$ in $S$. An itemset $X$ is a closed frequent itemset in $S$ if $X$ is both closed and frequent in $S$. An itemset $X$ is a maximal frequent itemset (or max-itemset) in $S$ if $X$ is frequent and there exists no super-itemset $Y$ such that $X \subset Y$ and $Y$ is frequent in $S$ (Han, J., & Kamber, M., 2006).

Once the frequent itemsets from the transactions in a database $D$ have been found, it is simple to produce strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using Equation (1-4) for confidence, which we show again here for completeness:

$$confidence(A \Rightarrow B) = P(B \mid A) = \frac{support\_count(A \cup B)}{support\_count(A)} \quad \text{(1-4)}$$

The conditional probability is expressed in terms of itemset support counts, where $support\_count(A \cup B)$ is the number of transactions containing the itemsets $A \cup B$, and $support\_count(A)$ is the number of transactions containing the itemset $A$. Association rules can therefore be generated as follows:

For each frequent itemset $l$, generate all nonempty subsets of $l$.
For every nonempty subset $s$ of $l$, output the rule $s \Rightarrow (l - s)$ if $\frac{support\_count(l)}{support\_count(s)} \geq min\_conf$, where min_conf is the minimum confidence threshold (Han & Kamber, 2006).
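The level-wise search, the maximal frequent itemsets and the rule generation step described above can be sketched compactly. This is an unoptimized Python sketch for illustration only (the study itself used R for this step); the function names are hypothetical, the minimum support is given as an absolute count, and the join step simply unites pairs of frequent k-itemsets rather than using the sorted-prefix join of the original algorithm.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support_count} for all frequent itemsets (min_sup is a count)."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}   # L1
    frequent = dict(level)
    k = 1
    while level:
        # Join: unite pairs of frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune (Apriori property): every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # One scan of the database counts the surviving candidates.
        level = {}
        for c in candidates:
            n = sum(c <= t for t in transactions)
            if n >= min_sup:
                level[c] = n
        frequent.update(level)
        k += 1
    return frequent

def maximal(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}

def generate_rules(frequent, min_conf):
    """Output (s, l - s, confidence) for every rule s => (l - s) meeting min_conf."""
    rules = []
    for l, count_l in frequent.items():
        for size in range(1, len(l)):
            for s in map(frozenset, combinations(l, size)):
                conf = count_l / frequent[s]   # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

# Hypothetical transactions, for illustration only.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(data, min_sup=3)
max_sets = maximal(frequent)
rules = generate_rules(frequent, min_conf=0.75)
```

Rule generation can look up `frequent[s]` directly because, by the Apriori property, every nonempty subset of a frequent itemset is itself frequent and therefore already counted.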
In this research, the customer behavioral variables (RFM) and the geographic variable are associated with purchased products to build customer purchasing behavior patterns. The association rules discovered in the two time periods are then used in change mining to identify customer behaviors that change over time. We applied the Apriori algorithm with maximal frequent itemsets to build associations between customer attributes and their purchased products. These rules can include any number of attributes on either side of the rule. On the left-hand side (LHS), or conditional part, of the rule we have the RFM and customer data, and on the right-hand side (RHS), or consequent part, we have the purchased product items. Not all association rules are interesting to decision makers. Rule support and confidence are two measures of rule interestingness, and an interesting rule must satisfy the minimum support and confidence determined by domain experts.

3.11 Change Mining:
After building customer behavior patterns, we want to mine the changes that happened in customer purchasing behavior. In this study, the two measures of similarity and unexpectedness from (Chen et al, 2005) are applied to investigate changes in customer behavior. We have also used the information in ordered variables when calculating these two measures, which yields more knowledge about the changes and was not done in previous works. First, we explain the change patterns, and then we explain the measures mathematically.

Change Patterns: Based on previous studies, four patterns are identified to measure changes in
customer behavior (Dong, G., & Li, J., 1999; Liu & Hsu, 1996; Padmanabhan & Tuzhilin, 1999; Song et al, 2001 cited in Chen et al, 2005). These patterns are the emerging pattern, the added pattern, the perished pattern, and the unexpected change. The four change patterns are explained below.

Emerging patterns: Emerging patterns (EPs) are a kind of pattern for knowledge discovery from databases. They are defined as rules or itemsets whose supports increase significantly between time-stamped datasets (Dong, G., & Li, J., 1999). Emerging patterns can capture emerging trends in time-stamped databases, or useful contrasts between data classes (Dong, G., & Li, J., 1999). In marketing management, emerging patterns involve the same consumer behavior existing in different periods of time. A positive pattern growth rate (i.e. the support of a rule increases over time) indicates that the customer behavior becomes stronger over time. Meanwhile, a pattern growth rate below zero indicates that the customer behavior is getting weaker. For emerging patterns, the conditional and consequent parts of the two rules are the same, but the support of the two rules changes significantly between the time periods (Chen et al, 2005). Emerging patterns have proven useful, as (Dong, G., & Li, J., 1999) mentioned, and (Dong, G., & Li, J., 1999) believed that EPs with low to medium support, such as 1% to 20%, can give useful new insights and assistance to experts.

Added/Perished patterns: An added pattern is a rule at period $t_2$ whose conditional and consequent parts all differ significantly from those of any rule at period $t_1$. A perished pattern is a rule at period $t_1$ whose conditional and consequent parts all differ significantly from those of any rule at period $t_2$. A perished pattern is thus a vanished pattern, found in the past but not in the present. The rule matching threshold (RMT) is applied by (Chen et al, 2005) to measure the degree of change.

Unexpected change: There is some work on unexpected changes in the literature on mining interesting patterns (Chen et al, 2005).
Liu and Hsu (1996) classified unexpected changes into unexpected conditional changes and unexpected consequent changes. If the conditional parts of $r_i^{t_1}$ and $r_j^{t_2}$ are similar, but their consequent parts are different, then $r_j^{t_2}$ is an unexpected consequent change with respect to $r_i^{t_1}$ (Liu & Hsu, 1996; Song
et al, 2001). furthermore, if the consequent parts of and are similar, but their conditional parts are different, then is an unexpected conditional change with respect to. In this research, the unexpected changes of customer behavior can be identified in the form of unexpected purchasing (consequent) patterns and customer shifting (conditional) patterns based on (Chen et al, 2005). After explaining the four customer changes, this study elaborates on the measures used to detect these changes. Change Measure: For calculating changes, we have two measures from (Chen et al, 2005), one is similarity which calculate similarity percentage between two rules and the unexpectedness measure applying when two rules don t have any similarity to discover any unexpected event in rules, either in conditional part or consequent part. Before calculating these two measures, we need some notations that defined below. ; ;, ;, ; ; ; ; ; ; ; ; 65
; l ; ;, 1, th, 0, 1,2,, ; ;, 1, th, 0, 1,2,, ; ; ; ; ; ;, 1,, 1;, 0 In this study, first, we applied two measures of similarity and unexpectedness by (Chen et al, 2005). The Similarity measure can be used to measure the degree of likeness between two rules, and unexpectedness measure can be used to identify the disparity between dissimilar rules. Two measures shown below: l, 0, 0 0 0, 0 Where l and are defined as follows: 66
l,, 2 3 In Eqs. (2) and (3), l and h represent the similarity of the conditional and consequent parts, respectively. The degree of similarity,s, is between 0 and 1, where 0 indicates that the two patterns are completely dissimilar, and 1 indicates that the two patterns are identical. For mining change, we have some steps that are as below: First, we calculate similarity measure for every rule in the first period to all of the rules from the second period and vice versa. Following calculating the similarity of patterns, the maximum similarity degrees of Rules r and r are determined to measure the change of patterns during periods t andt. The maximum degrees of similarity are represented using Eqs. (4) and (5), as below. max,,, 4 max,,, 5 According to (Chen et al, 2005), the maximum similarity provides the basis for differentiating emerging patterns, added patterns, and patterns during various periods. If the maximum similarity of Rule r, S, equals 1 (or S equals 1), then the rule exists in both time periods t1 and t2, and thus shows an emerging pattern. If a rule displays positive growth (Sup2>Sup1), then the rule represents a pattern of customer behavior that becomes strong with time. Vice versa, a growth rate below zero indicates negative of customer behavior change. If the maximum similarity of Rule r, S, lies between 0 and 1, the two rules share a partial resemblance. The decision maker determines a rule matching 67
threshold (RMT) to judge whether the similarity of a specific rule satisfies the criteria set by the individual user. If the maximum similarity of rule $r_i^{t_1}$, $S_i$, is smaller than the RMT ($S_i < RMT$), this rule gradually perishes in time period $t_2$ and is therefore considered a perished pattern; otherwise, the rule is not perished. Likewise, if the maximum similarity of rule $r_j^{t_2}$ is below the RMT ($S_j < RMT$), $r_j^{t_2}$ in period $t_2$ is quite different from the rules in period $t_1$, and it is thus considered an added pattern; otherwise, it is not an added rule. If the maximum value of the similarity measure for a rule is 0, the unexpectedness measure is used to judge whether the two rules involve unexpected changes.

Unexpectedness was initially used as a subjective measure of the interestingness of a pattern: patterns are interesting if they are surprising to the user (Silberschatz, A., & Tuzhilin, A., 1996). In this study we use the unexpectedness measure introduced by (Chen et al, 2005). The measure is given in equation (6).

[Equation (6): unexpectedness measure $\delta_{ij}$]

If $\delta_{ij} > 0$, then rule $r_j^{t_2}$ is an unexpected purchasing rule (i.e. an unexpected consequent change) with respect to $r_i^{t_1}$. In this case, customers with the same characteristics shift their purchasing behavior or buy different products. If $\delta_{ij} < 0$, then rule $r_j^{t_2}$ is an unexpected shifting rule (i.e. an unexpected conditional change) with respect to $r_i^{t_1}$. This change indicates that the customer group of specific products has changed to another group. If the unexpectedness value equals 0, the two rules are completely different. If the unexpectedness obtained when comparing a rule from $t_1$ with the rules of period $t_2$ is 0, the rule is an unexpected perished rule. Conversely, a rule from $t_2$ whose unexpectedness against the rules of period $t_1$ is 0 is an unexpected added rule.

Our contribution:
In the previous work, the two measures did not use the information carried by ordinal data. In this study, after mining changes with the two measures introduced by (Chen et al, 2005), we modified these measures so that change mining uses the information that ordinal values carry. Here we compare ordinal values instead of binary values for each attribute. To do so, we calculate the distances between the values of each common attribute of the LHS and the RHS. According to (Han, J., & Kamber, M., 2006), the dissimilarity (or similarity) between objects described by interval-scaled variables is typically computed from the distance between each pair of objects. One of the most popular distance measures is the Manhattan (or city block) distance, which is defined as:

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|$$

The measures of similarity and unexpectedness are modified by using Manhattan distances. The modified measures are as follows. The first distance term is the distance between the $p$th attribute of $r_i^{t_1}$ and $r_j^{t_2}$, where the $p$th attribute is common to both rules; it is based on the definition of Manhattan distance. $\beta$ is the distance between the $q$th attribute of $r_i^{t_1}$ and $r_j^{t_2}$, where the $q$th attribute is common to both rules; it is also based on the definition of Manhattan distance.

[Equations: modified similarity and unexpectedness measures]

By defining these measures, we bring the information carried by the ordinal data into the calculation of changes.
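The similarity measure of (Chen et al, 2005) and the classification by maximum similarity described in this chapter can be illustrated with a small Python sketch. It is not the code used in this study (the analysis was done in SQL Server and R). The function names, the rule string format, and the threshold `RMT = 0.4` are assumptions made for illustration only; the formula follows our reading of the measure, which reproduces the similarity values reported in the rule tables of Chapter 4 (for example 0.5, 0.333333 and 0.166667).

```python
def parse_rule(text):
    """Split 'F=100%,R=50% -> cat5=1' into (conditional, consequent) dicts."""
    lhs, rhs = text.split("->")

    def to_dict(part):
        # Ignore empty items left by trailing commas such as 'R=50%, -> cat5=1,'.
        items = [item.strip() for item in part.split(",") if item.strip()]
        return dict(item.split("=", 1) for item in items)

    return to_dict(lhs), to_dict(rhs)


def part_similarity(p, q):
    """Attribute-match degree (l or h) times the share of common attributes
    that carry the same value in both rules."""
    common = p.keys() & q.keys()
    if not common:
        return 0.0
    match_degree = len(common) / max(len(p), len(q))
    same_value = sum(p[a] == q[a] for a in common) / len(common)
    return match_degree * same_value


def similarity(rule_i, rule_j):
    """Similarity of two rules: conditional-part term times consequent-part term."""
    (x_i, y_i), (x_j, y_j) = parse_rule(rule_i), parse_rule(rule_j)
    return part_similarity(x_i, x_j) * part_similarity(y_i, y_j)


def max_similarity(rule, other_rules):
    """Best similarity against the other period and the 1-based index of that rule."""
    scores = [similarity(rule, other) for other in other_rules]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best + 1


def classify(max_sim, rmt, period):
    """Change type from the maximum similarity; period is 1 (t1) or 2 (t2)."""
    if max_sim == 1.0:
        return "emerging"
    if max_sim == 0.0:
        return "check unexpectedness"
    if max_sim < rmt:
        return "perished" if period == 1 else "added"
    return "not perished" if period == 1 else "not added"


RMT = 0.4  # hypothetical rule matching threshold, chosen only for this example

t1_rule = "F=100%,M=50% -> cat1=1"
t2_rules = ["F=100%,R=25%,M=75% -> cat1=1", "R=25% -> cat2=1"]
score, index = max_similarity(t1_rule, t2_rules)
print(round(score, 3), index, classify(score, RMT, period=1))
```

With these inputs the first-period rule shares two of three conditional attributes with its best match, but only one of them has the same value, so the similarity is 1/3 and the rule falls below the assumed threshold.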
Chapter 4: Results & Analysis
Data preprocessing result
Customer Segmentation result
Mining customer behavior result
Change mining result
The data pre-processing phase of the analysis was done in SQL Server 2000 (SQL Server, 2000), and the data-mining phase was performed with the R package (R software, 2007). This chapter presents the analysis and result of each step in the change mining process.

4.1 Data preprocessing result:

4.1.1 Data Cleaning:
As described in chapter 3, the data contain some noise, namely the customers who belong to Kalleh Company itself. During the two periods that we analyze there were 2499 customers, of which 42 belong to Kalleh Company, so we removed them from the database. The total number of customers after removing the noisy data is 2457.

4.1.2 Data Transformation result:

4.1.2.1 Generalization:
The result of the generalization explained in chapter 3 is 6 categories of products, shown in Figure 4.1.

Product Categories
Category1: Dairy Products
Category2: Ice-Cream
Category3: Meat Products (frozen or non-frozen)
Category5: Pizza Cheese
Category11: Sauces
Category13: Cooking Cheese & processed cheese
Figure 4.1: Generalized product categories

4.1.2.2 RFM Construction:
As explained in chapter 2, building the customer behavior patterns requires customer behavioral variables. This part of the research was done in SQL Server 2000. To calculate RFM, we first divided our dataset into two time snapshots: one between '1383/07/01' AND '1384/06/31' as period one (t1), and the second between '1384/07/01' AND '1385/06/31' as period two (t2).
We defined recency as the interval between a customer's last date of purchase and the last date of each period. This means that the evaluation dates for the two time snapshots are '1384/06/31' and '1385/06/31'. For frequency and monetary, we aggregate the transaction data to calculate the total number of purchases and the total amount spent during each period. The market segmentation of (Marcus, C., 1998) requires the average purchase of each customer, so we divide the total purchase amount by the total number of purchases to obtain the average amount of each purchase. The final data, ready for discretization in the next step, have the format illustrated in Table 4.1, which lists the fields extracted from the database.

Table 4.1: RFM table fields
RFM Table: Period | Customer Code (ID) | Recency (days) | Frequency | Monetary (Average of purchase)

4.2 Customer segmentation (in SQL Server 2000):
Following (Marcus, C., 1998), we divided the customers in each period into four clusters: uncertain, frequent, spender and best. The Customer Value Matrix has two axes. Its calculation steps and results are given in the following section.

4.2.1 Customer Value Matrix Result:
Following the definition of the Customer Value Matrix in chapter 3, we applied the customer value matrix introduced by (Marcus, C., 1998). For each period, we define two variables for this matrix: the average number of purchases and the average purchase amount. The results are shown in Table 4.2 and Table 4.3.

Period 1:
Table 4.2: Calculating variables for the customer value matrix
Average number of purchases = total number of purchases / total number of customers
Total number of purchases = 16,424
Total number of customers = 789
Average number of purchases = 16,424 / 789 = 20.82 purchases per customer
Average purchase amount = total sales / total number of customers
Total sales average = 1,037,047,130.8 Rials
Total number of customers = 789
Average purchase amount = 1,037,047,130.8 Rials / 789 = 1,314,381.66 Rials per purchase

Period 2:

Table 4.3: Calculating variables for the customer value matrix
Average number of purchases = total number of purchases / total number of customers
Total number of purchases = 31,061
Total number of customers = 2199
Average number of purchases = 31,061 / 2199 = 14.1 purchases per customer
Average purchase amount = total sales / total number of customers
Total sales average = 2,829,589,665.4 Rials
Total number of customers = 2199
Average purchase amount = 2,829,589,665.4 Rials / 2199 = 1,286,762 Rials per purchase

Based on the customer value matrix we have four clusters. For each customer, we calculate the average purchase amount and the number of purchases. By comparing these values with the two averages calculated above, we determine which cluster each customer belongs to. When the average sale of a customer is less than the average of sales and
number of purchases is less than the average frequency, the customer belongs to the uncertain segment. When the average sale of a customer is less than the average of sales and the number of purchases is greater than the average frequency, the customer belongs to the frequent segment. When the average sale of a customer is greater than the average of sales and the number of purchases is less than the average frequency, the customer belongs to the spender segment. Finally, when the average sale of a customer is greater than the average of sales and the number of purchases is greater than the average frequency, the customer belongs to the best segment. Figure 4.2 shows the four segments.

[Figure 4.2: The Customer Value Matrix. Axes: Frequency and Monetary, each split at its average (Avg. Frequency, Avg. Monetary). Quadrants: Uncertain, Frequent, Spender, Best]

The segmentation results for periods one and two are shown in Table 4.4 and Table 4.5.

Table 4.4: Segment information for period 1
Segment | Number of customers | Percentage
Uncertain | 441 | 56%
Frequent | 140 | 18%
Spender | 85 | 11%
Best | 123 | 16%
Total | 789 |

Table 4.5: Segment information for period 2
Segment | Number of customers | Percentage
Uncertain | 1335 | 61%
Frequent | 389 | 18%
Spender | 227 | 10%
Best | 248 | 11%
Total | 2199 |

Based on the Customer Value Matrix, as mentioned, the four clusters are: uncertain, frequent, spender and best.

4.3 Customer Behavior Mining:
In this phase, we applied association rules to analyze the patterns of customer behavior in the different time snapshots for each customer cluster. To mine changes in customer behavior across periods, we divide the data into two periods, and in each period we build four clusters of customers: uncertain, spender, frequent and best. As mentioned, the purpose of this study is to mine customer behavior patterns by building association rules in which customer profile data and behavioral variables (RFM) form the conditional part and the purchased products form the consequent part. The first issue is that association rule mining works with discrete variables. Therefore, the first phase is discretization.

4.3.1 Discretization Result:
As explained in chapter 3, there are various types of discretization methods. In this study we did this step in the R package. For the three customer behavioral variables, RFM, we used equal frequency binning. The result of this discretization is as follows. We build four quantiles by equal frequency binning in the R package for Recency, Frequency and Monetary. Here we
present the quantile data and histograms of R, F, M and Area separately.

Table 4.6: R quantile
Figure 4.3: R histogram
Table 4.8: F quantile
Frequency variable | Quantile interval
1st quantile | 1 to 2
2nd quantile | 2 to 5
3rd quantile | 5 to 20
4th quantile | 20 to 248

Figure 4.5: F histogram

To discretize the area, we defined four groups based on the opinion of market experts and their knowledge of the area.
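The two preparation steps described so far in this chapter, Customer Value Matrix segmentation (Section 4.2.1) and equal frequency binning into four quantiles (Section 4.3.1), can be sketched in Python as follows. This is only an illustration and not the SQL Server or R code used in the study. The function names are hypothetical, the labels 25% to 100% mirror the quantile labels that appear in the mined rules, and treating a value exactly equal to the average as "not greater" is an assumption, since the text only distinguishes "less than" and "greater than".

```python
from bisect import bisect_left
from statistics import quantiles

QUARTILE_LABELS = ["25%", "50%", "75%", "100%"]


def customer_value_segment(n_purchases, avg_amount, mean_purchases, mean_amount):
    """Place one customer in the Customer Value Matrix.

    n_purchases and avg_amount describe the customer; mean_purchases and
    mean_amount are the period averages (for example the values in Table 4.2).
    """
    frequent = n_purchases > mean_purchases
    big_spender = avg_amount > mean_amount
    if frequent and big_spender:
        return "best"
    if frequent:
        return "frequent"
    if big_spender:
        return "spender"
    return "uncertain"


def equal_frequency_labels(values):
    """Label each value with its quartile using three quantile cut points."""
    cuts = quantiles(values, n=4, method="inclusive")
    # bisect_left counts the cut points strictly below the value,
    # so values up to the first cut point fall in the "25%" bin.
    return [QUARTILE_LABELS[bisect_left(cuts, v)] for v in values]


# Period 1 averages taken from Table 4.2; the customers are made up.
MEAN_PURCHASES = 20.82
MEAN_AMOUNT = 1_314_381.66

print(customer_value_segment(30, 900_000.0, MEAN_PURCHASES, MEAN_AMOUNT))
print(customer_value_segment(5, 2_000_000.0, MEAN_PURCHASES, MEAN_AMOUNT))
print(equal_frequency_labels([1, 2, 3, 4, 5, 6, 7, 8]))
```

Cut-point binning keeps equal values in the same bin, which matters for a variable such as frequency where many customers share the same small count; the price is that the bins are only approximately equal in size when ties are common.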
4.3.2 Association Rule Mining Results:
We applied the Apriori algorithm to mine association rules in this research. The minimum support and the minimum confidence are both 17%, and we find maximal frequent itemsets. After the association rules are built, the rules of each cluster for the two time periods are compared to understand the customer behavior patterns of the most valuable customers. This gives 8 rulesets: four clusters of customers in two periods.

4.4 Change Mining:
In this step, for each customer cluster we compare its two rulesets from the two periods. Table 4.10 summarizes the rules generated in each cluster and each period.

Table 4.10: Generated rule summary
Number of generated association rules per cluster
Cluster | Period1 | Period2
1 | 20 | 13
2 | 76 | 65
3 | 22 | 29
4 | 127 | 86

All of the mined association rules and their change types are shown in the tables of Section 4.4.2, which list the rules together with the changes detected by the similarity formula of Chen et al. For each cluster, we have one ruleset for each period. In each ruleset we can see different kinds of changes in customer purchasing behavior. Cluster one is the cluster whose customers buy frequently and whose purchase amount is below the overall average purchase amount. Of the 5 kinds of change defined in the methodology, we found 4 kinds in the four clusters. Because there is a large number of changes in customer behavior patterns, a few examples are selected from each change type to provide an explanation.

4.4.1 Some examples of change pattern:

One example of emerging pattern in cluster1:
t1-r5: "Area=poor, -> cat1=1" support = 0.191344
t2-r8: "Area=poor, -> cat1=1" support = 0.260674
Cat1 is the dairy products group.
The growth rate is (0.260674 - 0.191344) / 0.191344, or about 36%. This rule shows that customers in the poor area generally buy dairy products. The 36% growth in the support of this rule means that the rule grows more robust over time.

One example of unexpected purchasing pattern in cluster1:
t1-r1: area=normal -> cat11=1 support = 0.170843 (cat11 is sauces)
t2-r4: area=normal -> cat1=1 support = 0.193258 (cat1 is dairy products)
The above rules show that the initial pattern of customer behavior for area=normal is to purchase the sauces category. However, in the second period, this group
shows that they purchase dairy products. This unexpected consequent pattern can lead marketing decision makers to direct their marketing effort toward finding out why this change happened, to promote dairy products to this group, and to reduce the promotion of sauces in the normal area, thus increasing customer value.

One example of Added Rule:
t2-r24: R=25%, -> cat2=1 support = 0.200514
Here R=25% means recency is between 0 and 5 days, and cat2 is ice-cream.
The above rule is a newly added pattern, which provides a reference for developing promotion plans to stimulate customer needs.

One example of Perished Rule:
t1-r11: F=100%,M=50% -> cat1=1 support = 0.178571 similarity = 0.333
Here F=100% means frequency is between 20 and 248 times, M=50% means monetary is between 269471.154 Rials and 538398.005 Rials, and cat1 means dairy products.
The above rule shows that during the first period, customers whose frequency is between 20 and 248 times and whose monetary expenditure is between 269471.154 Rials and 538398.005 Rials bought dairy products, but the similarity of this rule with the rules generated in the next period is 0.33, which is lower than the rule matching threshold (RMT). In marketing, such a situation means that the focus of marketing strategies should shift away from this group. The unexpected purchasing (consequent) and unexpected shifting (conditional) patterns can help to better determine where to focus.

4.4.2 Association rules and changes based on (Chen et al, 2005):

Table 4.11: Generated Rules for period 1 Cluster 1
Rule-Index | rule1 | Support | Change Type | Similarity | Sim-Rule-Index
1 | area=normal -> cat11=1 | 0.170843 | Unexpected | 0 | 1
2 | M=25% -> cat1=1 | 0.177677 | | 1 | 7
3 | M=50% -> cat11=1 | 0.18451 | Unexpected | 0 | 1
4 | area=poor -> cat11=1 | 0.220957 | | 1 | 9
5 | area=poor -> cat1=1 | 0.191344 | | 1 | 8
6 | area=poor -> cat3=1 | 0.170843 | | 1 | 10
7 | R=75% -> cat11=1 | 0.170843 | Unexpected | 0 | 1
8 | R=75% -> cat1=1 | 0.193622 | | 1 | 2
9 | R=100% -> cat11=1 | 0.198178 | Unexpected | 0 | 1
10 | R=100% -> cat5=1 | 0.175399 | Unexpected | 0 | 1
11 | M=75% -> cat11=1 | 0.198178 | Unexpected | 0 | 1
12 | M=75% -> cat1=1 | 0.1959 | | 1 | 3
13 | M=75% -> cat5=1 | 0.220957 | Unexpected | 0 | 1
14 | M=75% -> cat3=1 | 0.200456 | Unexpected | 0 | 1
15 | F=25% -> cat11=1 | 0.189066 | | 1 | 13
16 | F=25% -> cat1=1 | 0.173121 | | 1 | 12
17 | F=75% -> cat11=1 | 0.218679 | Unexpected purchasing | 0 | 1
18 | F=75% -> cat1=1 | 0.23918 | | 1 | 1
19 | F=75% -> cat5=1 | 0.230068 | Unexpected purchasing | 0 | 1
20 | F=75% -> cat3=1 | 0.173121 | Unexpected purchasing | 0 | 1

Table 4.12: Generated Rules for period 2 Cluster 1
Rule-Index | rule2 | Support | Change Type | Similarity | Sim-Rule-Index
1 | F=75% -> cat1=1 | 0.18427 | | 1 | 18
2 | R=75% -> cat1=1 | 0.170037 | | 1 | 8
3 | M=75% -> cat1=1 | 0.175281 | | 1 | 12
4 | area=normal -> cat1=1 | 0.193258 | Unexpected purchasing | 0 | 1
5 | R=100% -> cat1=1 | 0.207491 | Unexpected added | 0 | 1
6 | M=50% -> cat1=1 | 0.229213 | Unexpected added | 0 | 1
7 | M=25% -> cat1=1 | 0.213483 | | 1 | 2
8 | area=poor -> cat1=1 | 0.260674 | | 1 | 5
9 | area=poor -> cat11=1 | 0.182772 | | 1 | 4
10 | area=poor -> cat3=1 | 0.170787 | | 1 | 6
11 | F=25% -> cat3=1 | 0.170787 | Unexpected added | 0 | 1
12 | F=25% -> cat1=1 | 0.277903 | | 1 | 16
13 | F=25% -> cat11=1 | 0.183521 | | 1 | 15

Generated rules for cluster2 are as follows:

Table 4.13: Generated Rules for period 1 Cluster 2
Rule-Index | rule1 | Support | Change Type | Similarity | Sim-Rule-Index
1 | F=100%,R=50% -> cat5=1 | 0.178571 | Not | 0.5 | 12
2 | F=100%,area=good -> cat1=1 | 0.178571 | Perished | 0.333333 | 25
3 | F=100%,area=good -> cat5=1 | 0.178571 | Not | 0.5 | 18
4 | area=good -> cat11=1 | 0.178571 | Unexpected | 0 | 1
5 | F=100%,area=rich -> cat3=1 | 0.178571 | Perished | 0.333333 | 26
6 | F=100%,area=rich -> cat11=1 | 0.178571 | Perished | 0.333333 | 27
7 | F=100%,area=rich -> cat1=1,cat5=1 | 0.178571 | Perished | 0.333333 | 21
8 | F=100%,M=50% -> cat2=1 | 0.171429 | Perished | 0.166667 | 21
9 | F=100%,M=50% -> cat3=1 | 0.178571 | Not | 0.5 | 5
10 | F=100%,M=50% -> cat5=1,cat11=1 | 0.178571 | Perished | 0.333333 | 22
11 | F=100%,M=50% -> cat1=1 | 0.178571 | Perished | 0.333333 | 25
12 | M=50% -> cat1=1,cat5=1 | 0.171429 | Not | 0.5 | 4
13 | F=100%,area=normal -> cat3=1,cat13=1 | 0.171429 | Not | 0.5 | 14
14 | F=100%,area=normal -> cat11=1,cat13=1 | 0.178571 | Not | 0.5 | 14
15 | F=100%,area=normal -> cat5=1,cat13=1 | 0.185714 | Not | 0.5 | 14
16 | area=normal -> cat1=1,cat13=1 | 0.171429 | Not | 0.5 | 13
17 | F=100%,area=normal -> cat3=1,cat11=1 | 0.192857 | Not | 0.5 | 19
18 | F=100%,area=normal -> cat1=1,cat3=1 | 0.171429 | | 1 | 20
19 | F=100%,area=normal -> cat3=1,cat5=1 | 0.178571 | Not | 0.5 | 18
20 | F=100%,area=normal -> cat1=1,cat5=1,cat11=1 | 0.171429 | Not | 0.666667 | 19
21 | area=normal -> cat2=1 | 0.171429 | Unexpected | 0 | 1
22 | F=100%,R=25%,M=75% -> cat1=1,cat3=1 | 0.171429 | Not | 0.666667 | 30
23 | F=100%,R=25%,M=75% -> cat1=1,cat11=1 | 0.171429 | Not | 0.666667 | 28
24 | F=100%,R=25%,M=75% -> cat5=1,cat11=1 | 0.171429 | Not | 0.5 | 27
25 | F=100%,R=25%,M=75% -> cat1=1,cat5=1 | 0.185714 | Not | 0.5 | 25
26 | F=100%,M=75% -> cat3=1,cat11=1,cat13=1 | 0.171429 | | 1 | 38
27 | F=100%,M=75% -> cat1=1,cat3=1,cat13=1 | 0.171429 | | 1 | 39
28 | F=100%,M=75% -> cat3=1,cat5=1,cat13=1 | 0.185714 | | 1 | 33
29 | M=75% -> cat5=1,cat11=1,cat13=1 | 0.171429 | Not | 0.75 | 34
30 | M=75% -> cat1=1,cat11=1,cat13=1 | 0.171429 | Not | 0.75 | 34
31 | F=100%,M=75% -> cat1=1,cat5=1,cat13=1 | 0.178571 | Not | 0.666667 | 33
32 | F=100%,M=75% -> cat1=1,cat3=1,cat11=1 | 0.2 | | 1 | 45
33 | M=75% -> cat1=1,cat3=1,cat5=1,cat11=1 | 0.192857 | | 1 | 44
34 | F=100%,M=75% -> cat3=1,cat5=1,cat11=1 | 0.2 | | 1 | 42
35 | F=100%,M=75% -> cat1=1,cat3=1,cat5=1 | 0.207143 | | 1 | 43
36 | F=100%,M=75% -> cat1=1,cat5=1,cat11=1 | 0.214286 | | 1 | 41
37 | M=75% -> cat2=1 | 0.171429 | Unexpected | 0 | 1
38 | F=100%,R=25% -> cat2=1,cat3=1,cat11=1 | 0.171429 | Not | 0.666667 | 53
39 | R=25% -> cat1=1,cat2=1,cat3=1,cat11=1 | 0.178571 | Not | 0.75 | 55
40 | F=100%,R=25% -> cat1=1,cat2=1,cat3=1 | 0.192857 | Not | 0.666667 | 54
41 | R=25% -> cat1=1,cat2=1,cat3=1,cat5=1 | 0.178571 | Not | 0.75 | 51
42 | F=100%,R=25% -> cat2=1,cat3=1,cat5=1 | 0.178571 | Not | 0.666667 | 47
43 | F=100%,R=25% -> cat1=1,cat2=1,cat11=1 | 0.235714 | Not | 0.666667 | 52
44 | R=25% -> cat1=1,cat2=1,cat5=1,cat11=1 | 0.207143 | Not | 0.75 | 49
45 | F=100%,R=25% -> cat2=1,cat5=1,cat11=1 | 0.214286 | Not | 0.666667 | 48
46 | F=100%,R=25% -> cat1=1,cat2=1,cat5=1 | 0.235714 | Not | 0.666667 | 46
47 | F=100% -> cat2=1,cat3=1,cat11=1,cat13=1 | 0.178571 | Not | 0.75 | 23
48 | F=100% -> cat1=1,cat2=1,cat3=1,cat13=1 | 0.192857 | Not | 0.75 | 23
49 | F=100% -> cat2=1,cat3=1,cat5=1,cat13=1 | 0.185714 | Not | 0.75 | 62
50 | F=100% -> cat1=1,cat2=1,cat11=1,cat13=1 | 0.207143 | Not | 0.75 | 23
51 | F=100% -> cat2=1,cat5=1,cat11=1,cat13=1 | 0.192857 | Not | 0.75 | 22
52 | F=100% -> cat1=1,cat2=1,cat5=1,cat13=1 | 0.221429 | Not | 0.75 | 21
53 | F=100% -> cat1=1,cat2=1,cat3=1,cat11=1 | 0.257143 | | 1 | 23
54 | F=100% -> cat2=1,cat3=1,cat5=1,cat11=1 | 0.228571 | Not | 0.75 | 22
55 | F=100% -> cat1=1,cat2=1,cat3=1,cat5=1 | 0.257143 | Not | 0.75 | 21
56 | F=100% -> cat1=1,cat2=1,cat5=1,cat11=1 | 0.292857 | Not | 0.75 | 21
57 | F=100%,R=25% -> cat3=1,cat11=1,cat13=1 | 0.221429 | | 1 | 53
58 | R=25% -> cat3=1,cat5=1,cat11=1,cat13=1 | 0.214286 | | 1 | 50
59 | R=25% -> cat1=1,cat3=1,cat11=1,cat13=1 | 0.207143 | | 1 | 55
60 | F=100%,R=25% -> cat1=1,cat3=1,cat13=1 | 0.221429 | | 1 | 54
61 | R=25% -> cat1=1,cat3=1,cat5=1,cat13=1 | 0.214286 | | 1 | 51
62 | F=100%,R=25% -> cat3=1,cat5=1,cat13=1 | 0.221429 | | 1 | 47
63 | F=100%,R=25% -> cat1=1,cat11=1,cat13=1 | 0.228571 | | 1 | 52
64 | R=25% -> cat1=1,cat5=1,cat11=1,cat13=1 | 0.228571 | | 1 | 49
65 | F=100%,R=25% -> cat5=1,cat11=1,cat13=1 | 0.228571 | | 1 | 48
66 | F=100%,R=25% -> cat1=1,cat5=1,cat13=1 | 0.242857 | | 1 | 46
67 | F=100%,R=25% -> cat1=1,cat3=1,cat11=1 | 0.285714 | | 1 | 60
68 | R=25% -> cat1=1,cat3=1,cat5=1,cat11=1 | 0.271429 | | 1 | 59
69 | F=100%,R=25% -> cat3=1,cat5=1,cat11=1 | 0.271429 | | 1 | 57
70 | F=100%,R=25% -> cat1=1,cat3=1,cat5=1 | 0.292857 | | 1 | 58
71 | F=100%,R=25% -> cat1=1,cat5=1,cat11=1 | 0.328571 | | 1 | 56
72 | F=100% -> cat1=1,cat3=1,cat11=1,cat13=1 | 0.321429 | | 1 | 64
73 | F=100% -> cat3=1,cat5=1,cat11=1,cat13=1 | 0.35 | | 1 | 63
74 | F=100% -> cat1=1,cat3=1,cat5=1,cat13=1 | 0.342857 | | 1 | 63
75 | F=100% -> cat1=1,cat5=1,cat11=1,cat13=1 | 0.35 | | 1 | 61
76 | F=100% -> cat1=1,cat3=1,cat5=1,cat11=1 | 0.407143 | | 1 | 65

Table 4.14: Generated Rules for period 2 Cluster 2
Rule-Index | rule2 | Support | Change Type | Similarity | Sim-Rule-Index
1 | area=rich -> cat1=1 | 0.177378 | Added | 0.25 | 7
2 | area=poor -> cat1=1,cat3=1 | 0.187661 | Unexpected added | 0 | 1
3 | area=poor -> cat11=1 | 0.172237 | Unexpected added | 0 | 1
4 | M=50% -> cat1=1,cat11=1 | 0.172237 | Not Added | 0.5 | 12
5 | M=50% -> cat3=1 | 0.190231 | Not Added | 0.5 | 9
6 | M=50% -> cat5=1 | 0.179949 | Not Added | 0.5 | 12
7 | F=75% -> cat1=1,cat11=1 | 0.179949 | Unexpected added | 0 | 1
8 | F=75% -> cat1=1,cat3=1 | 0.197943 | Unexpected added | 0 | 1
9 | R=50% -> cat1=1,cat13=1 | 0.172237 | Unexpected purchasing | 0 | 1
10 | R=50% -> cat1=1,cat11=1 | 0.187661 | Unexpected purchasing | 0 | 1
11 | R=50% -> cat1=1,cat3=1 | 0.203085 | Unexpected purchasing | 0 | 1
12 | R=50% -> cat5=1 | 0.192802 | Not Added | 0.5 | 1
13 | area=normal -> cat1=1,cat3=1,cat11=1,cat13=1 | 0.172237 | Not Added | 0.5 | 16
14 | F=100%,area=normal -> cat13=1 | 0.174807 | Not Added | 0.5 | 13
15 | area=normal -> cat5=1,cat13=1 | 0.18509 | Not Added | 0.5 | 15
16 | area=normal -> cat1=1,cat5=1,cat11=1 | 0.177378 | Not Added | 0.5 | 20
17 | area=normal -> cat1=1,cat3=1,cat5=1 | 0.182519 | Added | 0.333333 | 16
18 | F=100%,area=normal -> cat5=1 | 0.172237 | Not Added | 0.5 | 1
19 | F=100%,area=normal -> cat1=1,cat11=1 | 0.190231 | Not Added | 0.666667 | 20
20 | F=100%,area=normal -> cat1=1,cat3=1 | 0.179949 | | 1 | 18
21 | F=100% -> cat1=1,cat2=1,cat5=1 | 0.179949 | Not Added | 0.75 | 52
22 | F=100% -> cat2=1,cat5=1,cat11=1 | 0.177378 | Not Added | 0.75 | 51
23 | F=100% -> cat1=1,cat2=1,cat3=1,cat11=1 | 0.174807 | | 1 | 53
24 | R=25% -> cat2=1 | 0.200514 | Added | 0.25 | 39
25 | F=100%,R=25%,M=75% -> cat1=1 | 0.195373 | Not Added | 0.5 | 22
26 | F=100%,R=25%,M=75% -> cat3=1 | 0.18509 | Not Added | 0.5 | 22
27 | F=100%,R=25%,M=75% -> cat11=1 | 0.174807 | Not Added | 0.5 | 23
28 | R=25%,M=75% -> cat1=1,cat11=1 | 0.190231 | Not Added | 0.666667 | 23
29 | R=25%,M=75% -> cat3=1,cat11=1 | 0.182519 | Added | 0.333333 | 22
30 | R=25%,M=75% -> cat1=1,cat3=1 | 0.195373 | Not Added | 0.666667 | 22
31 | R=25%,M=75% -> cat5=1 | 0.192802 | Added | 0.333333 | 24
32 | R=25%,M=75% -> cat13=1 | 0.182519 | Added | 0.166667 | 26
33 | F=100%,M=75% -> cat3=1,cat5=1,cat13=1 | 0.179949 | | 1 | 28
34 | M=75% -> cat1=1,cat5=1,cat11=1,cat13=1 | 0.190231 | Not Added | 0.75 | 29
35 | M=75% -> cat3=1,cat5=1,cat11=1,cat13=1 | 0.187661 | Not Added | 0.75 | 29
36 | M=75% -> cat1=1,cat3=1,cat5=1,cat13=1 | 0.197943 | Not Added | 0.75 | 33
37 | F=100%,M=75% -> cat1=1,cat11=1,cat13=1 | 0.187661 | Not Added | 0.666667 | 26
38 | F=100%,M=75% -> cat3=1,cat11=1,cat13=1 | 0.18509 | | 1 | 26
39 | F=100%,M=75% -> cat1=1,cat3=1,cat13=1 | 0.192802 | | 1 | 27
40 | M=75% -> cat1=1,cat3=1,cat11=1,cat13=1 | 0.226221 | Not Added | 0.75 | 30
41 | F=100%,M=75% -> cat1=1,cat5=1,cat11=1 | 0.197943 | | 1 | 36
42 | F=100%,M=75% -> cat3=1,cat5=1,cat11=1 | 0.197943 | | 1 | 34
43 | F=100%,M=75% -> cat1=1,cat3=1,cat5=1 | 0.213368 | | 1 | 35
44 | M=75% -> cat1=1,cat3=1,cat5=1,cat11=1 | 0.226221 | | 1 | 33
45 | F=100%,M=75% -> cat1=1,cat3=1,cat11=1 | 0.233933 | | 1 | 32
46 | F=100%,R=25% -> cat1=1,cat5=1,cat13=1 | 0.182519 | | 1 | 66
47 | F=100%,R=25% -> cat3=1,cat5=1,cat13=1 | 0.187661 | | 1 | 62
48 | F=100%,R=25% -> cat5=1,cat11=1,cat13=1 | 0.172237 | | 1 | 65
49 | R=25% -> cat1=1,cat5=1,cat11=1,cat13=1 | 0.174807 | | 1 | 64
50 | R=25% -> cat3=1,cat5=1,cat11=1,cat13=1 | 0.177378 | | 1 | 58
51 | R=25% -> cat1=1,cat3=1,cat5=1,cat13=1 | 0.187661 | | 1 | 61
52 | F=100%,R=25% -> cat1=1,cat11=1,cat13=1 | 0.192802 | | 1 | 63
53 | F=100%,R=25% -> cat3=1,cat11=1,cat13=1 | 0.195373 | | 1 | 57
54 | F=100%,R=25% -> cat1=1,cat3=1,cat13=1 | 0.203085 | | 1 | 60
55 | R=25% -> cat1=1,cat3=1,cat11=1,cat13=1 | 0.200514 | | 1 | 59
56 | F=100%,R=25% -> cat1=1,cat5=1,cat11=1 | 0.197943 | | 1 | 71
57 | F=100%,R=25% -> cat3=1,cat5=1,cat11=1 | 0.205656 | | 1 | 69
58 | F=100%,R=25% -> cat1=1,cat3=1,cat5=1 | 0.215938 | | 1 | 70
59 | R=25% -> cat1=1,cat3=1,cat5=1,cat11=1 | 0.208226 | | 1 | 68
60 | F=100%,R=25% -> cat1=1,cat3=1,cat11=1 | 0.244216 | | 1 | 67
61 | F=100% -> cat1=1,cat5=1,cat11=1,cat13=1 | 0.257069 | | 1 | 75
62 | F=100% -> cat3=1,cat5=1,cat11=1,cat13=1 | 0.25964 | | 1 | 73
63 | F=100% -> cat1=1,cat3=1,cat5=1,cat13=1 | 0.272494 | | 1 | 74
64 | F=100% -> cat1=1,cat3=1,cat11=1,cat13=1 | 0.298201 | | 1 | 72
65 | F=100% -> cat1=1,cat3=1,cat5=1,cat11=1 | 0.303342 | | 1 | 76

Table 4.15: Generated Rules for period 1 Cluster 3
Rule-Index | rule1 | Support | Change Type | Similarity | Sim-Rule-Index
1 | F=50%,M=100% -> cat5=1 | 0.188235294 | | 1 | 15
2 | M=100%,area=good -> cat5=1 | 0.188235294 | Not | 0.5 | 1
3 | R=50%,M=100% -> cat5=1 | 0.211764706 | | 1 | 1
4 | F=25%,M=100% -> cat5=1 | 0.211764706 | | 1 | 9
5 | R=75%,M=100% -> cat5=1 | 0.188235294 | | 1 | 6
6 | M=100%,area=poor -> cat11=1 | 0.176470588 | Not | 0.5 | 14
7 | M=100%,area=poor -> cat5=1 | 0.258823529 | Not | 0.5 | 1
8 | M=100%,area=normal -> cat5=1 | 0.235294118 | | 1 | 4
9 | R=100%,M=100% -> cat5=1 | 0.282352941 | | 1 | 12
10 | F=75%,M=100% -> cat1=1,cat13=1 | 0.2 | Not | 0.5 | 16
11 | F=75%,M=100% -> cat3=1,cat13=1 | 0.188235294 | Not | 0.5 | 16
12 | F=75%,M=100% -> cat11=1,cat13=1 | 0.176470588 | Not | 0.5 | 16
13 | F=75%,M=100% -> cat5=1,cat13=1 | 0.2 | | 1 | 16
14 | M=100% -> cat1=1,cat3=1,cat11=1,cat13=1 | 0.211764706 | | 1 | 25
15 | M=100% -> cat1=1,cat3=1,cat5=1,cat13=1 | 0.211764706 | | 1 | 27
16 | M=100% -> cat1=1,cat5=1,cat11=1,cat13=1 | 0.223529412 | | 1 | 26
17 | M=100% -> cat3=1,cat5=1,cat11=1,cat13=1 | 0.247058824 | | 1 | 28
18 | F=75%,M=100% -> cat1=1,cat3=1,cat11=1 | 0.176470588 | Not | 0.666666667 | 18
19 | F=75%,M=100% -> cat1=1,cat3=1,cat5=1 | 0.2 | Not | 0.666666667 | 20
20 | F=75%,M=100% -> cat1=1,cat5=1,cat11=1 | 0.188235294 | Not | 0.666666667 | 19
21 | F=75%,M=100% -> cat3=1,cat5=1,cat11=1 | 0.188235294 | Not | 0.666666667 | 18
22 | M=100% -> cat1=1,cat3=1,cat5=1,cat11=1 | 0.247058824 | | 1 | 29

Table 4.16: Generated Rules for period 2 Cluster 3
Rule-Index | rule2 | Support | Change Type | Similarity | Sim-Rule-Index
1 | R=50%,M=100% -> cat5=1 | 0.171806 | | 1 | 3
2 | M=100%,area=normal -> cat1=1 | 0.185022 | Added | 0.25 | 10
3 | M=100%,area=normal -> cat3=1 | 0.180617 | Added | 0.25 | 11
4 | M=100%,area=normal -> cat5=1 | 0.189427 | | 1 | 8
5 | R=75%,M=100% -> cat3=1 | 0.171806 | Added | 0.25 | 11
6 | R=75%,M=100% -> cat5=1 | 0.193833 | | 1 | 5
7 | F=25%,M=100% -> cat1=1 | 0.171806 | Added | 0.25 | 10
8 | F=25%,M=100% -> cat3=1 | 0.171806 | Added | 0.25 | 11
9 | F=25%,M=100% -> cat5=1 | 0.220264 | | 1 | 4
10 | R=100%,M=100% -> cat1=1,cat11=1 | 0.171806 | Added | 0.333333 | 18
11 | R=100%,M=100% -> cat3=1,cat11=1 | 0.171806 | Added | 0.333333 | 18
12 | R=100%,M=100% -> cat5=1 | 0.229075 | | 1 | 9
13 | F=50%,M=100% -> cat1=1,cat3=1 | 0.171806 | Added | 0.333333 | 18
14 | F=50%,M=100% -> cat11=1 | 0.198238 | Not Added | 0.5 | 6
15 | F=50%,M=100% -> cat5=1 | 0.22467 | | 1 | 1
16 | F=75%,M=100% -> cat5=1,cat13=1 | 0.180617 | | 1 | 13
17 | F=75%,M=100% -> cat1=1 | 0.189427 | Not Added | 0.5 | 10
18 | F=75%,M=100% -> cat3=1,cat11=1 | 0.193833 | Not Added | 0.666667 | 18
19 | F=75%,M=100% -> cat5=1,cat11=1 | 0.180617 | Not Added | 0.666667 | 20
20 | F=75%,M=100% -> cat3=1,cat5=1 | 0.202643 | Not Added | 0.666667 | 19
21 | M=100%,area=poor -> cat1=1 | 0.198238 | Added | 0.25 | 10
22 | M=100%,area=poor -> cat3=1,cat11=1 | 0.185022 | Not Added | 0.5 | 6
23 | M=100%,area=poor -> cat5=1,cat11=1 | 0.193833 | Not Added | 0.5 | 6
24 | M=100%,area=poor -> cat3=1,cat5=1 | 0.202643 | Not Added | 0.5 | 7
25 | M=100% -> cat1=1,cat3=1,cat11=1,cat13=1 | 0.23348 | | 1 | 14
26 | M=100% -> cat1=1,cat5=1,cat11=1,cat13=1 | 0.237885 | | 1 | 16
27 | M=100% -> cat1=1,cat3=1,cat5=1,cat13=1 | 0.264317 | | 1 | 15
28 | M=100% -> cat3=1,cat5=1,cat11=1,cat13=1 | 0.264317 | | 1 | 17
29 | M=100% -> cat1=1,cat3=1,cat5=1,cat11=1 | 0.303965 | | 1 | 22

Table 4.17: Generated Rules for period 1 Cluster 4
Rule-Index | rule1 | Support | Change Type | Similarity | Sim-Rule-Index
1 | F=100%,M=100%,area=rich -> cat3=1 | 0.178862 | | 1 | 4
2 | F=100%,M=100%,area=rich -> cat5=1 | 0.195122 | Not | 0.666667 | 30
3 | M=100% -> cat2=1,cat11=1 | 0.170732 | Not | 0.666667 | 11
4 | M=100% -> cat2=1,cat3=1,cat5=1 | 0.170732 | Not | 0.75 | 17
5 | F=100%,M=100% -> cat1=1,cat2=1,cat5=1 | 0.178862 | | 1 | 16
6 | R=25%,M=100%,area=poor -> cat3=1,cat5=1 | 0.170732 | Not | 0.666667 | 10
7 | F=100%,R=25%,area=poor -> cat3=1,cat5=1 | 0.170732 | Not | 0.666667 | 65
8 | F=100%,R=25%,M=100%,area=poor -> cat3=1 | 0.170732 | Not | 0.5 | 4
9 | R=25%,M=100%,area=poor -> cat5=1,cat13=1 | 0.186992 | Not | 0.666667 | 57
10 | F=100%,R=25%,area=poor -> cat5=1,cat13=1 | 0.186992 | Not | 0.666667 | 57
11 | F=100%,R=25%,M=100%,area=poor -> cat13=1 | 0.186992 | Not | 0.5 | 7
12 | F=100%,R=25%,M=100%,area=poor -> cat5=1 | 0.195122 | Not | 0.5 | 30
13 | M=100%,area=poor -> cat1=1,cat3=1,cat11=1 | 0.170732 | Not | 0.666667 | 8
14 | area=poor -> cat1=1,cat3=1,cat5=1,cat11=1 | 0.170732 | Perished | 0.25 | 8
15 | area=poor -> cat1=1,cat3=1,cat11=1,cat13=1 | 0.170732 | Perished | 0.25 | 8
16 | M=100%,area=poor -> cat3=1,cat11=1,cat13=1 | 0.178862 | Not | 0.5 | 39
17 | F=100%,area=poor -> cat3=1,cat11=1,cat13=1 | 0.170732 | Not | 0.5 | 34
18 | area=poor -> cat3=1,cat5=1,cat11=1,cat13=1 | 0.178862 | Perished | 0.25 | 10
19 | M=100%,area=poor -> cat3=1,cat5=1,cat11=1 | 0.186992 | Not | 0.666667 | 10
20 | F=100%,area=poor -> cat3=1,cat5=1,cat11=1 | 0.178862 | Not | 0.5 | 46
21 | F=100%,M=100%,area=poor -> cat3=1,cat11=1 | 0.178862 | Not | 0.666667 | 45
22 | M=100%,area=poor -> cat1=1,cat11=1,cat13=1 | 0.170732 | Not | 0.5 | 36
23 | area=poor -> cat1=1,cat5=1,cat11=1,cat13=1 | 0.170732 | Perished | 0.25 | 9
24 | M=100%,area=poor -> cat1=1,cat5=1,cat11=1 | 0.170732 | Not | 0.666667 | 9
25 | M=100%,area=poor -> cat5=1,cat11=1,cat13=1 | 0.186992 | Not | 0.5 | 41
26 | F=100%,area=poor -> cat5=1,cat11=1,cat13=1 | 0.178862 | Not | 0.5 | 33
27 | F=100%,M=100%,area=poor -> cat11=1,cat13=1 | 0.178862 | Not | 0.666667 | 32
28 | F=100%,M=100%,area=poor -> cat5=1,cat11=1 | 0.186992 | Not | 0.666667 | 47
29 | M=100%,area=poor -> cat1=1,cat3=1,cat13=1 | 0.186992 | Not | 0.666667 | 8
30 | F=100%,area=poor -> cat1=1,cat3=1,cat13=1 | 0.178862 | Not | 0.5 | 54
31 | area=poor -> cat1=1,cat3=1,cat5=1,cat13=1 | 0.186992 | Perished | 0.25 | 8
32 | M=100%,area=poor -> cat1=1,cat3=1,cat5=1 | 0.186992 | Not | 0.666667 | 8
33 | F=100%,area=poor -> cat1=1,cat3=1,cat5=1 | 0.178862 | Not | 0.5 | 63
34 | F=100%,M=100%,area=poor -> cat1=1,cat3=1 | 0.178862 | Not | 0.666667 | 8
35 | M=100%,area=poor -> cat3=1,cat5=1,cat13=1 | 0.203252 | Not | 0.666667 | 10
36 | F=100%,area=poor -> cat3=1,cat5=1,cat13=1 | 0.195122 | Not | 0.5 | 56
37 | F=100%,M=100%,area=poor -> cat3=1,cat13=1 | 0.195122 | Not | 0.666667 | 55
38 | F=100%,M=100%,area=poor -> cat3=1,cat5=1 | 0.203252 | Not | 0.666667 | 10
39 | M=100%,area=poor -> cat1=1,cat5=1,cat13=1 | 0.195122 | Not | 0.666667 | 9
40 | F=100%,area=poor -> cat1=1,cat5=1,cat13=1 | 0.186992 | Not | 0.5 | 53
41 | F=100%,M=100%,area=poor -> cat1=1,cat13=1 | 0.186992 | Not | 0.666667 | 52
42 | F=100%,M=100%,area=poor -> cat1=1,cat5=1 | 0.186992 | Not | 0.666667 | 9
43 | F=100%,M=100%,area=poor -> cat5=1,cat13=1 | 0.219512 | Not | 0.666667 | 57
44 | M=100%,area=normal -> cat3=1,cat5=1,cat11=1 | 0.170732 | Not | 0.666667 | 25
45 | M=100%,area=normal -> cat1=1,cat11=1,cat13=1 | 0.170732 | Not | 0.666667 | 24
46 | area=normal -> cat1=1,cat5=1,cat11=1,cat13=1 | 0.170732 | Perished | 0.25 | 24
47 | M=100%,area=normal -> cat1=1,cat5=1,cat11=1 | 0.186992 | Not | 0.666667 | 24
48 | F=100%,area=normal -> cat1=1,cat5=1,cat11=1 | 0.170732 | Not | 0.5 | 43
49 | F=100%,M=100%,area=normal -> cat1=1,cat11=1 | 0.170732 | Not | 0.666667 | 24
50 | M=100%,area=normal -> cat5=1,cat11=1,cat13=1 | 0.178862 | Not | 0.666667 | 26
51 | F=100%,area=normal -> cat5=1,cat11=1,cat13=1 | 0.170732 | Not | 0.5 | 33
52 | F=100%,M=100%,area=normal -> cat11=1,cat13=1 | 0.170732 | Not | 0.666667 | 32
53 | F=100%,M=100%,area=normal -> cat5=1,cat11=1 | 0.186992 | Not | 0.666667 | 26
54 | M=100%,area=normal -> cat1=1,cat3=1,cat13=1 | 0.170732 | Not | 0.666667 | 27
55 | area=normal -> cat1=1,cat3=1,cat5=1,cat13=1 | 0.170732 | Perished | 0.375 | 28
56 | M=100%,area=normal -> cat1=1,cat3=1,cat5=1 | 0.178862 | | 1 | 31
57 | M=100%,area=normal -> cat3=1,cat5=1,cat13=1 | 0.178862 | | 1 | 28
58 | F=100%,area=normal -> cat3=1,cat5=1,cat13=1 | 0.170732 | Not | 0.5 | 28
59 | F=100%,M=100%,area=normal -> cat3=1,cat13=1 | 0.178862 | Not | 0.666667 | 55
60 | F=100%,M=100%,area=normal -> cat3=1,cat5=1 | 0.170732 | Not | 0.666667 | 21
61 | M=100%,area=normal -> cat1=1,cat5=1,cat13=1 | 0.186992 | Not | 0.666667 | 27
62 | F=100%,area=normal -> cat1=1,cat5=1,cat13=1 | 0.178862 | Not | 0.5 | 53
63 | F=100%,M=100%,area=normal -> cat1=1,cat13=1 | 0.178862 | Not | 0.666667 | 27
64 | F=100%,M=100%,area=normal -> cat1=1,cat5=1 | 0.195122 | Not | 0.666667 | 64
65 | F=100%,M=100%,area=normal -> cat5=1,cat13=1 | 0.203252 | Not | 0.666667 | 57
66 | F=100%,R=25%,M=100%,area=good -> cat1=1 | 0.186992 | Not | 0.5 | 3
67 | F=100%,R=25%,M=100%,area=good -> cat13=1 | 0.186992 | Not | 0.75 | 19
68 | F=100%,M=100%,area=good -> cat1=1,cat11=1 | 0.178862 | Not | 0.666667 | 42
69 | F=100%,M=100%,area=good -> cat5=1,cat11=1 | 0.170732 | Not | 0.666667 | 47
70 F=100%,M=100%,area=good, -> cat3=1, 0.186992 Not 0.666667 4 71 F=100%,M=100%,area=good, -> cat1=1,cat13=1, 0.186992 Not 0.666667 52 72 F=100%,M=100%,area=good, -> cat1=1,cat5=1, 0.186992 Not 0.666667 23 73 R=25%,M=100%, -> cat1=1,cat3=1,cat11=1, 0.439024 74 F=100%,R=25%, -> cat1=1,cat3=1,cat11=1, 0.422764 75 R=25%, -> cat1=1,cat3=1,cat5=1,cat11=1, 0.406504 76 R=25%, -> cat1=1,cat3=1,cat11=1,cat13=1, 0.406504 77 R=25%,M=100%, -> cat3=1,cat11=1,cat13=1, 0.439024 78 F=100%,R=25%, -> cat3=1,cat11=1,cat13=1, 0.430894 79 R=25%, -> cat3=1,cat5=1,cat11=1,cat13=1, 0.414634 80 R=25%,M=100%, -> cat3=1,cat5=1,cat11=1, 0.463415 81 F=100%,R=25%, -> cat3=1,cat5=1,cat11=1, 0.447154 82 F=100%,R=25%,M=100%, -> cat3=1,cat11=1, 0.479675 83 R=25%,M=100%, -> cat1=1,cat11=1,cat13=1, 0.447154 84 F=100%,R=25%, -> cat1=1,cat11=1,cat13=1, 0.439024 85 R=25%, -> cat1=1,cat5=1,cat11=1,cat13=1, 0.422764 86 R=25%,M=100%, -> cat1=1,cat5=1,cat11=1, 0.447154 87 F=100%,R=25%, -> cat1=1,cat5=1,cat11=1, 0.430894 88 F=100%,R=25%,M=100%, -> cat1=1,cat11=1, 0.463415 89 R=25%,M=100%, -> cat5=1,cat11=1,cat13=1, 0.463415 90 F=100%,R=25%, -> cat5=1,cat11=1,cat13=1, 0.455285 91 F=100%,R=25%,M=100%, -> cat11=1,cat13=1, 0.479675 92 F=100%,R=25%,M=100%, -> cat5=1,cat11=1, 0.504065 93 R=25%,M=100%, -> cat1=1,cat3=1,cat13=1, 0.455285 1 48 1 44 1 49 1 38 1 39 1 34 1 40 1 51 1 46 1 45 1 36 1 35 1 37 1 50 1 43 1 42 1 41 1 33 1 32 1 47 1 58 92
94 F=100%,R=25%, -> cat1=1,cat3=1,cat13=1, 0.447154 1 54 95 R=25%, -> cat1=1,cat3=1,cat5=1,cat13=1, 0.422764 1 59 96 R=25%,M=100%, -> cat1=1,cat3=1,cat5=1, 0.455285 1 66 97 F=100%,R=25%, -> cat1=1,cat3=1,cat5=1, 0.439024 1 63 98 F=100%,R=25%,M=100%, -> cat1=1,cat3=1, 0.479675 1 62 99 R=25%,M=100%, -> cat3=1,cat5=1,cat13=1, 0.471545 1 61 100 F=100%,R=25%, -> cat3=1,cat5=1,cat13=1, 0.463415 1 56 101 F=100%,R=25%,M=100%, -> cat3=1,cat13=1, 0.495935 1 55 102 F=100%,R=25%,M=100%, -> cat3=1,cat5=1, 0.512195 1 65 103 R=25%,M=100%, -> cat1=1,cat5=1,cat13=1, 0.479675 1 60 104 F=100%,R=25%, -> cat1=1,cat5=1,cat13=1, 0.471545 1 53 105 F=100%,R=25%,M=100%, -> cat1=1,cat13=1, 0.520325 1 52 106 F=100%,R=25%,M=100%, -> cat1=1,cat5=1, 0.512195 1 64 107 F=100%,R=25%,M=100%, -> cat5=1,cat13=1, 0.544715 1 57 108 M=100%, -> cat1=1,cat3=1,cat11=1,cat13=1, 0.544715 1 73 109 F=100%, -> cat1=1,cat3=1,cat11=1,cat13=1, 0.528455 1 69 110 M=100%, -> cat1=1,cat3=1,cat5=1,cat11=1, 0.560976 1 80 111 F=100%, -> cat1=1,cat3=1,cat5=1,cat11=1, 0.536585 1 77 112 F=100%,M=100%, -> cat1=1,cat3=1,cat11=1, 0.569106 1 76 113 M=100%, -> cat3=1,cat5=1,cat11=1,cat13=1, 0.560976 1 75 114 F=100%, -> cat3=1,cat5=1,cat11=1,cat13=1, 0.544715 1 71 115 F=100%,M=100%, -> cat3=1,cat11=1,cat13=1, 0.569106 1 70 116 F=100%,M=100%, -> cat3=1,cat5=1,cat11=1, 0.601626 1 79 93
117 M=100%, -> cat1=1,cat5=1,cat11=1,cat13=1, 0.569106 118 F=100%, -> cat1=1,cat5=1,cat11=1,cat13=1, 0.552846 119 F=100%,M=100%, -> cat1=1,cat11=1,cat13=1, 0.577236 120 F=100%,M=100%, -> cat1=1,cat5=1,cat11=1, 0.601626 121 F=100%,M=100%, -> cat5=1,cat11=1,cat13=1, 0.601626 122 M=100%, -> cat1=1,cat3=1,cat5=1,cat13=1, 0.585366 123 F=100%, -> cat1=1,cat3=1,cat5=1,cat13=1, 0.569106 124 F=100%,M=100%, -> cat1=1,cat3=1,cat13=1, 0.601626 125 F=100%,M=100%, -> cat1=1,cat3=1,cat5=1, 0.609756 126 F=100%,M=100%, -> cat3=1,cat5=1,cat13=1, 0.626016 127 F=100%,M=100%, -> cat1=1,cat5=1,cat13=1, 0.642276 1 74 1 68 1 67 1 78 1 72 1 85 1 82 1 81 1 86 1 84 1 83 Table4.18: Generated Rules for period 2 Cluster 4 Rule- Index rule1 Support Change Type Similarity 1 R=50%,M=100%, -> cat3=1, 0.173387 Added 0.333333 1 2 M=100%,area=rich, -> cat13=1, 0.177419 Added 0.25 11 3 F=100%,M=100%,area=rich, -> cat1=1, 0.173387 Not Added 0.5 66 4 F=100%,M=100%,area=rich, -> cat3=1, 0.177419 1 1 5 M=100%,area=rich, -> cat1=1,cat3=1, 0.181452 Added 0.333333 1 6 M=100%,area=rich, -> cat3=1,cat5=1, 0.173387 Added 0.333333 1 7 M=100%,area=poor, -> cat13=1, 0.173387 Not Added 0.5 11 8 M=100%,area=poor, -> cat1=1,cat3=1, 0.177419 Not Added 0.666667 13 9 M=100%,area=poor, -> cat1=1,cat5=1, 0.173387 Not Added 0.666667 24 94 Sim- Rule- Index
10 M=100%,area=poor, -> cat3=1,cat5=1, 0.197581 Not Added 0.666667 6 11 M=100%, -> cat1=1,cat2=1,cat11=1, 0.181452 Not Added 0.666667 3 12 M=100%, -> cat2=1,cat3=1,cat11=1, 0.177419 Not Added 0.666667 3 13 M=100%, -> cat2=1,cat5=1,cat11=1, 0.177419 Not Added 0.666667 3 14 M=100%, -> cat2=1,cat13=1, 0.173387 Not Added 0.5 3 15 F=100%,M=100%, -> cat1=1,cat2=1,cat3=1, 0.173387 Not Added 0.666667 5 16 F=100%,M=100%, -> cat1=1,cat2=1,cat5=1, 0.173387 1 5 17 M=100%, -> cat1=1,cat2=1,cat3=1,cat5=1, 0.185484 Not Added 0.75 4 18 M=100%,area=good, -> cat11=1, 0.181452 Added 0.333333 68 19 F=100%,M=100%,area=good, -> cat13=1, 0.189516 Not Added 0.75 67 20 M=100%,area=good, -> cat5=1,cat13=1, 0.173387 Added 0.333333 9 21 F=100%,M=100%,area=good, -> cat3=1,cat5=1, 0.177419 Not Added 0.666667 38 22 M=100%,area=good, -> cat1=1,cat3=1, 0.173387 Added 0.333333 13 23 M=100%,area=good, -> cat1=1,cat5=1, 0.173387 Not Added 0.666667 72 24 M=100%,area=normal, -> cat1=1,cat11=1, 0.181452 Not Added 0.666667 45 25 M=100%,area=normal, -> cat3=1,cat11=1, 0.181452 Not Added 0.666667 44 26 M=100%,area=normal, -> cat5=1,cat11=1, 0.189516 Not Added 0.666667 44 27 M=100%,area=normal, -> cat1=1,cat13=1, 0.177419 Not Added 0.666667 45 28 M=100%,area=normal, -> cat3=1,cat5=1,cat13=1, 0.185484 1 57 29 F=100%,M=100%,area=normal, -> cat3=1, 0.173387 Not Added 0.666667 1 30 F=100%,M=100%,area=normal, -> cat5=1, 0.185484 Not Added 0.666667 2 31 M=100%,area=normal, -> cat1=1,cat3=1,cat5=1, 0.177419 1 56 32 F=100%,R=25%,M=100%, -> cat11=1,cat13=1, 0.375 1 91 33 F=100%,R=25%, -> cat5=1,cat11=1,cat13=1, 0.358871 1 90 34 F=100%,R=25%, -> cat3=1,cat11=1,cat13=1, 0.350806 1 78 35 F=100%,R=25%, -> cat1=1,cat11=1,cat13=1, 0.326613 1 84 36 R=25%,M=100%, -> cat1=1,cat11=1,cat13=1, 0.346774 1 83 37 R=25%, -> cat1=1,cat5=1,cat11=1,cat13=1, 0.330645 1 85 38 R=25%, -> cat1=1,cat3=1,cat11=1,cat13=1, 0.326613 1 76 39 R=25%,M=100%, -> cat3=1,cat11=1,cat13=1, 0.375 1 77 40 R=25%, -> cat3=1,cat5=1,cat11=1,cat13=1, 0.362903 
1 79 41 R=25%,M=100%, -> cat5=1,cat11=1,cat13=1, 0.383065 1 89 42 F=100%,R=25%,M=100%, -> cat1=1,cat11=1, 0.370968 1 88 95
43 F=100%,R=25%, -> cat1=1,cat5=1,cat11=1, 0.350806 1 87 44 F=100%,R=25%, -> cat1=1,cat3=1,cat11=1, 0.350806 1 74 45 F=100%,R=25%,M=100%, -> cat3=1,cat11=1, 0.395161 1 82 46 F=100%,R=25%, -> cat3=1,cat5=1,cat11=1, 0.379032 1 81 47 F=100%,R=25%,M=100%, -> cat5=1,cat11=1, 0.403226 1 92 48 R=25%,M=100%, -> cat1=1,cat3=1,cat11=1, 0.383065 1 73 49 R=25%, -> cat1=1,cat3=1,cat5=1,cat11=1, 0.366935 1 75 50 R=25%,M=100%, -> cat1=1,cat5=1,cat11=1, 0.387097 1 86 51 R=25%,M=100%, -> cat3=1,cat5=1,cat11=1, 0.415323 1 80 52 F=100%,R=25%,M=100%, -> cat1=1,cat13=1, 0.399194 1 105 53 F=100%,R=25%, -> cat1=1,cat5=1,cat13=1, 0.375 1 104 54 F=100%,R=25%, -> cat1=1,cat3=1,cat13=1, 0.370968 1 94 55 F=100%,R=25%,M=100%, -> cat3=1,cat13=1, 0.439516 1 101 56 F=100%,R=25%, -> cat3=1,cat5=1,cat13=1, 0.419355 1 100 57 F=100%,R=25%,M=100%, -> cat5=1,cat13=1, 0.455645 1 107 58 R=25%,M=100%, -> cat1=1,cat3=1,cat13=1, 0.395161 1 93 59 R=25%, -> cat1=1,cat3=1,cat5=1,cat13=1, 0.379032 1 95 60 R=25%,M=100%, -> cat1=1,cat5=1,cat13=1, 0.399194 1 103 61 R=25%,M=100%, -> cat3=1,cat5=1,cat13=1, 0.451613 1 99 62 F=100%,R=25%,M=100%, -> cat1=1,cat3=1, 0.423387 1 98 63 F=100%,R=25%, -> cat1=1,cat3=1,cat5=1, 0.403226 1 97 64 F=100%,R=25%,M=100%, -> cat1=1,cat5=1, 0.431452 1 106 65 F=100%,R=25%,M=100%, -> cat3=1,cat5=1, 0.471774 1 102 66 R=25%,M=100%, -> cat1=1,cat3=1,cat5=1, 0.443548 1 96 67 F=100%,M=100%, -> cat1=1,cat11=1,cat13=1, 0.447581 1 119 68 F=100%, -> cat1=1,cat5=1,cat11=1,cat13=1, 0.419355 1 118 69 F=100%, -> cat1=1,cat3=1,cat11=1,cat13=1, 0.427419 1 109 70 F=100%,M=100%, -> cat3=1,cat11=1,cat13=1, 0.479839 1 115 71 F=100%, -> cat3=1,cat5=1,cat11=1,cat13=1, 0.455645 1 114 72 F=100%,M=100%, -> cat5=1,cat11=1,cat13=1, 0.475806 1 121 73 M=100%, -> cat1=1,cat3=1,cat11=1,cat13=1, 0.491935 1 108 74 M=100%, -> cat1=1,cat5=1,cat11=1,cat13=1, 0.487903 1 117 75 M=100%, -> cat3=1,cat5=1,cat11=1,cat13=1, 0.524194 1 113 96
76 F=100%,M=100%, -> cat1=1,cat3=1,cat11=1, 0.491935 1 112 77 F=100%, -> cat1=1,cat3=1,cat5=1,cat11=1, 0.455645 1 111 78 F=100%,M=100%, -> cat1=1,cat5=1,cat11=1, 0.471774 1 120 79 F=100%,M=100%, -> cat3=1,cat5=1,cat11=1, 0.508065 1 116 80 M=100%, -> cat1=1,cat3=1,cat5=1,cat11=1, 0.560484 1 110 81 F=100%,M=100%, -> cat1=1,cat3=1,cat13=1, 0.512097 1 124 82 F=100%, -> cat1=1,cat3=1,cat5=1,cat13=1, 0.479839 1 123 83 F=100%,M=100%, -> cat1=1,cat5=1,cat13=1, 0.508065 1 127 84 F=100%,M=100%, -> cat3=1,cat5=1,cat13=1, 0.560484 1 126 85 M=100%, -> cat1=1,cat3=1,cat5=1,cat13=1, 0.552419 1 122 86 F=100%,M=100%, -> cat1=1,cat3=1,cat5=1, 0.544355 1 125

4.4.3 Rules with discrete variables in RHS:
Following (Chen et al, 2005), the RHS of each rule so far records only whether the customer bought a product category or not. In this section, we build rules whose RHS shows how many times a product category was bought, and we compare them with the Chen similarity formula. We then modify the Chen similarity formula with the Manhattan distance, which calculates the difference between the values of each attribute in two rules. For each cluster we have four rule sets: for each period we have one itemset, and we compare the generated rules once with the Chen similarity formula and once with the modified formula based on Manhattan distance. There are two steps: first, discretizing the number of purchases for each product category; second, generating association rules and comparing them.

1. Discretization of the number of purchases for each product category:
First, we discretize the purchase frequency of each product category for each customer over time. For example, cat1=20 means that the selected customer purchased cat1 20 times in the selected period. The number of purchases for each product category is discretized; the results and their histograms are as follows.
Table4.21: Cat3 quantile

| Category3 Variable | Quantile interval |
|---|---|
| 1st Quantile | 0 to 0 |
| 2nd Quantile | 0 to 1 |
| 3rd Quantile | 1 to 9 |
| 4th Quantile | 9 to 1050 |

Figure4.9: Cat3 histogram
Table4.23: Cat11 quantile

| Category11 Variable | Quantile interval |
|---|---|
| 1st Quantile | 0 to 0 |
| 2nd Quantile | 0 to 1 |
| 3rd Quantile | 1 to 5 |
| 4th Quantile | 5 to 238 |

Figure4.11: Cat11 histogram
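As a rough illustration of the discretization step, the sketch below maps raw purchase counts to quartile codes. The function names, the coding of the four quantiles as 0.25, 0.5, 0.75 and 1, and the boundary handling (a count equal to a cut point falls in the lower interval) are assumptions made for illustration; they are not necessarily the exact procedure used in this study.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_cuts(counts):
    """Return the three cut points that split the counts into four quantiles."""
    return quantiles(counts, n=4, method="inclusive")


def discretize(count, cuts):
    """Map one purchase count to 0.25, 0.5, 0.75 or 1.0.

    Assumption: a count equal to a cut point belongs to the lower interval.
    """
    return 0.25 * (bisect_left(cuts, count) + 1)


# Hypothetical cut points read off an interval list such as
# "0 to 0, 0 to 1, 1 to 9, 9 to 1050".
example_cuts = [0, 1, 9]
codes = [discretize(c, example_cuts) for c in (0, 1, 5, 9, 20)]
print(codes)
```

With many customers who never buy a category, several quantiles collapse onto zero, so the choice of boundary rule noticeably affects which code a count of 0 or 1 receives.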
Table4.24: Cat13 quantile

| Category13 Variable | Quantile interval |
|---|---|
| 1st Quantile | 0 to 0 |
| 2nd Quantile | 0 to 0 |
| 3rd Quantile | 1 to 2 |
| 4th Quantile | 2 to 122 |

Figure4.12: Cat13 histogram

4.4.4 Change mining with Manhattan distance
The generated rules and detected changes are as follows.

Cluster1: Change mining by (Chen et al, 2005) measures & by Manhattan distance

Table4.25: Generated Rules for period 1 Cluster 1, Change mining by (Chen et al, 2005) measures & Manhattan distance

| Rule Index | rule1 | Support | Change Type | Similarity | Change Type -M | Similarity -M | Sim-Rule-Index1 | Sim-Rule-Index2 |
|---|---|---|---|---|---|---|---|---|
| 1 | F=0.25, -> cat11=0.25, | 0.12984 | | 1.000 | | 1.000 | 2 | 2 |
| 2 | F=0.25, -> cat3=0.25, | 0.10251 | | 1.000 | | 1.000 | 1 | 1 |
| 3 | F=0.25, -> cat1=0.25, | 0.14123 | Not | 0.500 | Not | 0.500 | 4 | 4 |
| 4 | R=1, -> cat1=0.25, | 0.10251 | Not | 0.500 | Not | 0.500 | 4 | 4 |
| 5 | R=0.75, -> cat1=0.25, | 0.11162 | Unexpected | 0.000 | Perished | 0.375 | 1 | 4 |
| 6 | M=0.25, -> cat1=0.25, | 0.12756 | Not | 0.500 | Not | 0.500 | 5 | 5 |
| 7 | F=0.25, -> cat5=0.25, | 0.13212 | | 1.000 | | 1.000 | 3 | 3 |
| 8 | M=0.75, -> cat5=0.25, | 0.12301 | Unexpected | 0.000 | Unexpected | 0.000 | 1 | 1 |
| 9 | R=1, -> cat5=0.25, | 0.10023 | Unexpected | 0.000 | Unexpected | 0.000 | 1 | 1 |
| 10 | R=0.75, -> cat5=0.25, | 0.10251 | Unexpected | 0.000 | Unexpected | 0.000 | 1 | 1 |

Table 4.26: Generated rules for period 2 Cluster 1, Change mining by (Chen et al, 2005) measures & Manhattan distance

| Rule Index | rule2 | Support | Change Type | Similarity | Change Type -M | Similarity -M | Sim-Rule-Index1 | Sim-Rule-Index2 |
|---|---|---|---|---|---|---|---|---|
| 1 | F=0.25, -> cat3=0.25, | 0.12734 | | 1.000 | | 1.000 | 2 | 2 |
| 2 | F=0.25, -> cat11=0.25, | 0.13483 | | 1.000 | | 1.000 | 1 | 1 |
| 3 | F=0.25, -> cat5=0.25, | 0.12509 | | 1.000 | | 1.000 | 7 | 7 |
| 4 | F=0.25,R=1, -> cat1=0.25, | 0.10037 | Not Added | 0.500 | Not Added | 0.500 | 3 | 3 |
| 5 | F=0.25,M=0.25, -> cat1=0.25 | 0.10337 | Not Added | 0.500 | Not Added | 0.500 | 3 | 3 |
| 6 | F=0.25,area=1, -> cat1=0.25 | 0.10337 | Not Added | 0.500 | Not Added | 0.500 | 3 | 3 |

Cluster2: Change mining by (Chen et al, 2005) measures & by Manhattan distance

Table4.27: Generated Rules for period 1 Cluster 2, Change mining by (Chen et al, 2005) measures & Manhattan distance

| Rule Index | rule1 | Support | Change Type | Similarity | Change Type -M | Similarity -M | Sim-Rule-Index1 | Sim-Rule-Index2 |
|---|---|---|---|---|---|---|---|---|
| 1 | F=1, -> cat1=0.25, | 0.10714 | Unexpected purchasing | 0.000 | Not | 0.500 | 1 | 11 |
| 2 | F=1, -> cat11=0.25, | 0.10714 | Unexpected purchasing | 0.000 | Not | 0.500 | 1 | 1 |
| 3 | F=1, -> cat11=0.75, | 0.12857 | | 1.000 | | 1.000 | 1 | 1 |
| 4 | F=1,R=0.25, -> cat5=0.75, | 0.10714 | Not | 0.500 | Not | 0.500 | 6 | 6 |
| 5 | F=1, -> cat3=0.75,cat11=1, | 0.11429 | Not | 0.500 | Not | 0.583 | 8 | 27 |
| 6 | F=1,R=0.25, -> cat3=0.75, | 0.10000 | Not | 0.500 | Not | 0.500 | 8 | 8 |
| 7 | M=0.75, -> cat3=0.75, | 0.10000 | | 1.000 | | 1.000 | 9 | 9 |
| 8 | F=1,M=0.75, -> cat1=0.75, | 0.10000 | Not | 0.500 | Not | 0.656 | 11 | 22 |
| 9 | F=1, -> cat1=0.75,cat5=1, | 0.10000 | Not | 0.500 | Not | 0.500 | 11 | 11 |
| 10 | F=1,R=0.25, -> cat1=0.75, | 0.10000 | Not | 0.500 | Not | 0.656 | 11 | 24 |
| 11 | area=0.5, -> cat1=0.75, | 0.10000 | Unexpected | 0.000 | Not | 0.563 | 1 | 7 |
| 12 | F=1,R=0.5, -> cat5=1, | 0.12143 | Not | 0.500 | Not | 0.500 | 17 | 17 |
| 13 | F=1,area=0.75, -> cat13=0.75, | 0.10000 | Not | 0.500 | Not | 0.500 | 19 | 19 |
| 14 | F=1,M=0.75, -> cat13=0.75, | 0.10714 | Not | 0.500 | Not | 0.500 | 19 | 19 |
| 15 | F=1,R=0.25, -> cat13=0.75, | 0.11429 | Not | 0.500 | Not | 0.750 | 19 | 4 |
| 16 | F=1,area=0.5, -> cat5=1, | 0.11429 | Not | 0.500 | Not | 0.500 | 17 | 17 |
| 17 | F=1, -> cat1=1,cat13=1, | 0.13571 | Perished | 0.333 | Perished | 0.375 | 27 | 11 |
| 18 | F=1, -> cat5=1,cat13=1, | 0.10000 | Not | 0.500 | Not | 0.500 | 16 | 16 |
| 19 | F=1,R=0.25, -> cat13=1, | 0.14286 | | 1.000 | | 1.000 | 4 | 4 |
| 20 | M=0.75, -> cat13=1, | 0.10000 | | 1.000 | | 1.000 | 5 | 5 |
| 21 | F=1,R=0.25,area=0.25, -> cat1=1, | 0.10000 | Not | 0.667 | Not | 0.833 | 35 | 35 |
| 22 | F=1,area=0.25, -> cat5=1, | 0.13571 | Not | 0.500 | Not | 0.500 | 17 | 17 |
| 23 | F=1,M=0.5, -> cat2=1, | 0.12143 | Not | 0.500 | Not | 0.500 | 2 | 2 |
| 24 | F=1,M=0.5, -> cat1=1, | 0.10714 | | 1.000 | | 1.000 | 22 | 22 |
| 25 | F=1,M=0.5, -> cat5=1, | 0.15000 | Not | 0.500 | Not | 0.875 | 17 | 17 |
| 26 | R=0.25, -> cat2=1,cat3=1, | 0.10000 | Not | 0.500 | Not | 0.500 | 3 | 3 |
| 27 | F=1,area=0.75, -> cat3=1, | 0.10714 | | 1.000 | | 1.000 | 28 | 28 |
| 28 | F=1,R=0.25, -> cat3=1,cat11=1, | 0.10000 | Not | 0.500 | Not | 0.500 | 26 | 26 |
| 29 | F=1,R=0.25, -> cat1=1,cat3=1, | 0.10000 | | 1.000 | | 1.000 | 30 | 30 |
Rule Index Table4. 28: Generated Rules for period 2 Cluster 2, Change mining by (Chen e t al, 2005) measures &Manhattan distance Sim- Rule- Index1 rule2 Support Change Type Similarity Change Type -M Similarity -M Sim- Rule- Index2 1 F=1, -> cat11=0.75, 2 F=1, -> cat2=1, 3 R=0.25, -> cat2=1, 4 F=1,R=0.25, - > cat13=1 5 M=0.75, -> cat13=1, 6 F=1, -> cat5=0.75, 7 area=0.25, -> cat 1=1, 8 F=1, -> cat3=0.75, 9 M=0.75, -> cat3=0.75, 10 F=0.75, -> cat3=0.75, F=1, -> cat1=0.75, 12 R=0.25, -> cat1=0.75, 13 M=0.75, -> cat1=0.75 0.10283 1.000 1.000 3 3 0.12339 Not Added 0.500 Not Added 0.500 23 23 0.11054 Not Added 0.500 Not 0.500 26 26 Added 0.10026 1.000 1.000 19 19 0.10283 1.000 1.000 20 20 0.11568 Not Added 0.500 Not Added 0.500 4 4 0.10540 Added 0.333 Not 0.563 21 11 Added 0.11054 Not Added 0.500 Not 0.500 5 5 Added 0.11825 1.000 1.000 7 7 0.10283 Unexpected added 0.000 Added 0.375 1 5 0.11054 Not Added 0.500 Not Added 0.11054 Not Added 0.500 Not Added 0.10797 Not Added 0.500 Not Added 0.500 8 1 0.500 10 10 0.500 8 8 107
14 15 16 17 F=0.75, -> cat1=0.75, area=1, -> cat1=1, 0.10283 0.13882 F=1, -> cat5=1,cat11=1, 0.11054 Not Added 0.500 F=1,M=0.75, - > cat5=1, 0.14139 Not Added 0.667 F=1,R=0.25, - 18 > cat1=1,cat5=1, 19 F=1, -> cat13=0.75, 20 R=0.25, -> cat13=0.75, 21 M=0.75, -> cat13=0.75, F=1,M=0.5, -> cat1=1, F=0.75, -> 23 cat1=1, 24 F=1,R=0.5, -> cat1=1, R=0.5, -> 25 cat11=1, F=1,M=0.75, - 26 > cat3=1,cat =1, F=1, -> cat1=1 27,cat3=1,cat =1, F=1,area=0.75, 28 -> cat3=1, F=1,R=0.25 29,M=0.75, > cat3=1, F=1,R=0.25, - 30 > cat1=1,cat3=1, F=1,area=0.75, 31 -> cat =1, Unexpected added 0.000 Added 0.375 1 1 Unexpected added 0.000 Added 0.375 1 11 Not Added Not Added 0.500 5 5 0.875 44 25 0.10026 1.000 1.000 43 43 0.14653 Not Added 0.500 Not Added 0.500 13 13 0.11311 Not Added 0.500 Not 0.500 15 15 Added 0.11568 Not Added 0.500 Not 0.750 14 20 Added 0.11311 1.000 1.000 24 24 0.10026 Unexpected added 0.000 Added 0.375 1 17 0.12339 Not Added 0.500 Not 0.656 24 10 Added 0.10540 Unexpected added 0.000 Added 0.250 1 40 0.11054 Not Added 0.500 0.10540 Added 0.333 Not Added Not Added 0.500 28 28 0.583 5 5 0.10540 1.000 1.000 27 27 0.11568 Not Added 0.667 Not Added 0.667 30 30 0.11311 1.000 1.000 29 29 0.10797 1.000 1.000 36 36 108
32 34 35 36 F=1,M=0.75, - > cat1=1,cat =1, F=1,R=0.25,M=0.75, > cat11=1, F=1,R=0.25, - > cat1=1,cat =1, F=1,R=0.25,area=0.75, -> cat 1=1, F=1,R=0.25,M=0.75, > cat1=1, 0.11568 Not Added 0.500 Not Added 0.500 39 39 0.10026 1.000 1.000 40 40 0.10283 1.000 1.000 39 39 0.10540 1.000 1.000 37 37 0.12339 1.000 1.000 42 42 Rule Index rule1 Cluster3: Change mining by (Chen et al, 2005) measures & by Manhattan distance Table 4.29: Generated Rules for period 1 Cluster 3, Change mining by (Chen et al, 2005) measures & Manhattan distance Change Similarity Type -M -M Support Change Type Similarity Sim- Rule- Index1 Sim- Rule- Index2 1 M=1, -> cat11=0.25, 2 M=1, -> cat1=0.75, 3 M=1, -> cat3=0.25, 4 M=1, -> cat3=0.5, 5 M=1, -> cat11=0.5, 6 M=1, -> cat13=0.5, 7 F=0.75,M=1, -> cat3=0.75, 0.10588 0.10588 0.11765 0.12941 0.14118 0.15294 0.12941 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 14 14 1.000 10 10 1.000 1 1 1.000 4 4 1.000 5 5 1.000 8 8 1.000 19 19 109
8 F=0.5,M=1, -> cat5=0.5, 0.12941 Not 0.500 Not 0.656 7 11 9 F=0.75,M=1,area=0.5, -> cat5=0.75, 10 F=0.75,M=1, -> cat =0.75, 0.11765 Not 0.11765 Not 0.667 Not 0.500 Not 0.667 11 11 0.500 6 6 11 R=1,M=1, -> cat5=0.5, 12 R=0.5,M=1, -> cat5=0.75, 0.10588 Not 0.10588 Not 0.500 Not 0.500 Not 0.750 7 21 0.500 11 11 13 M=1, -> cat13=0.25, 14 F=0.25,R=1,M=1, -> cat5=0.25, 15 F=0.75,M=1, -> cat1=0.25, 0.22353 0.12941 Not 0.15294 Not 1.000 0.667 Not 0.500 Not 1.000 9 9 0.667 15 15 0.750 13 13 16 F=0.75,R=0.75,M =1, -> cat5=0.75, 17 M=1,area=1, -> cat5=0.75, 0.10588 Not 0.10588 0.667 Not 1.000 0.667 11 11 1.000 12 12 Table4. 30: Generated Rules for period 2 Cluster 3, Change mining by(chen et al, 2005) measures & Manhattan distance Rule Inde x rule1 Support Change Type Similarity Change Type -M Similarity -M Sim- Rule- Index1 Sim-Rule- Inde x2 1 M=1, -> cat3=0.25, 0.12775 1.000 1.000 3 3 110
2 M=1, -> cat13=0.75, 3 M=1, -> cat 1=0.5, 0.12775 Unexpected purchasing 0.14097 Unexpected purchasing 0.000 Not Added 0.750 1 6 0.000 Not Added 0.750 1 2 4 M=1, -> cat 3=0.5, 5 M=1, -> cat11=0.5, 6 M=1, -> cat11=0.75, 7 M=1, -> cat 5=0.5, 8 M=1, -> cat13=0.5, 9 M=1, -> cat13=0.25, 10 M=1, -> cat1=0.75, 11 F=0.75,M=1, -> cat5=0.75, 12 M=1,area=1, -> cat5=0.75, 13 F=0.25,M=1, -> cat1=0.25, 14 M=1, -> cat11=0.25, 15 F=0.25,M=1,area=1, -> cat5=0.25, 16 R=1,M=1, -> cat3=0.75, 17 M=1, -> cat 3=0.75,cat5=0.25, 18 F=0.5,M=1, -> cat3=0.75, 19 F=0.75,M=1, -> cat3=0.75, 20 M=1,area=1, -> cat3=0.75, R=1,M=1, -> 21 cat5=0.25, 0.14537 1.000 1.000 4 4 0.14978 1.000 1.000 5 5 0.14097 Not Added 0.500 Not Added 0.750 10 5 0.16300 Not Added 0.500 Not Added 0.500 8 8 0.17181 1.000 1.000 6 6 0.18062 1.000 1.000 13 13 0.18943 1.000 1.000 2 2 0.17621 Not Added 0.667 Not Added 0.667 9 9 0.11454 1.000 1.000 17 17 0.11454 Not Added 0.500 Not Added 0.750 15 15 0.22467 1.000 1.000 1 1 0.10132 Not Added 0.667 Not Added 0.667 14 14 0.11013 Not Added 0.500 Not Added 0.500 7 7 0.11013 Added 0.250 Added 0.375 7 4 0.13656 Not Added 0.500 Not Added 0.875 7 7 0.10132 1.000 1.000 7 7 0.12335 Not Added 0.500 Not Added 0.500 7 7 0.13656 Not Added 0.667 Not Added 0.750 14 11 111
Rule index rule1 Cluster4: Change mining by (Chen et al, 2005) measures & by Manhattan distance Table4.31: Generated Rules for period 1 Cluster 4, Change mining by (Chen et al, 2005) measures & Manhattan distance Similarity -M Support Change Type Similarity Change Type -M Sim- Rule- Index 1 Sim- Rule - Index 2 1 M=1, -> cat =0.5, 0.10569 1.000 1.000 1 1 Unexpected Unexpected 2 F=1,M=1, -> cat2=1, 0.10569 purchasing 0.000 purchasing 0.000 1 1 3 F=1,M=1, -> cat3=0.75, 4 F=1,R=0.25,M=1, -> cat1=0.75 5 F=1,M=1, -> cat5=0.75, 6 F=1,M=1, -> cat1=0.25,cat5=1 7 R=0.25,M =1, -> cat1=0.25, F=1,R=0.25,M=1 8,area=0.25, -> cat5=1, 9 F=1,M=1, -> cat11=1,cat 13=0.75, 10 F=1,M=1, -> cat5=1,cat13=0.75, F=1,R=0.25,M=1, -> cat13=0.75, M=1,area=1, -> cat 3=1 12,cat5=1,cat13=1, F=1,area=1, -> cat3=1 13,cat5=1,cat13=1, R=0.25,area=1, -> 14 cat3=1,cat5=1,cat13=1, 0.13008 Not 0.500 Not 0.500 4 4 0.12195 Not 0.667 Not 0.667 12 12 0.14634 Perished 0.250 Perished 0.375 6 4 0.12195 Not 0.10569 Not 0.10569 0.10569 Not 0.14634 0.11382 Not 0.10569 Not 0.10569 Not 0.10569 Not 0. 500 Not 0.750 4 11 0.500 Not 0.500 2 2 1.000 1.000 18 18 0.500 Not 0.583 8 42 1.000 1.000 8 8 0.667 Not 0.667 9 9 0.667 Not 0.667 22 0.500 Not 0.500 79 79 0.500 Not 0.500 78 78 112
15 R=0.25,M =1,area=1, - > cat3=1,cat13=1, 16 F=1,R=0.25,area=1, -> cat3=1,cat13=1, 17 F=1,M=1,area =1, -> cat3=1,cat13=1, 18 R=0.25,M =1,area=1, - > cat5=1,cat13=1, 19 F=1,R=0.25,area=1, -> cat5=1,cat13=1, 20 F=1,M=1,area =1, -> cat5=1,cat13=1, 21 F=1,R=0.25,M=1,area=1, -> cat13=1, R=0.25,M =1,area=1, - > cat5=1,cat11=1, 23 F=1,R=0.25,area=1, -> cat5=1,cat11=1, 24 F=1,M=1,area =1, -> cat5=1,cat11=1, 25 F=1,R=0.25,M=1,area=1, -> cat11=1, 26 R=0.25,M =1,area=1, - > cat3=1,cat5=1, 27 F=1,R=0.25,area=1, -> cat3=1,cat5=1, 28 F=1,M=1,area =1, -> cat3=1,cat5=1, 29 F=1,R=0.25,M=1,area=1, -> cat3=1, 30 F=1,R=0.25,M=1,area=1, -> cat5=1, 31 F=1,M=1,area =0.75, - > cat13=1, 32 F=1,M=1,area =0.75, - > cat11=1, F=1,M=1,area =0.75, - > cat3=1, 34 R=0.25,M =1,area=0.75, -> cat5=1, 35 F=1,M=1,area =0.75, - > cat5=1, 36 F=1,R=0.25,M=1,area=0.5, -> cat1=1, 37 F=1,R=0.25,M=1,area=0.5, -> cat13=1, 0.10569 Not 0.10569 Not 0.10569 Not 0.12195 Not 0.12195 Not 0.12195 Not 0.12195 Not 0.10569 Not 0.10569 Not 0.10569 Not 0.12195 Not 0.13008 Not 0.13008 Not 0.13821 Not 0.13821 Not 0.17886 0.11382 0.13008 Not 0.10569 0.10569 Not 0.16260 Not 0.12195 Not 0.15447 0.667 Not 0.667 81 81 0.667 Not 0.667 81 81 0.667 Not 0.667 81 81 0.667 Not 0.667 82 82 0.667 Not 0.667 82 82 0.667 Not 0.667 82 82 0.750 Not 0.875 26 26 0.667 Not 0.667 87 87 0.667 Not 0.667 87 87 0.667 Not 0.667 87 87 0.500 Not 0.563 16 16 0.667 Not 0.667 22 0.667 Not 0.667 88 88 0.667 Not 0.667 22 0.750 Not 0.875 23 28 1.000 1.000 24 24 1.000 1.000 31 31 0.667 Not 0.833 16 16 1.000 1.000 32 32 0.750 Not 0.750 33 0.750 Not 0.750 33 0.500 Not 0.688 13 13 1.000 1.000 26 26 113
38 F=1,M=1,area =0.5, -> cat3=1,cat11=1, 39 F=1,R=0.25,M=1,area=0.5, -> cat11=1, 40 F=1,R=0.25,M=1,area=0.5, -> cat3=1, 41 F=1,R=0.25,M=1,area=0.5, -> cat5=1, 42 M=1, -> cat1=1,cat 3=1,cat11=1,cat13=1, 43 F=1, -> cat1=1,cat3=1,cat11=1 R=0.25, -> cat1=1,cat3=1,cat11=1,cat13=1, 45 M=1, -> cat1=1,cat 5=1,cat11=1,cat13=1, 46 F=1, -> cat1=1,cat5=1,cat11=1,cat13=1, R=0.25, -> cat1=1 47,cat5=1,cat11=1,cat13=1, 48 R=0.25,M =1, -> cat1=1,cat11=1,cat13=1, 49 F=1,R=0.25, -> cat 1=1,cat11=1,cat13=1, 50 F=1,M=1, -> cat1=1,cat11=1,cat13=1, 51 M=1, -> cat1=1,cat3=1,cat5=1,cat13=1, 52 F=1, -> cat1=1,cat3=1,cat5=1,cat13=1, R=0.25, -> cat1=1 53,cat3=1,cat5=1,cat13=1, 54 R=0.25,M =1, -> cat1=1,cat3=1,cat13=1, F=1,R=0.25, -> cat 1=1,cat3=1,cat13=1, 56 F=1,M=1, -> cat1=1,cat3=1,cat13=1, 57 R=0.25,M =1, -> cat1=1,cat5=1,cat13=1, 58 F=1,R=0.25, -> cat 1=1,cat5=1,cat13=1, 0.10569 Not 0.12195 Not 0.13008 0.13821 0.13821 0.13821 0.13008 0.13821 0.13821 0.13821 0.16260 0.16260 0.17073 0.13821 0.13821 0.13008 0.15447 0.15447 0.17073 0.16260 0.16260 0.667 Not 0.667 86 86 0.500 Not 0.688 16 16 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 28 28 1.000 29 29 1.000 34 34 1.000 35 35 1.000 36 36 1.000 37 37 1.000 38 38 1.000 39 39 1.000 40 40 1.000 41 41 1.000 42 42 1.000 43 43 1.000 44 1.000 45 45 1.000 46 46 1.000 47 47 1.000 48 48 1.000 49 49 1.000 50 50 114
F=1,M=1, -> cat1=1 59,cat5=1,cat13=1, F=1,R=0.25,M=1, -> 60 cat1=1,cat13=1, 61 M=1, -> cat1=1,cat3=1,cat5=1,cat11=1, F=1, -> cat1=1,cat3=1 62,cat5=1,cat11=1, R=0.25, -> cat1=1 63,cat3=1,cat5=1,cat11=1, R=0.25,M =1, -> 64 cat1=1,cat3=1,cat11=1, F=1,R=0.25, -> cat 1=1 65,cat3=1,cat11=1, F=1,M=1, -> cat1=1,cat3=1,cat11=1, R=0.25,M =1, -> cat1=1 67,cat5=1,cat11=1, F=1,R=0.25, -> cat 1=1 68,cat5=1,cat11=1, F=1,M=1, -> cat1=1 69,cat5=1,cat11=1, F=1,R=0.25,M=1, -> 70 cat1=1,cat11=1, R=0.25,M =1, -> cat1=1 71,cat3=1,cat5=1, F=1,R=0.25, -> cat 1=1 72,cat3=1,cat5=1, F=1,M=1, -> cat1=1 73,cat3=1,cat5=1, F=1,R=0.25,M=1, -> 74 cat1=1,cat3=1, F=1,R=0.25,M=1, -> 75 cat1=1,cat5=1, M=1, -> cat3=1,cat 5=1 76,cat11=1,cat13=1, F=1, -> cat3=1,cat5=1,cat11=1,cat13=1, 78 R=0.25, -> cat3=1,cat5=1,cat11=1,cat13=1, R=0.25,M =1, -> 79 cat3=1,cat11=1,cat13=1, 0.17886 0.21138 0.13821 0.13821 0.13821 0.16260 0.16260 0.17886 0.18699 0.18699 0.19512 0.21951 0.16260 0.16260 0.17886 0.19512 0.21951 0.15447 0.15447 0.14634 0.17073 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 51 51 1.000 52 52 1.000 53 53 1.000 54 54 1.000 55 1.000 56 56 1.000 57 57 1.000 58 58 1.000 59 59 1.000 60 60 1.000 61 61 1.000 62 62 1.000 63 63 1.000 64 64 1.000 65 65 1.000 66 1.000 67 67 1.000 68 68 1.000 69 69 1.000 70 70 1.000 71 71 115
80 F=1,R=0.25, -> cat 3=1,cat11=1,cat13=1, 81 F=1,M =1, -> cat3=1,cat11=1,cat13=1, 82 R=0.25,M =1, -> cat5=1,cat11=1,cat13=1, 83 F=1,R=0.25, -> cat 5=1,cat11=1,cat13=1, 84 F=1,M=1, -> cat5=1,cat11=1,cat13=1, 85 F=1,R=0.25,M=1, -> cat11=1,cat13=1, 86 R=0.25,M =1, -> cat3=1,cat5=1,cat13=1, 87 F=1,R=0.25, -> cat 3=1,cat5=1,cat13=1, F=1,M=1, -> cat3=1,cat5=1,cat13=1, 89 F=1,R=0.25,M=1, -> cat3=1,cat13=1, 90 F=1,R=0.25,M=1, -> cat5=1,cat13=1, 91 R=0.25,M =1, -> cat3=1,cat5=1,cat11=1, 92 F=1,R=0.25, -> cat 3=1,cat5=1,cat11=1, 93 F=1,M=1, -> cat3=1,cat5=1,cat11=1, 94 F=1,R=0.25,M=1, -> cat3=1,cat11=1, 95 F=1,R=0.25,M=1, -> cat5=1,cat11=1, 96 F=1,R=0.25,M=1, -> cat3=1,cat5=1, 0.17073 0.18699 0.17886 0.17886 0.18699 0.22764 0.23577 0.23577 0.25203 0.28455 0.31707 0.19512 0.19512 0.22764 0.25203 0.30081 0.30081 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 72 72 1.000 73 73 1.000 74 74 1.000 75 75 1.000 76 76 1.000 77 1.000 78 78 1.000 79 79 1.000 80 80 1.000 81 81 1.000 82 82 1.000 83 83 1.000 84 84 1.000 85 85 1.000 86 86 1.000 87 87 1.000 88 116
Table4.32: Generated Rules for period 2 Cluster 4, Change mining by (Chen et al, 2005) measures & Manhattan distance Change Sim- Type - Rulerule1 Support Similarity M Index1 Rule Index Change Type Similarity -M Sim- Rule- Index2 1 M=1, -> cat11=0.5, 0.10081 1.000 1.000 1 1 2 3 M=1, -> cat1=0.25, 0.11290 Not Added 0.500 Not Added 3 M=1, -> cat 11=0.75 0.12097 Unexpected 0.000 Not, purchasing Added 4 F=1,M =1, -> cat3=0.75,cat5=1, 5 R=0.25,M=1, -> cat3=0.75, 6 M=1, -> cat3=1,cat5=0.75, 7 M=1, -> cat3=1,cat 13=0.75, 8 F=1,M=1, -> cat5=1,cat13=0.75, 9 R=0.25,M=1, -> cat13=0.75, 10 M=1, -> cat1=0.75,cat3=1, F=1,M =1, -> cat1=0.75,cat5=1, 12 R=0.25,M=1, -> cat1=0.75, 13 F=1,M =1,area=0.25, -> cat1=1, 14 F=1,M=1,area=0.25, -> cat13=1, 15 M=1,area=0.25, -> cat3=1,cat11=1, 16 F=1,M=1,area=0.25, > cat =1, 17 F=1,M =1,area=0.25, -> cat3=1, 18 F=1,R=0.25,M=1,area=0.25, -> cat5=1, 0.500 7 7 0.750 1 1 0.10484 Not Added 0.500 Not Added 0.583 3 73 0.10081 Not Added 0.500 Not 0.500 3 3 Added 0.10484 Added 0.250 Not 0.438 5 51 Added 0.11290 Added 0.250 Not 0.438 9 42 Added 0.11290 1.000 1.000 10 10 0.10887 Not Added 0.667 Not Added 0.667 11 0. 11290 Added 0.250 Not 0.438 42 42 Added 0.10081 Not Added 0.500 Not 0.750 6 6 Added 0.10887 Not Added 0.667 Not 0.667 4 4 Added 0.10484 Not Added 0.500 Not 0.688 36 36 Added 0.10484 Not Added 0.667 Not 0.833 31 31 Added 0.10484 Added 0.333 Not 0.583 38 38 Added 0.11694 Not Added 0.667 Not 0.833 32 32 Added 0.12903 Not Added 0.667 Not 0.833 33 Added 0.10081 1.000 1.000 8 8 117
19 M=1,area=1, -> cat 1=1, 20 M=1,area=1, -> cat 13=1, 21 M=1,area=1, -> cat 11=1, M=1,area=1, -> cat3=1,cat5=1, 23 F=1,M =1,area=1, - > cat3=1, 24 F=1,R=0.25,M=1,area=1, -> cat5=1, 25 M=1,area=0.5, -> cat1=1, 26 F=1,R=0.25,M=1,area=0.5, -> cat13=1, 27 M=1,area=0.5, -> cat11=1, 28 F=1,R=0.25,M=1,area=0.5, -> cat3=1, 29 F=1,R=0.25,M=1,area=0.5, -> cat5=1, 30 M=1,area=0.75, -> cat1=1, 31 F=1,M=1,area=0.75, -> cat13=1, 32 F=1,M =1,area=0.75, -> cat3=1, F=1,R=0.25,M=1,area=0.75, -> cat5=1, 34 M=1, -> cat 1=1,cat3=1,cat =1,cat13=1, 35 F=1, -> cat1=1,cat3=1,cat11=1,cat13=1, 36 R=0.25, -> cat1=1,cat3=1,cat11=1,cat13=1, 37 M=1, -> cat 1=1,cat5=1,cat =1,cat13=1, 38 F=1, -> cat1=1,cat5=1,cat11=1 0.10081 Added 0.250 Added 0.375 36 36 0.10081 Not Added 0.500 Not Added 0.583 21 31 0.10081 Not Added 0.500 Not 0.583 25 32 Added 0.10081 Not Added 0.667 Not 0.667 12 12 Added 0.10081 Not Added 0.750 Not 0.917 29 Added 0.10081 1.000 1.000 30 30 0.10484 Not Added 0.500 Not 0.500 36 36 Added 0.10887 1.000 1.000 37 37 0.10081 Not Added 0.500 Not 0.583 39 32 Added 0.10887 1.000 1.000 40 40 0.12097 1.000 1.000 41 41 0.10081 Added 0.250 Not 0.438 36 36 Added 0.10081 1.000 1.000 31 31 0.11694 1.000 1.000 33 0.11694 Not Added 0.750 Not Added 0.938 8 30 0.12903 1.000 1.000 42 42 0.12500 1.000 1.000 43 43 0.10484 1.000 1.000 44 0.13710 1.000 1.000 45 45 0.13306 1.000 1.000 46 46 118
,cat13=1, 39 R=0.25, -> cat1=1,cat5=1,cat11=1,cat13=1, 40 R=0.25,M =1, -> cat1=1,cat11=1,cat13=1, 41 F=1,R=0.25, -> cat1=1,cat =1,cat13=1, 42 F=1,M =1, -> cat1=1,cat11=1,cat13=1, M=1, -> cat1=1 43,cat3=1,cat5=1,cat13=1, F=1, -> cat1=1,cat3=1,cat5=1,cat13=1, 45 R=0.25, -> cat1=1,cat3=1,cat5=1,cat13=1, 46 R=0.25,M=1, -> cat1=1,cat3=1,cat13=1, 47 F=1,R=0.25, -> cat1=1,cat3=1,cat13=1, 48 F=1,M =1, -> cat1=1,cat3=1,cat13=1, 49 R=0.25,M=1, -> cat1=1,cat5=1,cat13=1, 50 F=1,R=0.25, -> cat1=1,cat5=1,cat13=1, 51 F=1,M =1, -> cat1=1,cat5=1,cat13=1, 52 F=1,R=0.25,M=1, - > cat1=1,cat13=1, 53 M=1, -> cat1=1,cat3=1,cat5=1,cat =1, 54 F=1, -> cat1=1,cat3=1,cat5=1,cat =1, 0.11694 1.000 1.000 47 47 0.13306 1.000 1.000 48 48 0.13306 1.000 1.000 49 49 0.16532 1.000 1.000 50 50 0.12097 1.000 1.000 51 51 0.11694 1.000 1.000 52 52 0.10887 1.000 1.000 53 53 0.13306 1.000 1.000 54 54 0.13306 1.000 1.000 55 0.15323 1.000 1.000 56 56 0.14919 1.000 1.000 57 57 0.14919 1.000 1.000 58 58 0.16532 1.000 1.000 59 59 0.18548 1.000 1.000 60 60 0.13306 1.000 1.000 61 61 0.12903 1.000 1.000 62 62 119
R=0.25, -> cat1=1,cat3=1,cat5=1,cat =1, 56 R=0.25,M=1, -> cat1=1,cat3=1,cat =1, 57 F=1,R=0.25, -> cat1=1,cat3=1,cat =1, 58 F=1,M =1, -> cat1=1,cat3=1,cat =1, 59 R=0.25,M=1, -> cat1=1,cat5=1,cat =1, F=1,R=0.25, -> 60 cat1=1,cat5=1,cat =1, 61 F=1,M =1, -> cat1=1,cat5=1,cat =1, 62 F=1,R=0.25,M=1, - > cat1=1,cat =1, 63 R=0.25,M=1, -> cat1=1,cat3=1,cat5=1, 64 F=1,R=0.25, -> cat1=1,cat3=1,cat5=1, 65 F=1,M=1, > cat1=1,cat3=1,cat5=1, F=1,R=0.25,M=1, - > cat1=1,cat3=1, 67 F=1,R=0.25,M=1, - > cat1=1,cat5=1, 68 M=1, -> cat 3=1,cat5=1,cat =1,cat13=1, 69 F=1, -> cat3=1,cat5=1,cat11=1,cat13=1, 70 R=0.25, -> cat3=1,cat5=1,cat11=1,cat13=1, 71 R=0.25,M=1, -> cat3=1,cat11=1,cat13=1, 0.10484 1.000 1.000 63 63 0.13306 1.000 1.000 64 64 0.12903 1.000 1.000 65 65 0.18145 1.000 1.000 66 0.13710 1.000 1.000 67 67 0.13710 1.000 1.000 68 68 0.16935 1.000 1.000 69 69 0.16129 1.000 1.000 70 70 0.14113 1.000 1.000 71 71 0.14113 1.000 1.000 72 72 0.16935 1.000 1.000 73 73 0.18145 1.000 1.000 74 74 0.19758 1.000 1.000 75 75 0.15323 1.000 1.000 76 76 0.14113 1.000 1.000 77 0.12903 1.000 1.000 78 78 0.15323 1.000 1.000 79 79 120
72 F=1,R=0.25, -> cat3=1,cat =1,cat13=1, 73 F=1,M =1, -> cat3=1,cat11=1,cat13=1, 74 R=0.25,M=1, -> cat5=1,cat11=1,cat13=1, 75 F=1,R=0.25, -> cat5=1,cat =1,cat13=1, 76 F=1,M =1, -> cat5=1,cat11=1,cat13=1, F=1,R=0.25,M=1, - > cat11=1,cat13=1, 78 R=0.25,M=1, -> cat3=1,cat5=1,cat13=1, 79 F=1,R=0.25, -> cat3=1,cat5=1,cat13=1, 80 F=1,M =1, -> cat3=1,cat5=1,cat13=1, 81 F=1,R=0.25,M=1, - > cat3=1,cat13=1, 82 F=1,R=0.25,M=1, - > cat5=1,cat13=1, 83 R=0.25,M=1, -> cat3=1,cat5=1,cat =1, 84 F=1,R=0.25, -> cat3=1,cat5=1,cat =1, 85 F=1,M =1, -> cat3=1,cat5=1,cat =1, 86 F=1,R=0.25,M=1, - > cat3=1,cat =1, 87 F=1,R=0.25,M=1, - > cat5=1,cat =1, 88 F=1,R=0.25,M=1, - > cat3=1,cat5=1, 0.14919 1.000 1.000 80 80 0.18548 1.000 1.000 81 81 0.16935 1.000 1.000 82 82 0.16532 1.000 1.000 83 83 0.18952 1.000 1.000 84 84 0.18952 1.000 1.000 85 85 0.19355 1.000 1.000 86 86 0.18952 1.000 1.000 87 87 0.21371 1.000 1.000 88 0.22984 1.000 1.000 89 89 0.26613 1.000 1.000 90 90 0.16935 1.000 1.000 91 91 0.16532 1.000 1.000 92 92 0.20161 1.000 1.000 93 93 0.19758 1.000 1.000 94 94 0.22984 1.000 1.000 95 95 0.29032 1.000 1.000 96 96 121
Here, for a clearer explanation, we compare the similarity of two rules as measured by (Chen et al, 2005) and by our modified measure with Manhattan distance. For example, in cluster 1 we have the following two rules:

T2-r5: R=0.75 -> cat1=0.25
T1-r4: F=0.25, R=1 -> cat1=0.25
(Chen et al, 2005)'s similarity = 0.000; our similarity = 0.375

With the first measure, the similarity becomes zero because R has different values in the two time snapshots, even though both rules have R in their LHS. We calculated the difference based on the distance between the two R values in the two rules and so gain more information. Another example, in cluster 4:

T2-r3: M=1 -> cat11=0.75
T1-r1: M=1 -> cat11=0.5
(Chen et al, 2005)'s similarity = 0.000; our similarity = 0.750

Again, with (Chen et al, 2005)'s measure the similarity becomes zero, although cat11 is in both RHSs. We calculated the difference based on the distance between the two cat11 values in the two rules and gain more information, because the similarity of these rules is no longer zero.

In this chapter we explained the steps that we took to mine changes in customer behavior. Our contribution in this study is using Manhattan distance to gain more information from the rules and increase the accuracy of the change measures. On average, we have a 6.65% improvement in the change mining measures by using Manhattan distance.
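The two comparisons above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the exact formula of section 3.12: we assume the similarity is the product of the share of matching LHS attributes, the average closeness of their values, and the closeness of the RHS values. Closeness is taken as 1 minus the absolute (Manhattan) difference for the modified measure and as exact equality for the measure of (Chen et al, 2005). The function name `similarity` and the rule representation are hypothetical.

```python
def similarity(rule_a, rule_b, graded=True):
    """Similarity of two rules, each given as (lhs, rhs).

    lhs maps attribute names to discretized values in [0, 1];
    rhs is a single (attribute, value) pair.
    graded=True  -> closeness is 1 - |u - v| (assumed Manhattan-style measure)
    graded=False -> closeness is 1 only for equal values (exact matching)
    """
    (lhs_a, rhs_a), (lhs_b, rhs_b) = rule_a, rule_b

    def closeness(u, v):
        return 1.0 - abs(u - v) if graded else float(u == v)

    common = set(lhs_a) & set(lhs_b)
    # No shared LHS attribute or a different RHS attribute: no similarity.
    if not common or rhs_a[0] != rhs_b[0]:
        return 0.0
    attribute_match = len(common) / max(len(lhs_a), len(lhs_b))
    value_match = sum(closeness(lhs_a[k], lhs_b[k]) for k in common) / len(common)
    return attribute_match * value_match * closeness(rhs_a[1], rhs_b[1])


# The two rule pairs from the examples above.
t2_r5 = ({"R": 0.75}, ("cat1", 0.25))
t1_r4 = ({"F": 0.25, "R": 1.0}, ("cat1", 0.25))
t2_r3 = ({"M": 1.0}, ("cat11", 0.75))
t1_r1 = ({"M": 1.0}, ("cat11", 0.5))

print(similarity(t2_r5, t1_r4, graded=False))  # 0.0
print(similarity(t2_r5, t1_r4))                # 0.375
print(similarity(t2_r3, t1_r1, graded=False))  # 0.0
print(similarity(t2_r3, t1_r1))                # 0.75
```

In the first pair only one of two LHS attributes is shared and the R values differ by 0.25, which gives 0.5 x 0.75 = 0.375. In the second pair the LHS is identical and the cat11 values differ by 0.25, which gives 0.75.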
Chapter 5: Conclusion, further research
Conclusion
Our contribution
Limitation
Managerial implication
Future works
5.1 Conclusion:
In this study, we mined the purchasing behavior of the customers of Kalleh Distribution Company. The world around us changes constantly, and knowing and adapting to changes is an important aspect of our lives. For businesses, knowing what is changing and how it has changed is also crucial (Liu et al, 2000). One of the most important aspects of surviving in a dynamic market is to know and adapt to changes happening in customer behavior. For a Fast Moving Consumer Goods (FMCG) distribution company like Kalleh, this issue is important. Kalleh faces the challenge of increasing competition. There is a variety of FMCG products, distribution companies and strategies, so in such a market customer behavior may change both in response to companies' strategies and because the customers' own needs change. To deal with these problems, Kalleh Company wants to find the changes happening in the market by analyzing purchase transaction data.

To mine changes, we compare customer purchasing behavior during two periods. The purpose of this study is to mine changes in customer purchasing behavior. To reach this goal we need to build customer purchasing patterns based on the customer, product and transaction data collected in databases. Data mining techniques can help us to reach this goal. Change mining has several steps: data collection, data pre-processing, customer segmentation based on RFM and the Customer Value Matrix, building customer behavior patterns by mining association rules, and finally comparing the generated association rules by two measures, similarity and unexpectedness. The research process of this study is shown in figure 3.3. This process was constructed based on previous work in the literature.

In the data collection phase, we gathered data from two years of transactions of Kalleh Distribution Company.
The data pre-processing phase has the following steps. Data cleaning is the pre-processing step that removes noisy or inconsistent data. In this study, the noisy data are the customers who belong to Kalleh Company itself. During the two periods that we analyze there were 2499 customers, of which 42 belong to Kalleh Company, so we removed them from the database. The total number of customers after removing the noisy data was 2457.

The data transformation phase has two tasks. In the first, we did generalization: we built 6 groups of products based on expert opinion, as shown in fig 4.1. In the second, we built the RFM variables. To calculate RFM, we first divided our dataset into two time snapshots, one between '1383/07/01' and '1384/06/31' as
period one (t1) and the second between '1384/07/01' and '1385/06/31' as period two (t2). We defined recency as the interval between a customer's last date of purchase and the last date of each period. This means that the evaluation dates for the two time snapshots are '1384/06/31' and '1385/06/31'. For frequency and monetary, we aggregated the transaction data to calculate the total number of purchases and the total amount spent during each period. The market segmentation of (Marcus, C., 1998) requires the average purchase amount of each customer, so we divided the total purchase amount by the total number of purchases to calculate the average amount of each purchase.

Customer segmentation is the next step in this study. Following (Marcus, C., 1998), we divided the customers in each period into four clusters: uncertain, frequent, spender and best. The Customer Value Matrix has two axes. The calculation steps of the Customer Value Matrix and its result are in the following section. The results of the market segmentation are shown in table 4.4 and table 4.5.

The next step is customer behavior mining. In this phase, we applied association rules to analyze the patterns of customer behavior in the different time snapshots for each customer cluster. There are other methods for change mining in the literature, such as decision trees, but we chose association rules because, according to (Song et al, 2001), they allow us to detect complete sets of changes. Association rules work with discrete variables; therefore, we first need to do discretization. We used equal frequency binning to discretize the RFM data. To discretize the area, we defined four groups based on the opinion of market experts and their knowledge of the area. We also discretized the number of purchases for each product category.
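The RFM construction and the segmentation described above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually run in the study. It assumes that the two axes of the Customer Value Matrix are the number of purchases and the average purchase amount, each split at its mean. It also uses hypothetical Gregorian dates and toy amounts in place of the Persian calendar dates of the Kalleh data. The names `build_rfm` and `customer_value_matrix` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def build_rfm(transactions, period_end):
    """Aggregate (customer, purchase_date, amount) rows into RFM values."""
    last, count, total = {}, defaultdict(int), defaultdict(float)
    for customer, day, amount in transactions:
        last[customer] = max(last.get(customer, day), day)
        count[customer] += 1
        total[customer] += amount
    return {
        c: {
            "recency": (period_end - last[c]).days,  # days since last purchase
            "frequency": count[c],                   # number of purchases
            "monetary": total[c],                    # total amount spent
            "avg_purchase": total[c] / count[c],     # average amount per purchase
        }
        for c in count
    }


def customer_value_matrix(profiles):
    """Label each customer by comparing both axes with their means (assumed split)."""
    n = len(profiles)
    mean_freq = sum(p["frequency"] for p in profiles.values()) / n
    mean_avg = sum(p["avg_purchase"] for p in profiles.values()) / n
    labels = {}
    for customer, p in profiles.items():
        high_freq = p["frequency"] >= mean_freq
        high_avg = p["avg_purchase"] >= mean_avg
        if high_freq and high_avg:
            labels[customer] = "best"
        elif high_freq:
            labels[customer] = "frequent"
        elif high_avg:
            labels[customer] = "spender"
        else:
            labels[customer] = "uncertain"
    return labels


# Toy transactions for one period (hypothetical data).
transactions = [
    ("A", date(2005, 3, 1), 100.0), ("A", date(2005, 6, 1), 100.0),
    ("A", date(2005, 8, 1), 100.0), ("A", date(2005, 9, 1), 100.0),
    ("B", date(2005, 2, 1), 900.0),
    ("C", date(2005, 4, 1), 50.0),
    ("D", date(2005, 1, 5), 600.0), ("D", date(2005, 5, 5), 600.0),
    ("D", date(2005, 9, 11), 600.0),
]
profiles = build_rfm(transactions, period_end=date(2005, 9, 21))
segments = customer_value_matrix(profiles)
print(profiles["A"]["recency"], segments)
```

With these toy values the mean number of purchases is 2.25 and the mean average purchase amount is 412.5, so customer A falls in the frequent cluster, B in spender, C in uncertain and D in best.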
In this phase we built association rules that have customer profile data and RFM variables on the left hand side and the purchased products on the right hand side. The minimum support and confidence were 17%. In the Apriori algorithm we also used maximal frequent itemsets. After building the association rules, we compared them to mine changes in customer behavior. We used two measures from (Chen et al, 2005), similarity and unexpectedness, which evaluate how similar or different two rules are. We calculated the changes for each cluster in the two periods; they are shown in chapter 4.
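To make the thresholds concrete, the support and confidence of a single rule can be computed as in the sketch below. It is a minimal illustration with hypothetical customer records. The helper `rule_metrics` is not part of the study, and the code does not implement the Apriori algorithm itself.

```python
def rule_metrics(records, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs over the records.

    Each record, lhs and rhs is a dict of attribute -> discretized value.
    """
    matches_lhs = [r for r in records if lhs.items() <= r.items()]
    matches_both = [r for r in matches_lhs if rhs.items() <= r.items()]
    support = len(matches_both) / len(records)
    confidence = len(matches_both) / len(matches_lhs) if matches_lhs else 0.0
    return support, confidence


# Hypothetical discretized customer records for one cluster and one period.
records = [
    {"F": 1.0, "R": 0.25, "cat1": 1},
    {"F": 1.0, "R": 0.25, "cat1": 1},
    {"F": 1.0, "R": 0.5, "cat1": 0},
    {"F": 0.5, "R": 0.25, "cat1": 1},
    {"F": 1.0, "R": 0.25, "cat1": 0},
]

support, confidence = rule_metrics(records, {"F": 1.0, "R": 0.25}, {"cat1": 1})
MIN_THRESHOLD = 0.17  # minimum support and confidence stated in the text
print(support, confidence, support >= MIN_THRESHOLD and confidence >= MIN_THRESHOLD)
```

Here three records match the left hand side and two of them also match the right hand side, so the support is 2/5 and the confidence is 2/3. Both exceed the 17% threshold, so the rule would be kept.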
5.2 Our contribution:
The next step was to build customer behavior patterns whose RHS states the number of purchases per product category, instead of just which products were bought. Using ordinal numbers instead of the binary values bought or not bought brings more information: when we compare the values of the common attributes in the LHS and RHS of two rules, we can find the differences and similarities between the rules more accurately. This time the minimum support and confidence of the rules were 10%. Changes in the generated rules were then mined with the measures of (Chen et al, 2005) and with the Manhattan distance formula in chapter 3, section 3.12. The results showed that the accuracy of change mining improves by 6.5% on average, which is our contribution in this study.

5.3 Limitation:
This research has some limitations. One of them is finding a good database that stores useful attributes. In our study we needed demographic variables of the customers, but in the database we found only the geographic area of each customer to work with.

5.4 Managerial Implication:
In this section, we summarize the opportunities for using the change mining methodology. The findings of this study have implications for many businesses, such as distribution companies. These companies work in a dynamic environment, and their customers are influenced by different internal and external factors. In such an environment, knowing the changes and adapting to them is crucial for businesses. At the macro level, business managers can follow the trends of the market in order to provide suitable products and services for their customers (Liu et al, 2000). If marketing managers find the changes in the market, they can identify the reasons for the changes in time and react to them appropriately. At the micro level, change mining can help managers to better understand their customers' needs through their behavior and to design additional niche marketing campaigns (Song et al, 2001).
Change detection is more suitable in dynamic domains where human intervention is high. Another application of change mining is analyzing the effectiveness of marketing campaigns. Change mining can also be used in manufacturing to monitor changes and control quality factors: changes in various measures of product quality can be properly controlled (Song et al, 2001). Change mining can play an important role especially in the FMCG market, where competition is high. Also, because of the huge amount of data recorded in
these companies' databases, using data mining methods like change mining brings out hidden and useful information from the data. We believe that the change detection problem will become more important as more data mining applications are implemented.

5.5 Future works:
In this study, we built the rules from only the RFM variables and a geographic variable, because the Kalleh database stores only these variables. Therefore further research may use other demographic variables, such as the type of the customers. In this research, for change mining we compare each rule in one time snapshot with all of the rules in the other time snapshot. Therefore further research could make this comparison more efficient.

References:
Adomavicius, G. & Tuzhilin, A., (2001), Using data mining methods to build customer profiles, IEEE Computer, Volume 34, Issue 2, pp. 74-82.
Agrawal, R., Imielinski, T. & Swami, A., (1993), Mining association rules between sets of items in large databases, Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 207-216.
Agrawal, R. & Psaila, G., (1995), Active Data Mining, First International Conference on Knowledge Discovery and Data Mining (KDD-95), pp. 3-8.
Agrawal, R. & Srikant, R., (1994), Fast algorithms for mining association rules, Proceedings of the International Conference on Very Large Databases, VLDB-94, pp. 487-499.
Ayad, A. M., (2000), Incremental mining of constrained association rules, Master Thesis, Alexandria University, Faculty of Engineering.
Bay, D.S. & Pazzani, M.J., (1999), Detecting Change in Categorical Data: Mining Contrast Sets, Knowledge Discovery and Data Mining, pp. 302-306.
Berry, M.J.A. & Linoff, G.S., (2004), Data mining techniques for marketing, sales and customer relationship management (2nd edn), Indiana, Indianapolis publishing Inc.
Bolton, R.J., Hand, D.J. & Crowder, M., (2004), Significance tests for unsupervised pattern discovery in large continuous multivariate data sets, Computational Statistics & Data Analysis, Volume 46, Number 1, pp. 57-79.
Böttcher, M., Nauck, D., Borgelt, C. & Kruse, R., (2006), A framework for discovering interesting business changes from data, BT Technology Journal, Volume 24, Issue 2, pp. 219-228.
Brin, S., et al., (1997), Dynamic Itemset Counting and Implication Rules for Market Basket Data, Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, pp. 255-264.
Chen, M.C., Chiu, A.L. & Chang, H.H., (2005), Mining changes in customer behavior in retail marketing, Expert Systems with Applications, Volume 28, Issue 4, pp. 773-781.
Cho, Y.B., Cho, Y.H. & Kim, S.H., (2005), Mining changes in customer buying behavior for collaborative recommendations, Expert Systems with Applications, Volume 28, pp. 359-369.
Dong, G. & Li, J., (1999), Efficient mining of emerging patterns: discovering trends and differences, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 43-52.
Dunham, M.H., Xiao, Y., Gruenwald, Y.L. & Hossain, Z., (2001), A survey of association rule mining, ACM Survey Journal (submitted), available at http://www2.cs.uh.edu/~ceick/6340/grue-assoc.pdf
Feelders, A.J., Daniels, H.A.M. & Holsheimer, M., (2000), Methodological and practical aspects of datamining, Information & Management, Volume 37, pp. 271-281.
Han, J. & Kamber, M., (2006), Data Mining: Concepts and Techniques, San Francisco, Morgan Kaufmann Publishers.
Hossein Javaheri, S., (2008), Response Modeling in Direct Marketing: a data mining based approach for target selection, Master's thesis, epubl.luth.se/1653-0187/2008/014/ltu-pb-ex-08014-se.pdf
Kantardzic, M., (2003), Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons.
Larose, D.T., (2006), Data mining methods and models, Hoboken, New Jersey, John Wiley & Sons, Inc.
Li, J., Dong, G. & Ramamohanarao, K., (2000), Instance-Based Classification by Patterns, Principles of Data Mining and Knowledge Discovery, pp. 191-200.
Li, X.B., (2005), A scalable decision tree system and its application in pattern recognition and intrusion detection, Decision Support Systems, Volume 41, Issue 1, pp. 112-130.
Liu, B. & Hsu, W., (1996), Post analysis of learnt rules, AAAI-96.
Liu, B., Hsu, W., Han, H.S. & Xia, Y., (2000), Mining changes for real-life applications, Lecture Notes in Computer Science, Proceedings of the Second International Conference on Data Warehousing and Knowledge Discovery, Volume 1874, pp. 337-346.
Liu, H., Hussain, F., Tan, C.L. & Dash, M., (2002), Discretization: An Enabling Technique, Data Mining and Knowledge Discovery, Volume 6, Number 4, pp. 393-423.
Malhotra, K.N., (1996), Marketing Research: an applied orientation, India, Pearson Education.
Marcus, C., (1998), A practical yet meaningful approach to customer segmentation, Journal of Consumer Marketing, Volume 15, Issue 5, pp. 494-504.
Miglautsch, J.R., (2001), Thoughts on RFM Scoring, Journal of Database Marketing, Volume 8, Issue 1, pp. 67-72.
Min, S.H. & Han, I., (2005), Detection of the customer time-variant pattern for improving recommender systems, Expert Systems with Applications, Volume 28, Issue 2, pp. 189-199.
Nemati, H.R. & Barko, C.D., (2003), Key factors for achieving organizational datamining success, Industrial Management & Data Systems, Volume 103, Issue 4, pp. 282-292.
Novo, J., (2008), Drilling Down: Turning Customer Data into Profits with a Spreadsheet, www.jimnovo.com
Park, J.S., Chen, M.-S. & Yu, P.S., (1995), An Effective Hash Based Algorithm for Mining Association Rules, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, pp. 175-186.
Savasere, A., Omiecinski, E. & Navathe, S., (1995), An Efficient Algorithm for Mining Association Rules in Large Databases, Proc. 21st Int'l Conf. Very Large Data Bases, Morgan Kaufmann, San Francisco, pp. 432-444.
Saunders, M., Lewis, P. & Thornhill, A., (2000), Research Methods for Business Students, England, Pearson Education Limited.
Silberschatz, A. & Tuzhilin, A., (1996), What makes patterns interesting in knowledge discovery systems?, IEEE Transactions on Knowledge and Data Engineering, Volume 8, Issue 6, pp. 970-974.
Song, H.S., Kim, J.K. & Kim, S.H., (2001), Mining the change of customer behavior in an internet shopping mall, Expert Systems with Applications, Volume 21, Issue 3, pp. 157-168.
Su, J.H. & Lin, W.Y., (2004), CBW: an efficient algorithm for frequent itemset mining, Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 5-8 Jan. 2004, 9 pp.
Thomas, S., Bodagala, S., Alsabti, K. & Ranka, S., (1997), An efficient algorithm for the incremental updation of association rules in large databases, Knowledge Discovery and Data Mining, pp. 263-266.
Toivonen, H., (1996), Sampling Large Databases for Association Rules, Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 134-145.
Tsai, C.Y. & Chiu, C.-C., (2004), A purchase-based market segmentation methodology, Expert Systems with Applications, Volume 27, Issue 2, pp. 265-276.
Twocrows.com, (2005), www.twocrows.com/intro-dm.pdf
Yin, R.K., (1994), Case Study Research, design and methods (2nd edn), California, Thousand Oaks, Sage publication, Inc.
Zaki, M.J., Parthasarathy, S., Ogihara, M. & Li, W., (1997), New Parallel Algorithms for Fast Discovery of Association Rules, Data Mining and Knowledge Discovery, Volume 1, Number 4, pp. 343-373.
Zhao, Q. & Bhowmick, S.S., (2003), Association Rule Mining: A Survey, Technical Report, CAIS, Nanyang Technological University, Singapore, No. 2003116, 200
Wu, J. & Lin, Z., (2005), Research on Customer Segmentation Model by Clustering, ACM International Conference Proceeding Series, Proceedings of the 7th international conference on Electronic commerce, Vol. 113, pp. 316-318.

Softwares:
R Software, 2007, version 2.7.0, www.r-project.com
SQL server, 2000, www.microsoft.com