MASTER'S THESIS. Mining Changes in Customer Purchasing Behavior


MASTER'S THESIS 2009:097
Mining Changes in Customer Purchasing Behavior - a Data Mining Approach
Samira Madani
Luleå University of Technology
Master Thesis, Continuation Courses
Marketing and e-commerce
Department of Business Administration and Social Sciences
Division of Industrial marketing and e-commerce
2009:097 - ISSN: ISRN: LTU-PB-EX--09/097--SE

Abstract:
The world around us is changing all the time. For businesses, knowing what is changing and how it has changed is crucial. One of the most important aspects of surviving in a dynamic market is to know and adapt to changes in customer behavior. For a Fast Moving Consumer Goods (FMCG) distribution company, this issue is even more important. Because of the variety of FMCG products, distribution companies and their different strategies, the purchasing behavior of customers may change many times during a period, and the competition becomes tougher. The purpose of this study is to help Kalleh Company, a manufacturer and distributor of food products in the Iranian market, to mine changes in its customers' behavior. Mining changes involves several steps: data collection, data preprocessing, customer segmentation, mining customer behavior patterns and change mining. For customer segmentation, we use the Customer Value Matrix. For mining behavior patterns, we use the Apriori algorithm and maximal frequent itemsets. Based on the literature, there are different kinds of changes: added/ rules, emerging patterns and unexpected changes. There are also two measures, similarity and unexpectedness, for measuring change. In this study, we first calculate changes using these measures as defined in the literature. We then modify the measures to calculate the difference between ordinal attributes, so that their information is included in the calculation of changes. Our contribution is this modification of the change measures, which brings more information and higher accuracy to change mining. The results are presented in Chapter 4. Marketing managers can apply the detected changes to respond accurately and in time to changes in the market. In addition, they can use them to evaluate different marketing campaigns, to build stronger relationships with their customers and to know the market better.
Mining changes has many implications for the macro and micro aspects of businesses, and also for marketing campaigns and manufacturing.

Abstract
Chapter 1: Introduction
    Background of the study
    Problem definition
    Purpose of this study
    Research question
    Research motivation
    Demarcation
    Research outline
Chapter 2: Literature Review
    Mining Customer Behavior
    Review of Data Mining
    Data mining: in brief
    Data mining Functions
    Classification in brief
    Clustering in brief
    Association Rules in Brief
    Association Rule Mining Review
    Association Rule mining problem
    Apriori Algorithm
    Association Rule Mining Approaches
    Apriori Approach
    Mining Changes Literature Review
    Customer segmentation review
    Clustering Analysis
    Customer Segmentation Model
    RFM Model
    RFM Scoring
    Customer Value Matrix Model

Chapter 3: Research Methodology
    Research Methodology
    Research Design
    Research Purpose
    Research Approaches
    Research Strategy
    Research process
    Data Collection and Description
    Data Pre-Processing
    Customer Segmentation
    Customer Value Matrix An effective analytical tool
    Customer Value Matrix Methodology
    Mining Customer Behavior
    Association Rule Mining
    Apriori algorithm
    Change Mining
    Change Mining
Chapter 4: Results & Analysis
    Data preprocessing result
    Data Cleaning
    Data Transformation result
    Customer segmentation (in sql server
    Customer Value Matrix Result
    Customer Behavior Mining
    Discretization Result
    Association Rule Mining Results
    Change Mining

    4.4.1 Some examples of change pattern
    Association rules and changes based (Chen et al, 2005)
    Rules with discrete variables in RHS
    Change mining with Manhattan distance
Chapter 5: Conclusion, further research
    Conclusion
    Our contribution
    Limitation
    Managerial Implication
    Future works
References

List of tables
Table 2.1: Factors for classification of ARM
Table 2.2: Mining in a changing environment timetable
Table 3.1: Data collected from Kalleh Company
Table 3.2: calculating variables for customer value matrix
Table 4.1: RFM table fields
Table 4.2: calculating variables for customer value matrix
Table 4.3: calculating variables for customer value matrix
Table 4.4: segment information in for period
Table 4.5: segment information in for period
Table 4.6: R quantile
Table 4.7: M quantile
Table 4.8: F quantile

Table 4.9: Area quantile
Table 4.10: Generated rule summary
Table 4.11: Generated Rules for period 1 Cluster
Table 4.12: Generated Rules for period 2 Cluster 1
Table 4.13: Generated Rules for period 1 Cluster 2
Table 4.14: Generated Rules for period 2 Cluster 2
Table 4.15: Generated Rules for period 1 Cluster 3
Table 4.16: Generated Rules for period 2 Cluster 3
Table 4.17: Generated Rules for period 1 Cluster 4
Table 4.18: Generated Rules for period 2 Cluster 4
Table 4.19: Cat1 quantile
Table 4.20: Cat2 quantile
Table 4.21: Cat3 quantile
Table 4.22: Cat5 quantile
Table 4.23: Cat11 quantile
Table 4.24: Cat13 quantile
Table 4.25: Generated Rules for period 1 Cluster 1, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.26: Generated Rules for period 2 Cluster 1, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.27: Generated Rules for period 1 Cluster 2, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.28: Generated Rules for period 2 Cluster 2, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.29: Generated Rules for period 1 Cluster 3, Change mining by (Chen et al, 2005) measures & Manhattan distance

Table 4.30: Generated Rules for period 2 Cluster 3, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.31: Generated Rules for period 1 Cluster 4, Change mining by (Chen et al, 2005) measures & Manhattan distance
Table 4.32: Generated Rules for period 2 Cluster 4, Change mining by (Chen et al, 2005) measures & Manhattan distance

List of figures:
Figure 2.1: Knowledge Discovery in Database Processes
Figure 2.2: the major steps in data mining process
Figure 2.3: Classification of DM techniques
Figure 2.4: Classic Problem of association rule mining
Figure 2.5: Mining in a changing environment review
Figure 2.6: Customer Value Matrix
Figure 3.1: Research design of this study
Figure 3.2: Change mining process perspective
Figure 3.3: Change mining process
Figure 3.4: Change mining process in detail
Figure 3.5: Product categories of Kalleh company
Figure 3.6: customer value matrix
Figure 4.1: generalized product category
Figure 4.2: The Customer Value Matrix
Figure 4.3: R histogram
Figure 4.4: M histogram
Figure 4.5: F histogram
Figure 4.6: Area histogram
Figure 4.7: Cat1 histogram

Figure 4.8: Cat2 histogram
Figure 4.9: Cat3 histogram
Figure 4.10: Cat5 histogram
Figure 4.11: Cat11 histogram
Figure 4.12: Cat13 histogram

Chapter 1: Introduction
    Background of the study
    Problem definition
    Purpose of this study
    Research question
    Research motivation
    Research demarcation
    Research outline

1.1 Background of the study:
The world around us changes continuously. Knowing and adapting to changes is an important aspect of our lives. For businesses, knowing what is changing and how it has changed is also essential (Liu et al, 2000). One of the most important aspects of surviving in a dynamic market is to know and adapt to changes in customer behavior. Moreover, in recent years there has been explosive growth in the amount of information (Min, S., H., Han, I., 2005). In general, Fast Moving Consumer Goods (FMCG) distribution companies collect huge amounts of data about their customers and their purchasing transactions. In this data, we can find interesting hidden information about the customers and their behaviors.

The traditional approach to marketing decisions about promotions, campaigns and market research in FMCG distribution companies is to rely mainly on internal expert opinion. These experts include the marketing managers and the sales managers, who are in constant touch with the salespeople and merchandisers who bring them market information. However, this kind of decision making ignores the customer data and customer behavior. Furthermore, in today's world, where the market is highly competitive and the number of products is overwhelming, customers face various products and various providers with different marketing strategies (Hossein Javaheri, S., 2008). In such a dynamic market, customer behavior changes all the time (Chen et al, 2005). When the marketing manager becomes aware of some change in the market through the sales team, he or she has no idea how and where to start understanding the change and its reasons. The result is a broad, time-consuming and costly market research project whose results may not reach the marketing department in time to react to the change.

Also, in such a market there are so many promotion campaigns by the company itself and by competitors that it is difficult to analyze their effectiveness. So, in a competitive environment, there is a need to mine customer data and transactions to find changes in customer purchasing behavior, which is an effective and efficient way to respond to customer needs accurately and in time. As a result, many FMCG distribution companies in Iran are trying to move away from the traditional way

for planning their marketing campaigns, promotions and market research by understanding the changes in their customers' purchasing behavior. Change mining helps managers to make better marketing strategies.

1.2 Problem definition:
Kalleh Company is a private manufacturer and distributor of food products in Iran. It produces different categories of food products, from dairy products to ice cream, meats and sauces. It has more than 10 different categories and about 800 products. The company now faces the challenge of increasing competition, for several reasons. First, because of its wide range of products, it has to compete in different food markets such as dairy, ice cream and meat. As a result, it competes with many competitors with different product categories and different marketing strategies. There are also some powerful governmental companies that make competition tougher for Kalleh. In such a market, customer behavior may change in response to companies' strategies and also as customers' own needs change. To respond to changes in customer purchasing behavior in time, and not to fall behind customer needs and the competition, Kalleh Company needs to mine changes in customer purchasing behavior. The goal of Kalleh Company is to mine changes in the purchasing behavior of customers in different segments, and to respond to these changes accurately and in time in order to increase its return on investment (ROI).

1.3 Purpose of this study:
The purpose of this study is to mine changes in customer purchasing behavior. To reach this goal, we need to build customer purchasing patterns based on the customer, product and transaction data collected in databases. Data mining techniques can help us reach this goal. According to (Song et al, 2001), data mining is the process of exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.

Many data mining studies have focused on developing techniques to build precise models to predict customers' behavior, and to set up marketing strategies and customization. According to (Nemati & Barko, 2001; cited by Nemati, H.R., Barko, C. D., 2003), most data mining applications (72%) are centered on predicting customer

behavior. Comparatively little attention has been paid to discovering changes in databases collected over time (Liu et al., 2000). The literature review shows that too much time is spent worrying about absolute numbers, such as Lifetime Value, when what should really be observed is how relative numbers change over time. From a marketing viewpoint, the customers with the highest potential ROI are those who are in the process of changing their behavior, either accelerating or ending their relationship with the company (Novo, j., 2008).

In many applications, mining changes can be more crucial than producing precise prediction models, which are the focus of existing data mining research. However accurate a model is, it is passive by itself, because it can only predict based on patterns mined from old data. Acting on the model should not lead to actions that change the environment, because otherwise the model will cease to be correct (Liu et al., 2000). Building prediction models is more appropriate in areas where the environment is comparatively stable. However, in many business conditions, constant human intervention in the environment is a fact. Businesses simply cannot let nature take its course. They constantly need to act in order to provide better services and products, by finding the interesting changes and stable patterns in customer behavior. Even in a comparatively stable environment, changes are unavoidable due to internal and external factors (Liu et al., 2000). From this viewpoint, the question "Which patterns exist?", as answered by state-of-the-art data mining technology, is replaced by the question "How do patterns change?" (Böttcher, M., et al, 2006).

Indeed, the discovery of interesting and previously unknown changes in customer, product and transaction data not only lets the user monitor the influence of past business decisions but also prepares today's business for tomorrow's needs (Böttcher, M., et al, 2006). Major changes often need immediate attention and action to modify existing practices and/or to change the domain situation (Liu et al, 2000). By using a change mining methodology, Kalleh Company can detect different kinds of changes in customer purchasing behavior and build stronger relationships with its customers. Understanding changes in customer behavior can also help managers to set up effective and efficient promotion campaigns. (Liu et al, 2000) mentioned that there are two main goals for mining changes in a business environment:

"To follow the trends": The main feature of this kind of application is the word "follow". Companies want to know where the trend is going so as not to be left behind. They need to investigate customers' changing behaviors so as to provide products and services that suit the changing needs of the customers.
"To stop or to delay undesirable changes": In this kind of application, the keyword is "stop". Companies want to know about undesirable changes as soon as possible and to plan corrective measures to stop or to delay the pace of such changes.

The overall procedure consists of several steps. In the literature, there are some methods for change mining in dynamic situations. According to (Song et al, 2001), the majority of data mining techniques, like association rules and neural networks, cannot be used alone because they cannot handle dynamic situations well. (Song et al, 2001) and (Chen et al, 2005) developed a methodology for mining changes. They used association rules, introduced by (Agrawal et al., 1993), to detect interesting association relationships among a large set of data items. The methodology detects all kinds of changes. According to (Chen et al, 2005), change mining has several steps including data preprocessing, customer segmentation, association rule mining and change mining. In the first step, customers are segmented based on their behavioral variables: recency, frequency and monetary (RFM). Then, by building association rules from the customer behavioral variables (RFM), customer data and transaction data, we describe customer purchasing behavior in two different time snapshots. In the end, we compare the rules generated for each segment to mine changes in customer purchasing behavior. To mine changes, various algorithms and techniques must be used, and implementing them requires extensive programming. Finally, we combined all of the algorithms to build a change mining package.
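The segmentation step described above starts from the three RFM variables. As a rough illustration only (not the change mining package built in this thesis), the following sketch derives recency, frequency and monetary values per customer from a small transaction list. The sample data, the function name rfm and the reference date are hypothetical.

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount)
transactions = [
    ("c1", date(2008, 1, 5), 120.0),
    ("c1", date(2008, 3, 20), 80.0),
    ("c2", date(2008, 2, 11), 300.0),
]

def rfm(transactions, reference_date):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    table = {}
    for customer, day, amount in transactions:
        last, freq, total = table.get(customer, (None, 0, 0.0))
        # Keep the most recent purchase date seen so far
        last = day if last is None or day > last else last
        table[customer] = (last, freq + 1, total + amount)
    # Recency is the number of days between the last purchase and the reference date
    return {
        customer: ((reference_date - last).days, freq, total)
        for customer, (last, freq, total) in table.items()
    }

scores = rfm(transactions, date(2008, 3, 31))
for customer, (recency, frequency, monetary) in sorted(scores.items()):
    print(customer, recency, frequency, monetary)
```

In a segmentation step, such values would typically be compared against thresholds or quantiles to assign each customer to a segment; that step is not shown here.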
1.4 Research question:
Based on the problem discussion above, the purpose of this study is to mine changes in customer purchasing behavior. To reach this purpose, the research question is as follows: How can businesses be responsive to changes in customer behavior in a dynamic market? In addition, how can businesses detect and access the changes in customer behavior patterns so as to respond accurately and in time?

1.5 Research motivation:
Recently, we have witnessed an explosion of data produced and collected by individuals and organizations. This fast growth in data and databases has created the problem of data overload (Li, X. B., 2005). More recently, increased computing power has led to greater flexibility in the models one can use and in the amount of data that can be stored and processed (Bolton, R. J., 2004); as a result, data mining techniques have emerged and flourished in the past several years to meet this demand (Li, X. B., 2005). Organizations are starting to understand the importance of data mining in their marketing strategies. At the same time, businesses face the challenge of a constantly evolving market where customer needs change all the time (Chen et al, 2005). In such an environment, knowing the changes and responding to them rapidly and correctly is highly important. If businesses cannot meet customer needs as they change over time, they will lose their customers, who are the source of their ROI. Some work has been done on change mining in retailing. One business that change mining can help to improve is FMCG distribution, which faces a dynamic market with a huge variety of products and competitors. The purpose of change mining is to follow the trends in customer purchasing patterns, detect the changes and respond to them in time, so as to satisfy customers better and meet their needs.

1.6 Demarcation:
This study focuses on mining changes in customer purchasing behavior based on the customer and purchasing transaction data stored in a database. Change mining was carried out on data gathered from the database of an FMCG distribution company in Iran. Most of the literature reviewed is about mining changes in customer purchasing behavior. Our work focuses on building customer behavior patterns by association rule mining and on comparing the resulting rules. These patterns are based only on the customers' previous transactions.

1.7 Research outline:
This thesis consists of five chapters. Chapter 1 is the introduction, which gives a brief background to the subject followed by the research question, objectives and motivation. Chapter 2 is a literature review covering data mining, association rules, change mining and customer segmentation. Chapter 3 describes our research methodology, including data preprocessing, market segmentation, mining customer behavior and change mining. Chapter 4 presents the results and analysis. Chapter 5 is the last chapter and contains the conclusion, limitations and further research.

Chapter 2: Literature Review
    Review of Mining Customer Behavior
    Review of Data Mining
    Review of Association Rule Mining
    Review of Change Mining
    Review of Customer Segmentation

2.1 Mining Customer Behavior:
Different methods to describe customer behavior exist in the literature. Among them are various types of conjunctive rules for building customer behavior patterns, including association rules and classification rules (Agrawal R. et al, 1996 & Breiman L., et al., 1984 cited in Adomavicius, G., Tuzhilin, A., 2001). Using rules to describe customer behavior has certain advantages. Besides being a descriptive way to portray behavior, a conjunctive rule is a well-studied concept that is widely used in data mining, expert systems and many other areas. In addition, researchers have proposed many rule discovery algorithms in the literature, especially for association rules (Adomavicius, G., Tuzhilin, A., 2001). To discover rules that describe the behavior of customers, we can use various data mining algorithms, such as Apriori for association rule mining. Association rules were initially applied in market basket analysis to find relationships between product items purchased by customers at retail stores (Agrawal, Imielinski, & Swami, 1993; Srikant, Vu, & Agrawal, 1997 cited by Chen et al, 2005). In a study of customer behavior, we can apply association rules to find correlations between customer demographic variables, purchased products and product databases (Song et al, 2001). In this chapter, we first review data mining and then association rules. The next topic is change mining of customer behavior in the literature. Finally, we give a brief review of customer segmentation.

2.2 Review of Data Mining
Data mining: in brief
Today, databases can be very large. Hidden strategic information can be found within this data, but when the amount of data is huge, drawing meaningful conclusions is not easy. The modern answer is data mining, which is used both to increase revenues and to reduce costs.

Many people use data mining as a synonym for another popular term, Knowledge Discovery in Databases (KDD). Others, in turn, define data mining as the core process of KDD. The KDD processes are shown in Figure 2.1 (Han, J., & Kamber, M., 2006). KDD usually has three processes. The first is preprocessing, which is executed before data mining techniques are applied to the right data. Preprocessing includes data cleaning, integration, selection and transformation. The main process of KDD is the data mining process. In this process, different algorithms are applied to produce

hidden knowledge. The last process is post-processing, which evaluates the mining result according to users' requirements and domain knowledge. If the evaluation result is satisfactory, the knowledge can be presented; otherwise we have to run some or all of those processes again until we get a satisfactory result (Han, J., & Kamber, M., 2006).

Figure 2.1: Knowledge Discovery in Database Processes

(Song et al, 2001) define data mining as a process of exploration and analysis of large quantities of data to discover meaningful patterns and rules. (Feelders et al, 2000) define the process of data mining as follows:

Source: (Feelders et al, 2000)
Figure 2.2: The major steps in the data mining process

The potential returns of data mining are immense. Innovative organizations worldwide are already using data mining to attract higher-value customers, to reconfigure their product offerings to increase sales, and to minimize losses due to mistakes or fraud.

Data mining Functions:
(Dunham, 2002) divides data mining into two categories: descriptive and predictive (Figure 2.3).

Source: (Dunham, 2002)
Figure 2.3: Classification of DM techniques

The first and simplest analytical step in data mining is to describe the data:

summarize its statistical attributes (such as means), review it visually with charts and graphs, and look at correlations among variables. The most important step is selecting, gathering and exploring the right data. Sometimes data description alone cannot provide an action plan. You must build a predictive model based on patterns determined from known results, and then test that model on a new data sample. A good model is never the same as reality, but it can be a useful guide to understanding your business. Finally, the model should be verified empirically (Twocrows.com, 2005). In the next sections, we briefly explain three important data mining techniques.

Classification in brief:
According to (Han and Kamber, 2006), classification is the automatic building of a model that classifies a set of objects so as to predict the class or missing attribute value of future objects whose class may not be known. The process has two steps. In the first step, a model is built to describe the characteristics of a set of data classes or concepts, based on a training data set. Because the data classes or concepts are predefined, this step is also known as supervised learning. In the second step, the model is used to predict the classes of future data or objects. There are several techniques for classification (Han and Kamber, 2006). Much research has been done on classification by decision tree and plenty of algorithms have been designed; Murthy made an extensive survey of decision tree induction (Murthy, 1998; cited by Han, J., & Kamber, M., 2006). Bayesian classification is another technique, which can be found in (Duda and Hart, 1973 cited by Han, J., & Kamber, M., 2006). Nearest neighbor methods are also discussed in many statistical texts on classification, such as (Duda and Hart, 1973, cited by Han, J., & Kamber, M., 2006) and (James, 1985, cited by Han, J., & Kamber, M., 2006).
Besides these, many other machine learning and neural network techniques are used to help build classification models.

Clustering in brief:
As mentioned above, classification can be regarded as a supervised learning process. Clustering is another mining technique similar to classification; however, clustering is an unsupervised learning process. "Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects" (Han, J., & Kamber, M., 2006), so that objects within the same cluster are similar to some extent and dissimilar to objects in other clusters. In classification each record belongs to a predefined class, while in clustering there is no predefined class. In clustering, objects are grouped together based on their similarities (Han, J., & Kamber, M., 2006). Similarities

between objects are expressed by similarity functions; usually similarities are defined quantitatively as distances or other measures by the corresponding domain experts (Han, J., & Kamber, M., 2006). Most clustering applications are in market segmentation. By clustering their customers into different groups, business organizations can provide different personalized services to different market groups (Han, J., & Kamber, M., 2006). An extensive survey of current clustering techniques and algorithms is available in (Berkhin, 2002; cited by Han, J., & Kamber, M., 2006).

Association Rules in Brief:
Association rule mining is one of the most important techniques of data mining. It was first introduced by (Agrawal et al, 1993). The goal of this technique is to extract interesting correlations, frequent patterns and associations among sets of items in transaction databases or other data repositories (Agrawal et al, 1993). Association rules are used extensively in various areas. In this study we use association rules to mine customer behavior patterns and find behavioral changes. The next section reviews association rule mining.

2.3 Association Rule Mining Review:
Association Rule mining problem:
In this section, we introduce the association rule mining problem in detail. A typical association rule is an implication of the form A → B, where A is an itemset and B is an itemset that contains only a single atomic condition (Berry & Linoff, 2004). Two measures are used to evaluate each association rule. The support of an association rule is the percentage of records containing both A and B, and the confidence of a rule is the percentage of records containing itemset A that also contain itemset B. The support shows the usefulness of a discovered rule and the confidence shows its certainty (Berry & Linoff, 2004). We can calculate another measure called lift.
It measures the difference between the confidence of a rule and the expected value of that confidence. (Berry & Linoff, 2004) define lift (also called improvement) as a measure of how much better a rule is at predicting the result than just assuming the result in the first place. Lift is the ratio of the density of the target after application of the left-hand side to the density of the target in the population (Berry & Linoff, 2004). Another way of saying this is that lift is the ratio of the number of records that support the entire rule to the number that would be expected, assuming that there is no relationship between the products (the exact formula is given later in the chapter) (Berry & Linoff, 2004).
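The three measures defined above can be illustrated with a minimal sketch. It assumes transactions are represented as Python sets of item names, and it takes lift to be the confidence of A → B divided by the support of B, which is one reading of the "density of the target" definition quoted above. The basket data and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence of lhs -> rhs relative to the overall frequency of rhs."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

# Hypothetical market baskets
baskets = [
    {"milk", "yogurt"},
    {"milk", "cheese"},
    {"milk", "yogurt", "cheese"},
    {"ice cream"},
]

print(support(baskets, {"milk", "yogurt"}))       # rule support for milk -> yogurt
print(confidence(baskets, {"milk"}, {"yogurt"}))  # rule confidence
print(lift(baskets, {"milk"}, {"yogurt"}))        # rule lift
```

A lift above 1 indicates that the left-hand side makes the right-hand side more likely than it is in the population as a whole; a lift near 1 suggests no relationship.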

2.3.2 Apriori Algorithm
Association rule mining is the discovery of association rules that satisfy predefined minimum support and confidence thresholds in a database (Agrawal, R., & Srikant, R., 1994). According to (Agrawal, R., & Srikant, R., 1994), this problem is usually decomposed into two sub-problems. The first is to find those itemsets whose occurrences exceed a predefined threshold in the database; these itemsets are called frequent or large itemsets. This sub-problem can be further divided into two: candidate large itemset generation and frequent itemset generation. Large or frequent itemsets are those whose support exceeds the support threshold, and candidate itemsets are those that are expected or hoped to be large or frequent. The second sub-problem is to produce association rules from those large itemsets under the constraint of minimal confidence. The whole process of the standard problem of mining association rules is shown in Figure 2.4.

Source: (Agrawal et al, 1993)
Figure 2.4: Classic Problem of association rule mining

The overall performance of mining association rules is determined mainly by the first step (Agrawal, R., & Srikant, R.). After the large itemsets are found, the corresponding association rules can be derived in a straightforward manner. The focus of most mining algorithms is therefore on counting large itemsets efficiently, and many efficient solutions have been designed for this purpose (Kantardzic.M, 2003).
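The two sub-problems described above can be sketched as follows. This is a simplified, unoptimized illustration of the Apriori idea (level-wise candidate generation, support counting, then rule derivation with a single item on the right-hand side), not the tuned algorithm of Agrawal and Srikant and not the implementation used in this thesis. The function names, thresholds and basket data are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for all frequent itemsets."""
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Frequent itemset generation: count each candidate and keep the frequent ones
        level = {}
        for candidate in current:
            sup = sum(candidate <= t for t in transactions) / n
            if sup >= min_support:
                level[candidate] = sup
        frequent.update(level)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets and
        # prune any candidate that has an infrequent k-subset
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def rules(frequent, min_confidence):
    """Derive rules lhs -> item from frequent itemsets of size two or more."""
    found = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            lhs = itemset - {item}
            conf = sup / frequent[lhs]  # lhs is frequent because itemset is
            if conf >= min_confidence:
                found.append((lhs, item, sup, conf))
    return found

# Hypothetical market baskets
baskets = [
    {"milk", "yogurt"},
    {"milk", "cheese"},
    {"milk", "yogurt", "cheese"},
    {"ice cream"},
]
freq = frequent_itemsets(baskets, min_support=0.5)
found = rules(freq, min_confidence=0.8)
for lhs, item, sup, conf in found:
    print(sorted(lhs), "->", item, sup, conf)
```

The pruning step relies on the fact that every subset of a frequent itemset must itself be frequent, which is why most of the cost lies in the first sub-problem.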

22 Different kinds of produced AR: One attraction of association rules is the clarity and utility of the results, which are in the form of rules about groups of products. There is a spontaneous attraction to an association rule because it shows how tangible products and services group together (Berry & Linoff, 2004). While association rules are easily understandable, they are not always useful (Berry & Linoff, 2004). There are 3 types of generated association rules: Actionable rules, trivial rules and inexplicable rules. Actionable rules are the useful rule holds high-quality, actionable information. Once the pattern is found, it is not often hard to justify, and thinking about rule in the real environment can lead to insights and actions. Because the rule is easily understood, it recommends plausible causes and possible interventions (Berry & Linoff, 2004). Another type of association rule is trivial rules.. Many people in business know trivial results. Although it is valid and well supported in the data, it is still not practical. A simple example is customers purchasing hamburgers buy hamburger buns. A subtler problem drops within the same category. An apparently interesting result may be the result of past marketing programs and product bundles. Although other data mining techniques have this problem but market basket analysis is vulnerable to reproducing the success of prior marketing campaigns because of its dependence on un-summarized point-of-sale data, exactly the same data that defines the success of the campaign. Trivial rules have one advantage and that is when a rule should appear 100 percent of the time, the few cases where it does not hold supply a lot of information about data quality. An area where business operations, data collection, and processing may need to be more refined indicates the exceptions to trivial rules (Berry & Linoff, 2004). Inexplicable results seem to have no interpretation and do not recommend a course of action. 
A word of caution applies: when market basket analysis is used, many of the results are either trivial or inexplicable. Trivial rules reproduce common knowledge about the business, which wastes the effort spent on applying complex analysis techniques, and inexplicable rules are flukes in the data and are not actionable (Berry & Linoff, 2004).

ARM Approaches Classification: Association rule mining is a well-studied research area; in this section, we review only some basic and classic approaches to association rule mining. As

mentioned before, the second sub-problem of ARM is straightforward, so most of these approaches focus on the first sub-problem. As mentioned, the first sub-problem can be further divided into two processes: candidate large itemset generation and frequent itemset generation. Most of the surveyed algorithms for mining association rules are quite similar; they differ in the extent to which specific improvements have been made. According to (Zhao, Q., Bhowmick, S.S., 2003), there are three milestones in the classic ARM problem: the Apriori approach, tree structure approaches and special issues in ARM. Besides these, there is another approach from (Zaki et al, 1999): the class-based algorithms approach.

The literature offers several features by which ARM algorithms can be classified. The features below are the basic ones that best differentiate the various algorithms; they are summarized in Table 2.1.

Target: Basic association rule algorithms find all rules that meet the support and confidence thresholds. However, more efficient algorithms can be used. One approach is to add constraints on the rules that are produced. Algorithms can be categorized as complete (all association rules satisfying the support and confidence thresholds are found), constrained (some subset of all the rules is found, based on a technique for limiting them), and qualitative (a subset of the rules is produced based on additional measures, beyond support and confidence, that need to be satisfied) (Dunham M.H., et al, 2001).

Type: This feature indicates the type of association rules that are produced (for example regular (Boolean), spatial, temporal, generalized, qualitative, etc.)
(Dunham M.H., et al, 2001).

Data type: Besides data stored in a database, other types of data are also important. Association rules mined from plain text may yield very important information. For example, the words data, mining, and decision may be highly dependent in a paper on knowledge discovery (Dunham M.H., et al, 2001).

Data source: In addition to market basket data, association rules over data absent from the database may play an important role in a company's decision making (Dunham M.H., et al, 2001).

Technique: All approaches to date are based on first finding the large itemsets. There could, of course, be other techniques that do not require large itemsets to be found first. Although we are not aware of any technique to date that avoids generating large itemsets, this possibility certainly exists, with the potential for improved performance. However, (Agrawal et al, 1998) cited in (Dunham M.H., et al, 2001) proposed strongly collective itemsets as a way to evaluate and find itemsets. This notion is entirely different from the support and confidence used in the large itemset approach. An itemset I is said to be strongly collective at level K if the collective strength C(K) of I, as well as of any subset of I, is at least K (Dunham M.H., et al, 2001).

Itemset Strategy: Different algorithms handle the generation of itemsets differently. This feature indicates how the algorithm looks at transactions as well as when the itemsets are produced. One technique, Complete, produces and counts all potential itemsets. The most common approach is the one introduced by Apriori. With this strategy, the set of itemsets to count is produced prior to scanning the transactions, and this set remains constant during the scan. A dynamic strategy produces the itemsets during the scanning of the database itself. A hybrid technique generates some itemsets prior to the database scan, but also adds new itemsets to the counting set during the scan (Dunham M.H., et al, 2001).

Transaction Strategy: Different algorithms treat the set of transactions in different ways. This feature indicates how the algorithm scans the set of transactions. The complete strategy checks all transactions in the database. With the sample approach, some subset of the database (a sample) is checked prior to processing the complete database. The partition techniques divide the database into partitions; scanning the database then requires that the partitions be checked individually and in order (Dunham M.H., et al, 2001).
Itemset Data Structure: As itemsets are produced, different data structures can be used to keep track of them. The most common approach seems to be a hash tree. Alternatively, a trie or lattice may be used. At least one technique suggests a virtual trie structure in which only a portion of the complete trie is actually materialized (Dunham M.H., et al, 2001).

Transaction Data Structure: "Each algorithm assumes that the transactions are stored in some basic structure, usually a flat file or a TID list" (Dunham M.H., et al, 2001).

Optimization: Many algorithms have been introduced improving on earlier

algorithms by applying an optimization strategy. Various strategies have considered optimization based on available main memory, whether or not the data is skewed, and pruning of the itemsets to be counted (Dunham M.H., et al, 2001).

Architecture: Some algorithms are designed to run sequentially on a centralized single-processor architecture. Alternatively, algorithms have been designed to work in parallel on a multiprocessor or distributed architecture (Dunham M.H., et al, 2001).

Parallelism Strategy: Parallel algorithms can be further described as using task or data parallelism (Dunham M.H., et al, 2001).

The literature contains some other features by which association rule mining methods can be categorized; they are considered below.

Counting Strategy: This refers to the method used to count the occurrences of candidate itemsets. Horizontal counting and vertical intersection are the two main approaches. Horizontal counting determines the support of a candidate itemset by scanning the transactions one by one and incrementing the counter of the itemset whenever it is a subset of the transaction. This approach works well for rarely occurring candidates because only those transactions containing the itemset need to be checked. The candidate lookup operation, however, is very expensive for candidates of large size (Su, J. H., Lin, W. Y., 2004). Vertical intersection, on the other hand, is applied when the database is in a vertical format, in which every record is associated with an item and stores the identifiers of the transactions containing that item, called a Tidlist. Although the vertical intersection scheme avoids the I/O cost of scanning the database, it has the following shortcoming: when a candidate itemset has a support count much smaller than the number of transactions, a large number of unnecessary intersections are performed (Su, J. H., Lin, W. Y., 2004).

Search direction: according to (Su, J.
H., Lin, W. Y., 2004), there are two main methods for search direction: bottom-up traversal and top-down traversal. Today, most Apriori-like approaches apply bottom-up traversal of the search space, which starts from all frequent 1-itemsets and moves upward to the longest frequent itemsets. The most important advantage of this model is that it can effectively prune the search space by exploiting the downward closure property: once an itemset is recognized as infrequent, all of its supersets are also infrequent. However, this benefit fades when most of the maximal frequent itemsets are located near the largest itemset of the search

lattice, due to a comparatively small support threshold. In this situation, there are very few itemsets to be pruned (Su, J. H., Lin, W. Y., 2004). The other itemset traversal method is top-down traversal, which proceeds in the opposite direction, i.e. starting from the longest itemsets and moving downward to the frequent 1-itemsets (Su, J. H., Lin, W. Y., 2004). This strategy is traditionally applied for discovering maximal frequent itemsets (Tseng, M.C. & Lin, W.Y., 2001; cited by Su, J. H., Lin, W. Y., 2004). Note, however, that although all frequent itemsets can be derived from the maximal ones, additional counting is needed to obtain their exact supports for computing the confidences of association rules. Moreover, if there are huge numbers of items and/or the support threshold is very low, many infrequent itemsets have to be visited before the maximal frequent itemsets are identified. This is why most work on frequent itemset mining adopts the bottom-up paradigm instead (Su, J. H., Lin, W. Y., 2004).

Search strategy: While the search direction determines the way the search space is explored, the search strategy determines the order in which itemsets are visited (Su, J. H., Lin, W. Y., 2004). One strategy is breadth-first search (BFS). Most Apriori-like algorithms apply BFS because it facilitates the pruning of candidates with downward closure. This strategy, however, needs more memory to keep the frequent subsets of the pruned candidates (Su, J. H., Lin, W. Y., 2004). The other strategy is depth-first search (DFS), which recursively visits the descendants of an itemset. In the literature, this strategy is usually combined with the vertical intersection counting strategy because it is enough to keep in memory the tidlists corresponding to the itemsets on the path from the root down to the itemset currently being inspected (Su, J. H., Lin, W.
Y., 2004).

Table 2.1: Factors for classification of ARM

DIMENSION: VALUES
Target: Complete, Constrained, Qualitative
Type: Regular (Boolean), Generalized, Quantitative, etc.
Data type: Database Data, Text
Data source: Market Basket, Beyond Basket
Technique: Large Itemset, Strongly Collective Itemset
Itemset Strategy: Complete, Apriori, Dynamic, Hybrid
Transaction Strategy: Complete, Sample, Partitioned
Itemset Data Structure: Hash Tree, Trie, Virtual Trie, Lattice
Transaction Data Structure: Flat File, TID
Optimization: Memory, Skewed, Pruning
Architecture: Sequential, Parallel
Parallel Strategy: None, Data, Task
Pattern Kind: Sequential Pattern, Frequent Itemset, Structured Pattern
Rule Kind: Association Rule, Strong gradient relationship, Correlation (Han book)
Counting Strategy: Horizontal, Vertical
Search Direction: Bottom-Up Traversal, Top-Down Traversal, Hybrid
Search Strategy: BFS, DFS
Candidate generation: Complete, Heuristic

Association Rule Mining Approaches: Apriori Approach

AIS Algorithm: The AIS (Agrawal, Imielinski, Swami) algorithm was the first algorithm proposed for mining association rules, in (Agrawal et al, 1993). It concentrates on improving the quality of databases together with the functionality necessary to process decision support queries. According to (Zhao, Q., Bhowmick, S.S., 2003), this algorithm generates only association rules with a one-item consequent. That is, the consequent of each rule contains only one item; for example, we only

produce rules like ABC → D but not rules like AB → CD.

Disadvantage: The main disadvantage of the AIS algorithm is that it generates too many candidate itemsets that finally turn out to be small, which requires more space and wastes effort. In addition, the algorithm requires too many passes over the whole database (Zhao, Q., Bhowmick, S.S., 2003).

Apriori Algorithm: Apriori is a great improvement in the history of association rule mining. The Apriori algorithm was first introduced in (Agrawal, R., & Srikant, R., 1994). AIS is a straightforward approach that requires many passes over the database, generating many candidate itemsets and storing a counter for each candidate, while most of them turn out not to be frequent. Apriori is more efficient during the candidate generation process for two reasons: it uses a different candidate generation method and a new pruning technique (Zhao, Q., Bhowmick, S.S., 2003).

a) Problems and limitations of Apriori: One bottleneck is the complex candidate generation process, which consumes most of the time, space, and memory. Another bottleneck is the multiple scans of the database. Many new algorithms were designed as modifications or improvements of the Apriori algorithm. Commonly, there were two approaches. The first tries to reduce the number of passes over the whole database, or to replace the whole database with only part of it based on the current frequent itemsets. The other explores different pruning techniques to make the number of candidate itemsets much smaller. Apriori-TID and Apriori-Hybrid (Agrawal, R., & Srikant, R., 1994), DHP (Park et al, 1995; cited by Zhao, Q., Bhowmick, S.S., 2003), and SON (Savasere et al, 1995) are modifications of the Apriori algorithm (Zhao, Q., Bhowmick, S.S., 2003).

Optimized Apriori algorithms: In response to the problems of Apriori mentioned in the previous section, some new approaches were introduced.
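Before turning to the optimizations, the candidate generation and pruning of plain Apriori discussed above can be made concrete with a minimal sketch. It is a simplified illustration and not the procedure as published: candidates are joined by taking unions of frequent k-itemsets rather than by the prefix-based join of the original paper, and the function names apriori_gen and apriori are assumptions made for this example.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune by subsets."""
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == len(a) + 1:
                # Prune step: every k-subset of the candidate must be frequent
                # (downward closure), otherwise the candidate is discarded.
                if all(frozenset(s) in frequent_k
                       for s in combinations(union, len(a))):
                    candidates.add(union)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: support count} for itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(current)
    while current:
        candidates = apriori_gen(current)
        # One pass over the database per level to count the candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        result.update(current)
    return result
```

The loop performs one database pass per itemset length, which is the source of the repeated-scan bottleneck noted above, and the candidate set is fixed before each pass begins.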
The following sections present them: transaction and item pruning, and reduction of the number of database passes.

a) Transaction and Item Pruning: This is one of the main optimizations of the Apriori algorithm. There is no

need to inspect the whole database each time the occurrences of candidate itemsets are counted. This optimization drastically reduces the time needed to count the support of the candidate sets and improves performance. Transaction pruning is present in the following algorithms: AprioriTID, Apriori Hybrid and DHP.

AprioriTID, Apriori Hybrid: AprioriTID was introduced in the same paper as Apriori. Although the paper does not state it explicitly, AprioriTID uses transaction pruning to improve on Apriori's performance. The main difference is that it does not use the whole database to count the support of candidate sets; instead, it uses another approach (Ayad, A. M., 2000). The main disadvantage of this algorithm is that the alternative set that represents the database may exceed the size of the actual database in the early stages, thus losing its edge over Apriori. Because of this disadvantage, another algorithm, Apriori Hybrid, was introduced. It uses Apriori in the first stages and then switches to AprioriTID when transaction pruning becomes more effective (Ayad, A. M., 2000).

DHP: DHP (Direct Hashing and Pruning) is another algorithm, introduced by (Park et al, 1995). It uses probabilistic counting to reduce the number of candidate itemsets counted during each round of Apriori execution. This reduction is achieved by subjecting each candidate k-itemset to a hash-based filtering step in addition to the pruning step (Ayad, A. M., 2000). During candidate counting in round k-1, the algorithm builds a hash table. Each entry in the hash table is a counter that holds the sum of the supports of the k-itemsets that hash to that particular entry. The algorithm uses this information in round k to prune the set of candidate k-itemsets. After subset pruning as in Apriori, the algorithm can remove a candidate itemset if the count in its hash table entry is smaller than the minimum support threshold.
DHP has two advantages. First, the algorithm is also based on the monotone Apriori property: a hash table is built to reduce the candidate space by pre-computing the approximate support of the (k+1)-itemsets while counting the k-itemsets. Second, DHP applies transaction trimming, removing the transactions that do not contain any frequent items. However, this trimming and the pruning properties caused some problems that made it impractical in many cases (Ayad, A. M., 2000).
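The hash-based filtering step of DHP described above can be sketched as follows for the transition from 1-itemsets to candidate 2-itemsets. This is a minimal illustration under stated assumptions: the table size NUM_BUCKETS and the hash function bucket_of are hypothetical choices made for the example, and transaction trimming is omitted.

```python
from itertools import combinations

NUM_BUCKETS = 7  # hypothetical table size; practical tables are far larger


def bucket_of(itemset):
    """Hypothetical deterministic hash mapping an itemset to a bucket index."""
    return sum(sum(ord(ch) for ch in item) for item in itemset) % NUM_BUCKETS


def first_pass(transactions):
    """Count 1-itemsets and, in the same scan, hash every 2-subset to a bucket."""
    item_counts = {}
    buckets = [0] * NUM_BUCKETS
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair)] += 1
    return item_counts, buckets


def candidate_pairs(item_counts, buckets, min_count):
    """Keep a pair only if both items are frequent and its bucket is heavy enough."""
    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    return [
        pair
        for pair in combinations(frequent_items, 2)
        if buckets[bucket_of(pair)] >= min_count
    ]
```

Because several itemsets may share a bucket, a bucket count is an upper bound on the support of each itemset hashed to it. The filter can therefore let infrequent candidates through, but it never discards a frequent one.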

b) Reducing the number of database passes: As mentioned before, the main disadvantage of classical Apriori is the multiple passes it has to make over the database, the number of which equals the length of the longest frequent itemset (pattern) present in the database (Zhao, Q., Bhowmick, S.S., 2003). Many optimization efforts focused on reducing the number of database passes. They differed, however, in how the number of passes was decreased. This is the focus of this section.

Database Partitioning: (Savasere et al, 1995) developed Partition, an algorithm that requires only two scans of the transaction database. The database is divided into disjoint partitions, each small enough to fit in memory. In the first scan, the algorithm reads each partition and computes the locally frequent itemsets in each partition using Apriori. In the second scan, the algorithm counts the support of all locally frequent itemsets against the complete database. If an itemset is frequent with respect to the complete database, it must be frequent in at least one partition; therefore, the second scan counts a superset of all potentially frequent itemsets. The main achievement of Partition is the reduction of database activity. It was shown that this reduction was not obtained at the expense of more CPU utilization. It was also shown, however, that the number of partitions greatly affects the performance of the algorithm by affecting the number of locally frequent itemsets that turn out to be globally infrequent. The algorithm was also shown to be vulnerable to data skew (Ayad, A. M., 2000).

Dynamic itemset counting: (Brin et al, 1997) proposed the Dynamic Itemset Counting (DIC) algorithm. DIC partitions the database into several blocks marked by start points and repeatedly scans the database. In contrast to Apriori, DIC can add new candidate itemsets at any start point, instead of just at the beginning of a new database scan.
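For comparison with DIC, the two-scan scheme of the Partition algorithm described earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions: brute-force enumeration stands in for the per-partition Apriori run, partitions are contiguous slices of an in-memory list, and the names local_frequent and partition_mine are hypothetical.

```python
from itertools import combinations


def local_frequent(partition, min_fraction):
    """Locally frequent itemsets of one partition (brute force, not Apriori)."""
    threshold = min_fraction * len(partition)
    counts = {}
    for t in partition:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                key = frozenset(subset)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= threshold}


def partition_mine(transactions, num_partitions, min_fraction):
    """Two-scan mining: local candidates first, then one global counting scan."""
    transactions = [frozenset(t) for t in transactions]
    size = -(-len(transactions) // num_partitions)  # ceiling division
    partitions = [transactions[i:i + size]
                  for i in range(0, len(transactions), size)]
    # Scan 1: the union of locally frequent itemsets is the global candidate set.
    candidates = set()
    for part in partitions:
        candidates |= local_frequent(part, min_fraction)
    # Scan 2: count every candidate against the complete database.
    threshold = min_fraction * len(transactions)
    counts = {c: sum(c <= t for t in transactions) for c in candidates}
    return {s: n for s, n in counts.items() if n >= threshold}
```

The candidate set collected in the first scan may contain itemsets that are locally frequent but globally infrequent. These are exactly the itemsets whose number depends on how the database is partitioned, and the second scan removes them.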
At each start point, DIC estimates the support of all itemsets that are currently being counted and adds new itemsets to the set of candidate itemsets if all of their subsets are estimated to be frequent (Brin et al, 1997). If DIC adds all frequent itemsets and their negative border to the set of candidate itemsets during the first scan, it will have counted each itemset's exact support at some point during the second scan; thus DIC will complete in two scans (Ayad, A. M., 2000). The Dynamic Itemset Counting (DIC)