CERTIFICATE

This is to certify that the thesis titled "Data Mining in Retailing in India: A Model Based Approach", submitted by Ruchi Mittal to Maharishi Markandeshwar University, Mullana (Ambala) for the award of the degree of Doctor of Philosophy in Computer Science, is a bona fide record of original work done under my supervision and guidance. The work contained in this thesis has not been submitted to any other University or Institute for the award of any other degree or diploma.

Dr. NAVEETA MEHTA
Associate Professor
M.M. Institute of Computer Technology & Business Management
M.M. University, Mullana (Ambala)
Haryana (India)
ACKNOWLEDGEMENT

I express my gratitude to my supervisor, Dr. Naveeta Mehta, Associate Professor, Maharishi Markandeshwar Institute of Computer Technology & Business Management, M.M. University, Mullana, whose untiring guidance has made it possible for me to complete this work. Her dedication to academic life, her discipline, and her straightforward approach have had a great impact on my professional and personal life. She is humane, willing to help others, caring towards everyone, and always concerned about progress. With these rare qualities, I found in her not merely a supervisor but a noble soul, a Guru.

I express my gratitude to Dr. Dimple Juneja, Principal and Professor, Maharishi Markandeshwar Institute of Computer Technology & Business Management, M.M. University, Mullana, for her mentorship and guidance at various stages of this research.

I am indebted to all my colleagues at MAIMT, Jagadhri, especially my Director, Dr. Raj Kumar, for their constant support and cooperation throughout my thesis. I would also like to express my gratitude, though it is difficult to put into words, to Dr. Anil Kapil, Professor and Head, Computer Science and Technology, Haryana Institute of Engineering & Technology, Kaithal, and Dr. Sangeeta Gupta, Director & Professor, Om Institute of Technology and Management (Mgt), Juglan, Hisar, who have always been there to extend moral support and professional mentoring.
I would also like to thank my mother, Smt. Raj Aggarwal, who, more than a mother, has always been a friend to me and has stood by me through thick and thin. I am also thankful to my better half, Dr. Amit Mittal, for his personal and academic support, and to my child, Yash, to whom I dedicate this thesis. I am especially thankful to my brother and Prime Minister awardee, Ishan Aggarwal, who, through his academic achievements, has raised the bar of academic excellence in the family. I am also thankful to my father, Shri Ishwar Aggarwal, my father-in-law, Shri Sat Paul Mittal, and my mother-in-law, Smt. Trishla Mittal, for showing faith in me and for ensuring a conducive environment for my professional pursuits.

I offer my regards to all those whom I have not mentioned but who supported or inspired me in any respect during the completion of my work.

Last but not least, I thank almighty GOD for always being there and for seeing me through the tough times.

RUCHI MITTAL

"If I have seen further, it is by standing on the shoulders of giants"
- Isaac Newton
ABSTRACT

Data mining is an emerging interdisciplinary field that focuses on accessing information useful for high-level decisions. It draws on machine learning, statistics and probability, online analytical processing, data visualization, information science, high-performance computing, etc. Data mining enables business executives to manage their data and to make relevant decisions. Simply stated, data mining refers to extracting, or mining, knowledge from large amounts of data.

Retail is amongst the major fields of application of data mining technology. It is India's largest industry, accounting for over 10 per cent of GDP and 8 per cent of employment. In India, the industry is facing the new millennium, and the models of the past are not sufficient to ensure tomorrow's successes. Firms are increasingly relying on data mining techniques, which use existing databases to devise new strategies for growth, profitability and customer loyalty.

The thesis starts with a discussion of the concepts of database management systems, data warehousing and then data mining. It traces the historical development of data mining and retailing in India, which also provides the background material for the research problem. The objectives, scope and significance of the study are also clearly outlined.

The Review of Literature then provides the theoretical and conceptual framework of the research. The thesis reviews work done in the identified field of interest since 1983. This period was purposively selected to ensure that the technology under review, i.e. data mining, has had sufficient time to prove its usefulness in prediction and to show that its use brings positive results to organizations. The major
concepts and technologies reviewed are: Data Mining and Business Intelligence; Customer Segmentation and Profiling; Store Image/Attributes; Predictive Modeling through Data Mining; Cluster Analysis; Factor Analysis; and Multiple Regression Analysis. This section also identifies all the important variables and seeks to identify the gaps in the research done in the field, both in India and abroad.

The next section discusses the various data mining concepts, functionalities, tools and techniques. The disciplines of statistics and data mining are also discussed to show that these areas are highly interrelated and share a symbiotic relationship. This section helps the reader understand the various data mining algorithms and the way they can be used in business applications and in descriptive and predictive data mining modeling.

The research design is then discussed in terms of the type of research, the sampling plan, and the design of the survey instrument (questionnaire). This section also gives a detailed description of the data mining techniques that have been used to achieve the research objectives.

The next section covers the analysis of the data collected through the survey, together with its interpretation, so that meaningful recommendations and conclusions can be drawn. The analysis was performed using the following data mining techniques: (1) Two-step cluster analysis: this technique is used to identify clusters of customers as homogeneous groups drawn from an otherwise heterogeneous customer database; (2) Chi-square test: this is intended to test how likely it is that an observed distribution is due to chance, and it is also called a "goodness of fit" statistic; (3)
Factor analysis: this technique is used for data preprocessing and for reducing the data to a manageable level that can be used for further analysis, such as modeling and interpretation; and (4) Multiple regression analysis: this predictive data mining modeling technique is used to predict the dependent variable (in this case, store loyalty) on the basis of the independent variables (in this case, store image dimensions/attributes).

Finally, the thesis provides the findings, recommendations and future scope of the study. The customer groups identified are store-loyals and store non-loyals. The non-loyals form a significantly large group, and retailers need to understand the typical profile of such customers so that suitable strategies can be formulated to target them. The importance of various customer variables has also been identified. The six salient store attribute dimensions that emerged are discussed, and suggestions are put forth for the benefit of retailers and for future research.
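As an illustration of the chi-square "goodness of fit" statistic listed among the analysis techniques above, the following is a minimal sketch in plain Python. It is not taken from the thesis: the function name `chi_square_statistic` and the customer counts are hypothetical, chosen only to show how observed and expected frequencies are compared. The critical value 7.815 is the standard chi-square table value for 3 degrees of freedom at the 0.05 significance level.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square goodness-of-fit statistic: sum((O - E)^2 / E)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected frequencies must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data (an assumption for illustration only): 100 customers
# choosing among four options, tested against an even split of 25 each.
observed = [30, 20, 25, 25]
expected = [25.0, 25.0, 25.0, 25.0]

stat = chi_square_statistic(observed, expected)
dof = len(observed) - 1  # degrees of freedom for a goodness-of-fit test

# Standard table value for dof = 3 at the 0.05 significance level.
CRITICAL_VALUE_DOF3_ALPHA_05 = 7.815

if stat > CRITICAL_VALUE_DOF3_ALPHA_05:
    print(f"chi-square = {stat:.3f}: distribution differs from expected")
else:
    print(f"chi-square = {stat:.3f}: difference may be due to chance")
```

With these illustrative counts the statistic falls below the critical value, so the observed distribution would not be judged significantly different from the expected one.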
LIST OF ABBREVIATIONS

AI        Artificial Intelligence
ANOVA     Analysis of Variance
BI        Business Intelligence
CART      Classification and Regression Tree
CHAID     Chi-Square Automatic Interaction Detection
CRIS      Consumer Image of Retail Stores
CRM       Customer Relationship Management
DBMS      Database Management System
DM        Data Mining
EDI       Electronic Data Interchange
EIS       Executive Information System
FDI       Foreign Direct Investment
GRDI      Global Retail Development Index
IR        Information Retrieval
KDD       Knowledge Discovery in Databases
MANOVA    Multivariate Analysis of Variance
MHI       Monthly Household Income
OLAP      Online Analytical Processing
PCA       Principal Component Analysis
PLs       Private Labels
RDBMS     Relational Database Management System
RFID      Radio Frequency Identification Device
RFM       Recency, Frequency, Monetary
RIS       Retail Information System
SPSS      Statistical Package for Social Science
SQL       Structured Query Language
VAT       Value Added Tax
VSM       Vector Space Model
LIST OF FIGURES

Figure No.     Figure Description
Figure 3.1     Steps of Knowledge Discovery in Databases
Figure 3.2     Phases of Data Mining Life Cycle
Figure 3.3     Predictive Modeling through Linear Regression
Figure 3.4     Nearest Neighbors for Three Unclassified Records
Figure 3.5     Discovering Clusters and Descriptions in a Database
Figure 3.6     Hierarchical Clustering
Figure 3.7     Decision Tree for Cellular Telephone Industry
Figure 3.8     Structure of a Neural Network
Figure 3.9     A Simplified View of Neural Network
Figure 3.10    Neural Network for Prediction of Loyalty
Figure 4.1     Steps in Factor Analysis
Figure 4.2     Steps for Multiple Regression Analysis
Figure 5.1     Graphical Representation of Cluster Distribution
Figure 5.2     Within Cluster Percentage of Gender
Figure 5.3     Chi-Square - Gender
Figure 5.4     Within Cluster Percentage of Age
Figure 5.5     Chi-Square - Age
Figure 5.6     Within Cluster Percentage of Occupation
Figure 5.7     Chi-Square - Occupation
Figure 5.8     Within Cluster Percentage of Education
Figure 5.9     Chi-Square - Education
Figure 5.10    Within Cluster Percentage of MHI
Figure 5.11    Chi-Square - Income
Figure 5.12    Within Cluster Percentage of Shop-with
Figure 5.13    Chi-Square - Shop-with
Figure 5.14    Within Cluster Percentage of Spend
Figure 5.15    Chi-Square - Spend
Figure 5.16    Within Cluster Percentage of Trips
Figure 5.17    Chi-Square - Trips
Figure 5.18    Relative Importance of Demographic and Behavioral Variables
Figure 5.19    Scree Plot
Figure 5.20    Component Plot in Rotated Space
Figure 5.21    Figurative Description of Store Loyalty - Predictive Model
LIST OF TABLES

Table No.     Table Detail
Table 1.1     Steps in the Evolution of Data Mining
Table 1.2     KDnuggets Polls: Data Mining Software (May 2008)
Table 4.1     Questions to Measure Loyalty
Table 4.2     Summarized Sample Statistics
Table 4.3     Sample Descriptive with Coding
Table 4.4     Chi-Square Test Illustration
Table 4.5     Color Preference by Customers for Car Dealership
Table 4.6     Directions for Setting up Worksheet for Chi-Square
Table 5.1     Auto-Clustering
Table 5.2     Cluster Distribution
Table 5.3     Store Loyalty amongst Surveyed Customers
Table 5.4     Profiling of Cluster by Gender
Table 5.5     Profiling of Cluster by Age
Table 5.6     Profiling of Cluster by Occupation
Table 5.7     Profiling of Cluster by Education
Table 5.8     Profiling of Cluster by Income
Table 5.9     Profiling of Cluster by Shop-with
Table 5.10    Profiling of Cluster by Expenditure
Table 5.11    Profiling of Cluster by Trips
Table 5.12    Summary of Demographic/Behavioral Variables: Sample Distribution and Cluster Membership
Table 5.13    Descriptive Statistics
Table 5.14    Correlation Matrix
Table 5.15    Anti-Image Matrix
Table 5.16    KMO and Bartlett's Test
Table 5.17    Communalities
Table 5.18    Total Variance Explained
Table 5.19    Component Matrix
Table 5.20    Rotated Component Matrix
Table 5.21    Short-Listed Attributes (Factor Loadings above .40)
Table 5.22    Component Transformation Matrix
Table 5.23    Factor Score Coefficient Matrix
Table 5.24    Factor Analysis of Grocery Store Attribute: Interpretation of Factors
Table 5.25    Reliability Analysis of Factors
Table 5.26    Variables Entered/Removed
Table 5.27    Model Summary
Table 5.28    ANOVA
Table 5.29    Regression Coefficients