Customer Satisfaction Analysis


Master Thesis in Statistics, Data Analysis and Knowledge Discovery

Customer Satisfaction Analysis

Laura Funa


LiU-IDA-IDA/STAT-A--11/002--SE

Abstract

The objective of this master thesis is to identify key drivers embedded in customer satisfaction data. The data was collected by a large transportation sector corporation over five years and in four different countries. The questionnaire comprised several different sections of questions, ranging from demographic information to satisfaction attributes concerning the vehicle, the dealer and several problem areas. Various regression, correlation and cooperative game theory approaches were used to identify the key satisfiers and dissatisfiers. The theoretical and practical advantages of using the Shapley value, Canonical Correlation Analysis and Hierarchical Logistic Regression have been demonstrated and applied to market research.


Acknowledgements

This work would not have been completed without the support of many individuals. I would like to thank everyone who has helped me along the way. Particularly: Prof. Anders Nordgaard and Malte Isacsson, for providing guidance, encouragement and support over the course of my master's research. Prof. Anders Grimvall, for serving on my thesis committee and for valuable suggestions. Volvo Car Corporation, for providing the data. Lastly, everyone else without whose support none of this would have been possible.


Table of contents

1 Introduction
  1.1 Background
  1.2 Objective
2 Data
  2.1 Raw data
  2.2 Secondary data
  2.3 Assessment of data quality
    2.3.1 Univariate Analysis of the Satisfaction Attributes
    2.3.2 Univariate Analysis of Problem Areas
3 Methods
  3.1 Kano Modeling
  3.2 Shapley Value Regression
    3.2.1 Assessing Importance in a Regression Model
    3.2.2 Potential, Value and Consistency
    3.2.3 Shapley-based R² Decomposition
    3.2.4 Choosing key drivers
    3.2.5 Trend Analysis
    3.2.6 The time consistent Shapley value
  3.3 Hierarchical Logistic Regression Modeling
    3.3.1 Ordinary logistic regression model
    3.3.2 Hierarchical logistic regression
  3.4 Canonical Correlation Analysis
    3.4.1 Formulation
    3.4.2 Issues and practical usage
4 Computations and Results
  4.1 Shapley Value
    4.1.1 Ranked Satisfiers (related to the satisfaction with the dealer)
    4.1.2 Ranked Satisfiers (related to the satisfaction with the vehicle)
    4.1.3 Ranked Dissatisfiers
    4.1.4 Key attributes identification
  4.2 Time Series and Trend Analysis
  4.3 Hierarchical Logistic Regression: SAS Modeling
  4.4 Canonical Correlation Analysis
5 Discussion and conclusions
  5.1 Proposed further research
    5.1.1 Kernel Canonical Correlation Analysis
    5.1.2 Moving Coalition Analysis
Literature and sources
Appendix A: SAS and R codes
Appendix B: Outputs

Index of tables

Table 1: Datasets Summary
Table 2: Frequencies and proportions of the satisfaction attributes (Country A, Year 2006)
Table 3: Problem areas occurrence (Country A, Year 2006)
Table 4: Dealer Satisfiers, Country A, Years 2006 and 2007 respectively
Table 5: Dealer Satisfiers, Country A, Years 2008 and 2009 respectively
Table 6: Dealer Satisfiers, Country A, Year 2010
Table 7: Vehicle Satisfiers, Country A, Years 2006 and 2007 respectively
Table 8: Vehicle Satisfiers, Country A, Years 2008 and 2009 respectively
Table 9: Vehicle Satisfiers, Country A, Year 2010
Table 10: Vehicle Satisfiers, Country A, Years 2006 and 2007 respectively (respondents with no problems)
Table 11: Vehicle Satisfiers, Country A, Years 2008 and 2009 respectively (respondents with no problems)
Table 12: Vehicle Satisfiers, Country A, Year 2010 (respondents with no problems)
Table 13: Dissatisfiers, Country A, Years 2006 and 2007 respectively
Table 14: Dissatisfiers, Country A, Years 2008 and 2009 respectively
Table 15: Dissatisfiers, Country A, Year 2010
Table 16: Ven problem area sub-categories, Country A
Table 17: Building the GLIMMIX procedure
Table 18: Country A
Table 19
Table 20: Solution for fixed effects
Table 21
Table 22
Table 23

Index of figures

Figure 1: Frequency distribution of variable V191 in Country A, Year 2006
Figure 2: Proportions of problem areas
Figure 3: Kano Model Attributes
Figure 4: Two-level hierarchical regression
Figure 5: Satisfaction Attribute V90, Country A
Figure 6: Noise-Reach table, Country A
Figure 7: Time Series Analysis, Country A
Figure 8: Trend in V1, Country A
Figure 9: Time Series Analysis, Country A (respondents with no problems)
Figure 10: Trend in V1, Country A (respondents with no problems)
Figure 11: Trend in V8, Country A, all respondents vs. only those with no problems
Figure 12: Trend in V10, Country A, all respondents vs. only those with no problems
Figure 13: Trend in V17, Country A, all respondents vs. only those with no problems
Figure 14: Time Series Analysis, problem areas, Country A

Index of equations

Equation 1: Regression Model
Equation 2: R-squared
Equation 3: Nash Equilibrium
Equation 4: Marginal contribution of a player in a game
Equation 5: Potential Function
Equation 6: Differences
Equation 7: Payoff
Equation 8: Shapley Value
Equation 9: Regression model
Equation 10: Variance
Equation 11: Relative contributions
Equation 12: Marginal Effect
Equation 13: Shapley Value R-squared decomposition
Equation 14: Fields R-squared decomposition
Equation 15: Success
Equation 16: Ordinary logistic regression model
Equation 17: Random effects
Equation 18: Fixed effects
Equation 19: CCA parameter

1 Introduction

Predictive analytics relies on statistical techniques deriving from fields such as data mining, modeling and game theory. The main reason for using these is to extract information from large and complex datasets and use it to forecast future trend patterns. In business, predictive models search for patterns and hidden relationships in historical or transactional data to serve as a guide for decision making and for identifying risks and opportunities.

Several data mining techniques have been developed over the years and have shown positive impact in a wide range of business fields. The most well-known applications are in finance (e.g. credit scoring), marketing and fraud detection. Each field by itself offers an enormous number of possibilities where the analysis of large datasets can be exploited in a profitable fashion. In marketing, the main topics covered by predictive analytics include CRM (Customer Relationship Management), cross-selling, customer retention and direct marketing. Achieving these goals is in general based on conducting an appropriate customer analysis, and a large portion of the data required by customer analytics is more often than not acquired through customer satisfaction surveys.

Customer satisfaction is a well-known term in marketing; it indicates how well the products or services provided by a supplier meet customer expectations. In a highly competitive market, companies may take advantage of such information to differentiate or improve their products and/or services in order to increase their market share and customer loyalty. Such data is among the most frequently collected indicators of market perceptions.

The purpose of this master thesis is to develop an appropriate customer satisfaction analysis procedure that will provide an indicator of customer behavior in a highly competitive automotive market. Thus, the aim of the analysis is to find hidden relationships, patterns and trends in the datasets provided. The thesis is divided into five chapters, starting with the objectives and the motivation of the problem addressed, followed by a description and assessment of data quality,

including data sources, raw data and secondary data. The third part consists of the methods used and model building, while the last part focuses on the results and a discussion of the latter. The research is concluded with a critical assessment of the results obtained and of the adequacy of the methods used.

1.1 Background

The research is based on a customer satisfaction survey performed on new car owners (i.e. owners of cars that have been three months in service). The survey was conducted on several different markets and covers different areas of customer characteristics, customer satisfaction and related issues. As cars are consumer products, automotive businesses are driven by customer satisfaction; hence an improvement in consumer insight and information gain through customer data is constantly sought.

It is essential to mention that satisfaction is a very abstract concept: the actual state of satisfaction varies between individuals and between products or services, and depends on several psychological and physical variables. Additional options or alternative products and services available to customers in a particular industry can also be seen as a source of variability in satisfaction levels. The most valuable satisfaction behaviors to investigate are loyalty and recommendation rate.

The main purpose of customer satisfaction analysis is often to understand the impact of explanatory variables on the overall dependent variable. This means that a list of priority items that can be improved needs to be established, since an improvement in any of these will have a positive impact on overall satisfaction or on customer loyalty and retention (Tang & Weiner, 2005). When choosing an appropriate statistical technique it is necessary to have a clear vision of whether the purpose of the analysis is solely exploratory or predictive. Some of the most common customer satisfaction techniques include ordinary least squares, Shapley value regression, penalty & reward analysis, Kruskal's relative importance, partial least squares and logistic regression. Since customer satisfaction studies are usually tracking studies, the results can be monitored over time and allow for trend detection. Moreover, one of the challenges when choosing the methodology and building the model is to ensure that the results are consistent when tracking the market over time (Tang & Weiner, 2005).

1.2 Objective

The main objective of the master thesis is to find appropriate statistical techniques that lend themselves well to customer satisfaction analysis. Furthermore, they should provide tools for identifying key drivers, patterns and relationships among several sets of (dependent and independent) variables, as well as measures of relative importance. More specific objectives of the thesis are: finding an exact measure of the contributions of the explanatory variables to the dependent variable, and identifying the greatest satisfiers and dissatisfiers influencing customer satisfaction with the dealer and with the vehicle; exploring the nature of the satisfaction attributes and evaluating whether it is possible to establish a clean measure of experienced problems and consequently classify them into those that are fixable and those that cannot be repaired but are a matter of customers' personal preferences (i.e. the "annoying" concept); and finally, examining the relationships between two sets of variables (i.e. satisfaction-related problems and satisfaction attributes). The thesis aims to be consistent with the most commonly used customer satisfaction analysis techniques and the available literature. What can and cannot be modeled and predicted needs to be clearly stated at all points.

2 Data

The data used in this research was collected through a customer satisfaction survey among new car owners (i.e. customers who had purchased a new car within the previous three months). The questionnaire was divided into twelve different sections, ranging from personal, demographic questions to questions directly connected to satisfaction with the new car, previous cars and views on the automotive industry. Data on customer satisfaction is often taken as a key performance indicator within business and is often incorporated in balanced scorecards.

An important basic requirement for effective research on customer satisfaction is building an appropriate questionnaire that provides reliable and representative measures. The general guideline is to build questions on whether the product or service has met or exceeded expectations. Expectations, and consequently customer perceptions, are therefore the key factor behind satisfaction.

Questions are based on individual-level perceptions but are usually reported on an aggregate level. According to Batra and Ahtola (Batra & Ahtola, 1990), customers purchase products and services based on two types of benefits: hedonic and utilitarian. The first is connected to the experiential attributes and the latter to the functional attributes of the product. The survey used in this research involved the most common measures of customer satisfaction: sets of statements using the Likert technique and scales (Likert, 1932).

2.1 Raw data

The data provided was based on the survey conducted in four different countries (A, B, C and D) over five years, 2006 to 2010, except for country C, where the survey was conducted every second year (i.e. 2001, 2003, 2005, 2007, 2009).

Table 1: Datasets Summary (one row per country/year combination for countries A, B, C and D, giving the number of variables and the number of recorded responses in each survey)

In total there were … responses and … data-points. The number of variables in each survey ranged from 301 to 426. The results from the development of the methodology are based on a survey¹ in country A that included … responses and 394 variables in the year 2006. The trend analysis used the datasets from all five years. The survey in question yielded … valid responses, representing 76% of all customers who participated. The variables used in the core part of the analysis were 34 satisfaction attributes and 14 problem areas, where each problem area consists in general of 20 sub-categories. The satisfaction attributes were evaluated on a 1 to 10 scale, where 1 represented the worst and 10 the best possible outcome. The problem areas, on the other hand, allowed for several nominal values.

2.2 Secondary data

Since the survey comprised several questions that allowed more than one answer (e.g. the problem areas), the first step of the analysis was to transform these into binary form using dummy variables. However, various variables in the research posed a bigger challenge and required further investigation to decide whether they should be treated as ordinal or interval. Variables ranked on a never/occasionally/sometimes/always scale present a problem regarding the relative placement of the two middle categories; thus Knapp (Knapp, 1990) argues that this produces a less-than-ordinal scale. The controversy arises from key terms such as appropriateness and meaningfulness. Conservative views (Siegel, 1956) are based on the assumption that once the ordinal level has been adopted, inferences are restricted to population medians and non-parametric procedures must be used, hence the power of the statistics is lower. Labovitz (Labovitz, 1967) on the other hand argues that there are no true restrictions on using parametric procedures for ordinal scales, since the assumptions of the validity of the t and F distributions do not involve the type of the scale, which consequently provides statistics of higher power.

¹ The methodology and models developed in this research were then re-applied to the remaining datasets and the results can be found in Appendix C.
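The dummy-variable transformation of multi-answer questions described at the beginning of this section can be sketched in a few lines. The following is an illustrative Python sketch (the thesis itself used SAS and R); the respondent answers are invented, only the problem-area codes follow the thesis's naming:

```python
# Sketch of dummy-coding a multi-answer question: each respondent may tick
# any number of problem areas, and we want one 0/1 column per area.
# The answers below are invented toy data.
responses = {
    1: ["Vel", "Ven"],           # respondent 1 reported two problem areas
    2: [],                       # respondent 2 reported no problems
    3: ["Vel", "Vbr", "Vsw"],
}

# Collect every problem area that occurs, then code 1/0 per respondent.
areas = sorted({area for probs in responses.values() for area in probs})
dummies = {
    rid: {area: int(area in probs) for area in areas}
    for rid, probs in responses.items()
}

print(dummies[1]["Vel"], dummies[2]["Vel"], dummies[3]["Vsw"])  # → 1 0 1
```

The same idea scales directly to the full survey: the binary columns can then be summed per respondent to flag "experienced at least one problem".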

The number of categories building the scale is important too. The remaining variables varied in scale level, and the two types of scales occurring were a 1 to 4 scale and a 1 to 10 scale, where the latter "continuizes" things more than the former. Moreover, there have been several studies (Hausknecht, 1990) on measurement scales in customer satisfaction analysis which attempt to prove the validity of treating an ordinal scale with several categories as interval.

2.3 Assessment of data quality

The quality and the nature of the data provided was first assessed by applying a univariate approach: identifying the distributions, response rates and percentages of missing values. As a last step of this pre-analysis, the most common issues when dealing with customer satisfaction data were pointed out.

2.3.1 Univariate Analysis of the Satisfaction Attributes

Figure 1: Frequency distribution of variable V191 in Country A, Year 2006

Table 2: Frequencies and proportions of the satisfaction attributes (Country A, Year 2006)

Scale               V191     V14      V193     V7       V3
1                   0,41%    0,06%    0,24%    0,07%    0,18%
2                   0,27%    0,04%    0,14%    0,07%    0,16%
3                   0,57%    0,12%    0,22%    0,20%    0,32%
4                   1,17%    0,31%    0,62%    0,63%    0,86%
5                   1,42%    1,01%    1,11%    1,47%    1,72%
6                   5,20%    3,61%    3,51%    5,13%    4,95%
7                   8,71%    8,89%    8,90%    12,11%   11,94%
8                   28,68%   24,65%   25,37%   27,32%   26,21%
9                   29,09%   28,46%   31,62%   28,01%   27,42%
10                  24,48%   32,85%   28,27%   24,99%   26,24%
Total (responses)   94,78%   97,59%   97,53%   97,59%   97,54%
Missing values      5,22%    2,41%    2,47%    2,41%    2,46%

Scale               V6       V8       V17      V23      V1
1                   0,15%    0,11%    0,09%    0,27%    0,48%
2                   0,17%    0,15%    0,05%    0,18%    0,27%
3                   0,31%    0,29%    0,06%    0,33%    0,41%
4                   0,98%    1,00%    0,19%    1,06%    0,75%
5                   2,18%    2,25%    0,82%    2,68%    1,00%
6                   6,51%    7,33%    2,77%    8,44%    2,64%
7                   14,35%   14,31%   8,37%    14,38%   6,75%
8                   27,12%   27,57%   24,16%   25,95%   19,63%
9                   25,39%   26,08%   29,59%   22,60%   28,05%
10                  22,83%   20,90%   33,91%   24,11%   40,03%
Total (responses)   97,36%   96,96%   96,14%   96,99%   94,15%
Missing values      2,64%    3,04%    3,86%    3,01%    5,85%

Scale               V12      V15      V208     V9       V19
1                   0,10%    0,09%    0,27%    0,13%    0,15%
2                   0,06%    0,07%    0,24%    0,08%    0,19%
3                   0,14%    0,16%    0,47%    0,16%    0,34%
4                   0,57%    0,49%    1,41%    0,49%    1,14%
5                   1,38%    1,06%    2,16%    0,92%    1,81%
6                   4,49%    3,65%    5,95%    3,08%    5,31%
7                   11,25%   9,54%    11,64%   8,98%    11,12%
8                   27,18%   25,44%   24,79%   25,16%   25,76%
9                   27,95%   29,00%   25,96%   29,17%   26,68%
10                  26,88%   30,50%   27,11%   31,83%   27,50%
Total (responses)   97,35%   97,52%   94,37%   97,46%   97,41%
Missing values      2,65%    2,48%    5,63%    2,54%    2,59%

Scale               V211     V4       V16      V13      V11
1                   0,52%    0,28%    0,13%    0,07%    0,12%
2                   0,35%    0,19%    0,14%    0,07%    0,10%
3                   0,64%    0,24%    0,28%    0,11%    0,22%
4                   1,61%    0,70%    0,92%    0,45%    0,78%
5                   5,63%    2,56%    2,06%    1,15%    1,72%
6                   10,72%   5,81%    5,86%    3,94%    5,22%
7                   16,59%   12,44%   13,22%   11,09%   13,18%
8                   24,61%   24,94%   27,81%   27,74%   28,08%
9                   19,90%   24,13%   25,38%   27,96%   26,36%
10                  19,44%   28,71%   24,19%   27,43%   24,22%
Total (responses)   90,23%   94,66%   97,29%   97,26%   96,84%
Missing values      9,77%    5,34%    2,71%    2,74%    3,16%

Scale               V20      V26      V10      V22      V2
1                   0,13%    0,10%    0,19%    0,18%    0,14%
2                   0,11%    0,09%    0,16%    0,15%    0,09%
3                   0,22%    0,16%    0,31%    0,28%    0,25%
4                   0,64%    0,77%    0,87%    1,11%    0,73%
5                   1,75%    1,47%    1,37%    1,86%    1,45%
6                   5,34%    5,09%    4,41%    5,67%    4,47%
7                   12,42%   13,01%   10,06%   12,32%   11,42%
8                   27,76%   27,95%   25,13%   27,12%   27,57%
9                   26,63%   25,78%   27,97%   26,06%   27,79%
10                  25,02%   25,57%   29,52%   25,24%   26,09%
Total (responses)   94,84%   97,03%   97,69%   96,91%   97,50%
Missing values      5,16%    2,97%    2,31%    3,09%    2,50%

Scale               V221     V222     V223     V18
1                   0,06%    0,26%    0,19%    0,09%
2                   0,09%    0,27%    0,18%    0,13%
3                   0,15%    0,56%    0,41%    0,23%
4                   0,54%    1,80%    1,19%    0,73%
5                   1,51%    3,21%    2,68%    1,78%
6                   4,45%    8,24%    6,92%    5,37%
7                   10,95%   12,70%   13,33%   12,67%
8                   25,89%   23,91%   26,06%   26,89%
9                   27,23%   23,09%   24,06%   26,05%
10                  29,15%   25,97%   24,97%   26,05%
Total (responses)   97,50%   97,43%   97,16%   97,52%
Missing values      2,50%    2,57%    2,84%    2,48%

Considering the most favorable rating scores, i.e. attributes scored at least "very satisfied" (a score of 7 or above), the above tables illustrate that 77,5% to 96% of the customers were at least very satisfied, depending on the attribute. The lowest satisfaction

score was associated with V202, at 77,5%; however, this still represents a majority attitude. The lowest response rate was associated with the attribute V211, with a missing rate of 9,8%.

2.3.2 Univariate Analysis of Problem Areas

In total, 43,3% of all respondents experienced at least one problem. The tables below give the frequencies of the individual problem areas. The most common problems appear in the Vel category, with a 10,9% occurrence. The least common are Vs problems.

Table 3: Problem areas occurrence (Country A, Year 2006)

Problem Area                 Vp      Ve      Vw      Vb      Vo      Vi      Vel
% in the total population    5,20%   4,24%   1,78%   8,20%   7,61%   8,76%   10,85%

Problem Area                 Ven     Vcl     Vbr     Vsw     Vs      Vex     Vot
% in the total population    5,91%   3,78%   3,57%   2,93%   1,21%   0,88%   1,03%
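The univariate summaries above (per-scale-point proportions, "at least very satisfied" shares and missing rates) can be reproduced with a few lines of code. The following Python sketch uses invented toy ratings, not the survey data (the thesis's own computations were done in SAS and R):

```python
from collections import Counter

# Toy ratings for one satisfaction attribute on the 1-10 scale;
# None marks a missing answer (values are illustrative, not survey data).
ratings = [8, 9, 10, 7, None, 8, 9, 10, 10, None, 6, 8, 9, 7, 8, 10]

n = len(ratings)
counts = Counter(ratings)
missing_share = counts.pop(None, 0) / n   # "Missing values" row in Table 2

# Proportion of each scale point in the total population, as in Table 2.
shares = {score: counts.get(score, 0) / n for score in range(1, 11)}

# Share of respondents who were "at least very satisfied" (score 7 or above).
top_box = sum(shares[s] for s in range(7, 11))

print(f"missing: {missing_share:.1%}, at least very satisfied: {top_box:.1%}")
```

Applying this per attribute and per country/year reproduces the layout of Table 2, with "Total (responses)" being simply one minus the missing share.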

Figure 2 represents the proportion of each problem area.

Figure 2: Proportions of problem areas

A very common challenge when dealing with customer satisfaction data is how to overcome the problem of multicollinearity. It can be controlled and avoided by a well-designed questionnaire; however, in most cases this is difficult to achieve. The attributes measured in the survey were in general highly correlated with each other. An example of such a problem arises when evaluating the dealer where the car was purchased: the dealer's ability to solve problems is highly correlated with the dealer's friendliness.

Another issue relates to the tracking nature of customer satisfaction data. It is challenging to ensure that the results obtained reflect real changes in the market and not just a small number of respondents checking different satisfaction levels (e.g. 8 instead of 9). It is important to note that the percentage of customers who had experienced at least one problem but were still at least very satisfied is 84,5%, a clear majority. Adding this imbalance to the fact that the survey offers only very scarce information on the problem areas, several restrictions may arise when analyzing the latter. Since the objectives of the thesis involve a deep analysis of the experienced problems, more appropriate measures

should be provided by further expansion and development of the "things go wrong" section of the questionnaire.

3 Methods

The main methods used in the research are:

- Kano Modeling: providing a deeper understanding of the customer satisfaction data and of what can be achieved using the available data.
- Shapley Value: overcoming the problem of multicollinearity, providing better regression results and allowing for trend analysis.
- Hierarchical logistic regression: exploring different, hierarchically ranked layers of the data.
- Canonical correlation: analyzing relationships between two different sets of variables.

3.1 Kano Modeling

The Kano theory was developed by Professor Noriaki Kano (Mikulic & Prebežac, 2011) and concerns product development and customer satisfaction. It classifies product attributes into five categories based on customer perceptions: attractive, one-dimensional, must-be, indifferent and reverse. The theory states that the relationship between the performance of a product attribute and the satisfaction level is not necessarily linear: certain attributes can be asymmetrically related to satisfaction levels. These relationships are visually presented in Figure 3.

Figure 3: Kano Model Attributes

An attractive attribute provides satisfaction when it is fully implemented, but its non-fulfillment does not cause dissatisfaction. Must-be attributes, on the other hand, result in dissatisfaction if not fulfilled, while their fulfillment does not increase satisfaction. One-dimensional attributes increase satisfaction when implemented, and dissatisfaction appears if the attribute is not fulfilled. Indifferent attributes do not affect consumer satisfaction in any way, while reverse attributes result in customer dissatisfaction when fulfilled and satisfaction when not fulfilled (e.g. implementing technology that is difficult to understand and complicated to use or maneuver may cause dissatisfaction).

There are several advantages to integrating Kano modeling: the classification of attributes can be used to optimize and improve the products, discover the attractors and develop product differentiation. Moreover, attribute classification provides valuable help in prioritizing requirements and identifying attributes that need attention. An important measure to separate experienced problems with the product into those that can be fixed and those that are a matter of

personal preference, may be introduced by using Kano modeling. The nature of the attractive and must-be attributes would allow applying attributable and relative risk techniques. Attributable risk measures the reduction in dissatisfaction that would be observed if the consumers did not experience a particular problem, compared to the actual pattern. Relative risk is the ratio of the probability of dissatisfaction occurring among the group of consumers that experienced a particular problem to the probability of dissatisfaction occurring among the group that did not experience the problem. However, the data used in this research did not include any additional indicator on problem attributes, hence their classification is problematic.

3.2 Shapley Value Regression

Regression models offer a convenient method for summarizing and achieving two very different goals in data analysis. One is prediction; the other is inference about the interaction between the predictor variables and the outcome variable. Yet regression models do not prove that such relationships exist; they simply summarize the likely effects if the models are as hypothesized (Lipovetsky & Conklin, 2001).

3.2.1 Assessing Importance in a Regression Model

Consider a simple model,

Y = f(X, β)

Equation 1: Regression Model

where all of the predictor variables x are uncorrelated with each other; the standardized regression coefficients (called beta coefficients, β) are then taken as measures of importance. These measure the expected change in Y (i.e. the dependent variable) when x changes by one standard deviation. Having a negative β (for one particular predictor) can present a potential complication. However, since the magnitude of β is what matters and the sign merely represents the direction of the effect, β can be represented by either squaring the values or simply taking the absolute values.
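In the uncorrelated case, the squared standardized coefficients decompose the model R² exactly. The following is a minimal numpy sketch of this (the thesis's computations were done in SAS and R; the data here is constructed, not survey data): two predictors are built to be exactly uncorrelated, and the noise term is orthogonal to both, so the decomposition holds to numerical precision.

```python
import numpy as np

# Two predictors constructed to be exactly uncorrelated (mean 0, equal
# variance); the noise vector e is orthogonal to both, so the squared
# standardized coefficients sum exactly to R². All numbers are invented.
x1 = np.array([1.0, -1.0, 1.0, -1.0])
x2 = np.array([1.0, 1.0, -1.0, -1.0])
e  = np.array([1.0, -1.0, -1.0, 1.0])
y  = 2.0 * x1 + 1.0 * x2 + e

X = np.column_stack([x1, x2])
b, *_ = np.linalg.lstsq(np.column_stack([np.ones(4), X]), y, rcond=None)
y_hat = b[0] + X @ b[1:]

r2 = np.var(y_hat) / np.var(y)            # SS_reg / SS_tot
beta = b[1:] * X.std(axis=0) / y.std()    # standardized (beta) coefficients

print(round(float(r2), 4), round(float(np.sum(beta ** 2)), 4))  # → 0.8333 0.8333
```

Here R² = 5/6, and β₁² + β₂² = 4/6 + 1/6 gives the same value; each squared beta is the share of variance explained by that predictor.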

The sum of the squared standardized coefficients is then equal to the overall R² of the model, where R² (the coefficient of multiple determination) is a measure of the overall quality of the fit of the model (Lipovetsky & Conklin, 2001). Hence, each individual squared coefficient can be interpreted as the percentage of the variance explained by that individual variable.

R² = Σ_i (ŷ_i − ȳ)² / Σ_i (y_i − ȳ)² = SS_reg / SS_tot

Equation 2: R-squared

Nevertheless, the above situation almost never occurs in real data. Consequently, assessing standardized regression coefficients as explained above does not lead to a good indication of the importance of each individual variable: the greater the correlation between the predictor variables, the less meaningful the evaluated coefficients are. For example, two predictors with a correlation of 1 would yield an infinite number of coefficient combinations, each making exactly the same contribution. As a solution to this, I propose a technique used in game theory: the Shapley value.

3.2.2 Potential, Value and Consistency

The Shapley value, a solution concept in cooperative game theory, was introduced by Lloyd Shapley in 1953 (Shapley, 1953). It assigns a unique distribution of the total surplus generated by the coalition of all players, and it produces a unique solution satisfying the general requirements of the Nash equilibrium (i.e. choosing an optimal strategy under uncertainty) (Kuhn & Tucker, 1959). There is always exactly one such allocation procedure.

Nash Equilibrium

The Nash equilibrium is a solution concept in game theory (named after John Forbes Nash, who introduced it). It involves a game of two or more players, where each player is assumed to be aware of the equilibrium strategies x*_{-i} of the other players and makes the best decision they can, taking into consideration the decisions of the remaining players. Moreover, none of the players can gain anything by changing their decision, if the decisions of the others remain

unchanged. The set of strategies chosen under such circumstances, together with the corresponding payoffs, constitutes the Nash equilibrium.

∀i, ∀x_i ∈ S_i, x_i ≠ x_i*:  f_i(x_i*, x*_{-i}) ≥ f_i(x_i, x*_{-i})

Equation 3: Nash Equilibrium

Where:
- (S, f) is a game of n players and S_i is the strategy set of player i
- S = S_1 × S_2 × … × S_n is the set of strategy combinations
- f(x) = (f_1(x), …, f_n(x)) is the payoff function for x ∈ S
- x_i is a strategy of player i, while x_{-i} is the strategy combination of all players except player i

Thus, when each player i chooses strategy x_i, it follows that x = (x_1, …, x_n), and the resulting payoff for player i equals f_i(x). Once player i cannot improve their payoff by changing their strategy while the others keep theirs, the strategy x_i* has been reached. Consequently, the strategy combination x* ∈ S is the Nash equilibrium.

Potential

Theorem 1: There exists a unique real function on games, called the potential, such that the marginal contributions of all players (according to this function) are always efficient. Moreover, the resulting payoff vector is precisely the Shapley value (Hart & Mas-Colell, 1989).

D^i P(N, v) = P(N, v) − P(N \ {i}, v)

Equation 4: Marginal contribution of a player in a game

Where:
- N is a finite set of players
- v is a characteristic function satisfying v(∅) = 0
- (N \ {i}, v) is a subgame
- D^i P(N, v) is the payoff to player i; these marginal contributions together form the payoff vector

Thus, a function P(N, v) is called the potential function if it satisfies, for all games,

Σ_{i∈N} D^i P(N, v) = v(N)

Equation 5: Potential Function

Moreover, the satisfaction of the above condition determines the uniqueness of the potential function. According to Hart & Mas-Colell (Hart & Mas-Colell, 1989), it follows that the potential function is such that the allocation of marginal contributions always adds up exactly to the worth of the grand coalition. This is referred to as efficiency. Furthermore, D^i P(N, v) = Sh_i(N, v), where Sh_i denotes the Shapley value of player i in the game (N, v).

Preservation of differences

Preservation of differences looks at the payoff allocation problem from another angle: what would player i gain if player j were not included, and what would player j get if player i were not included? Hart & Mas-Colell (ibid.) show that one obtains a unique efficient outcome which simultaneously preserves all these differences.

d_ij = x_i(N \ {j}) − x_j(N \ {i})

Equation 6: Differences

Thus,

x_i(N) − x_j(N) = x_i(N \ {j}) − x_j(N \ {i})

Equation 7: Payoff

The above equality has been used by Myerson (Myerson, 1980), and it has been proven that any solution obtained from a potential function satisfies the condition. Hence, any such solution coincides with the Shapley value.

Consistency

An important characterization of the value is its internal consistency property.

Theorem 2: Consider the class of solutions that, for two-person games, divide the surplus equally. Then the Shapley value is the unique consistent solution in this class (Hart & Mas-Colell, 1989).

In general, the consistency requirement stated above may be described as follows. Let Φ be a function that associates a payoff to every player in every game. A reduced game, among any group of players in a game, is defined by giving the payoff prescribed by Φ to the rest of the players. Φ is then consistent if and only if, when applied to any reduced game, it yields the same payoffs as in the original game (Hart & Mas-Colell, 1989).

Value

In regression, the attributes are thought of as players and the total value of the game as the R². The Shapley value of a single attribute j is defined as:

SV_j = Σ_i ω_k [v(M_i^j) − v(M_i^{j(−j)})]

Equation 8: Shapley Value

Where:
- v(M_i^j) is the R² of a model i containing predictor j
- v(M_i^{j(−j)}) is the R² of the same model i without j
- ω_k = k!(n − k − 1)! / n! is a weight based on the total number of predictors (n) and the number of predictors in this model (k)

3.2.3 Shapley-based R² Decomposition

The Shapley value offers a very robust estimate of the relative importance of predictor variables even when there are high levels of correlation and/or skewness in the data. The most common approach to R² decomposition in cases of multicollinearity is stepwise regression and its procedures. However, this method is of an arbitrary nature and does not always lead to efficient conclusions. Moreover, the significance tests do not always allow ranking the independent variables in order of importance (Israeli, 2007). An alternative approach has been proposed by Chantreuil and Trannoy (Chantreuil & Trannoy, 1999), who used the concept of the Shapley value. Shorrocks (Shorrocks, 1999) further argues that Shapley-value-based procedures can be applied in various situations, leading to different results.
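Equation 8 can be applied quite directly by treating the R² of every predictor subset as the characteristic function v. The following is a minimal Python sketch of this idea (the thesis implemented its computations in SAS and R; the function names and the correlated toy data below are invented for illustration):

```python
import itertools
import math

import numpy as np

def r_squared(X, y):
    """R² of an OLS fit of y on the given predictor columns (with intercept)."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, i] for i in range(X.shape[1])])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ b
    return 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def shapley_r2(X, y):
    """Weighted average of each predictor's marginal R² gain over all subsets,
    with the k!(n-k-1)!/n! weights of Equation 8."""
    n = X.shape[1]
    sv = np.zeros(n)
    for j in range(n):
        others = [i for i in range(n) if i != j]
        for k in range(len(others) + 1):
            w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for S in itertools.combinations(others, k):
                cols = list(S)
                sv[j] += w * (r_squared(X[:, cols + [j]], y)
                              - r_squared(X[:, cols], y))
    return sv

# Invented toy data with deliberately correlated predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=200)   # strongly correlated with x1
y = x1 + x2 + rng.normal(size=200)
X = np.column_stack([x1, x2])

sv = shapley_r2(X, y)
# Efficiency: the individual contributions add up to the full-model R².
print(np.isclose(sv.sum(), r_squared(X, y)))  # → True
```

Note that the all-subsets loop grows as 2ⁿ, so for the 34 satisfaction attributes of the survey an exact computation of this kind becomes expensive and sampling or grouping of predictors would be needed in practice.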

While traditional decompositions, such as Fields' (2003) decomposition, can be applied to simple linear regression models and perform well in finding the effects of the explanatory variables, the new (i.e. Shapley value based) approach may also be applied to more complicated regression models. These may include interactions, dummy variables and high multicollinearity between the explanatory variables.

Decomposing R²

Consider a regression model

y = a + Σ_{j=1}^J b_j x_j + e

Equation 9: Regression model

where the total sum of squares (in essence the raw variance of y) can be decomposed into the model sum of squares (SS_reg) and the error sum of squares (SS_error):

Var_tot(y) = Var(ŷ) + Var(e)

Equation 10: Variance

The R² of the regression is then taken as previously stated:

R² = SS_reg / SS_tot

Following the theorem of Mood, Graybill and Boes (Mood et al., 1974), the relative contributions may be stated as:

Var(y) = Σ_{j=1}^J Cov(b_j x_j, y) + Cov(e, y)

Equation 11: Relative contributions

Dividing through by Var(y), it follows that:

R² = Σ_{j=1}^J b_j Cov(x_j, y) / Var(y) = 1 - Cov(e, y) / Var(y)
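The covariance identity behind Equation 11 holds exactly in-sample for an OLS fit and can be verified numerically. A Python sketch on simulated data (illustrative only; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)           # simulated data for illustration
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0*x1 - 1.0*x2 + rng.normal(size=n)

# OLS fit with intercept.
X = np.column_stack([np.ones(n), x1, x2])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
a, b1, b2 = coef
resid = y - X @ coef

# R^2 computed the usual way, as 1 - SS_error / SS_tot.
r2 = 1 - resid.var() / y.var()

# Equation 11 divided by Var(y): sum_j b_j Cov(x_j, y) / Var(y) equals R^2 ...
decomp = (b1*np.cov(x1, y)[0, 1] + b2*np.cov(x2, y)[0, 1]) / y.var(ddof=1)
assert abs(decomp - r2) < 1e-8
# ... and so does 1 - Cov(e, y) / Var(y), since Cov(e, y) = Var(e) for OLS.
assert abs((1 - np.cov(resid, y)[0, 1] / y.var(ddof=1)) - r2) < 1e-8
```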

Continuing from the above equation, the explanatory variables can be ranked according to their importance. However, this fails to account for probable correlation between the contribution of an individual explanatory variable and that of the remaining variables. The Shapley decomposition procedure instead requires the contribution of a variable to equal its marginal effect, which can be expressed as:

M_k^S = R²(y = a + Σ_{j∈S} b_j x_j + b_k x_k + e) - R²(y = a* + Σ_{j∈S} b_j* x_j + e*)

Equation 12: Marginal Effect

where S is a subgroup of explanatory variables not including variable k. Taking a simple example, y = a + b_1 x_1 + b_2 x_2 + e, the difference between the two decompositions may be seen from the following.

Shapley decomposition:

C_1 = ½ [R²(y = a + b_1 x_1 + b_2 x_2 + e) - R²(y = a** + b_2** x_2 + e**)] + ½ R²(y = a* + b_1* x_1 + e*)
C_2 = ½ [R²(y = a + b_1 x_1 + b_2 x_2 + e) - R²(y = a* + b_1* x_1 + e*)] + ½ R²(y = a** + b_2** x_2 + e**)

Equation 13: Shapley Value R-squared decomposition

Fields decomposition:

C_1 = b_1 Cov(x_1, y) / Var(y) = [b_1² Var(x_1) + b_1 b_2 Cov(x_1, x_2)] / Var(y)
C_2 = b_2 Cov(x_2, y) / Var(y) = [b_2² Var(x_2) + b_1 b_2 Cov(x_1, x_2)] / Var(y)

Equation 14: Fields R-squared decomposition

Of special interest when comparing the two decompositions are models including high multicollinearity. This issue is particularly problematic for Fields' decomposition because it uses the estimated coefficients: the estimated variances of these will be large, and consequently the estimated coefficients will deviate largely from the population
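The two decompositions can be compared on simulated data with deliberately collinear predictors. This Python sketch (illustrative only; not the thesis's computations) implements both Equation 13 and the Fields shares: both sum exactly to the full R², but only the Shapley shares remain nearly equal for near-duplicate predictors, while the Fields shares inherit the instability of the coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)            # simulated, strongly correlated predictors
n = 1000
x1 = rng.normal(size=n)
x2 = 0.95*x1 + 0.1*rng.normal(size=n)     # high multicollinearity on purpose
y = x1 + x2 + rng.normal(size=n)

def r2(*cols):
    # R^2 of an OLS fit on the given predictors (with intercept).
    X = np.column_stack([np.ones(n)] + list(cols))
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

# Shapley decomposition (Equation 13): average marginal R^2 over both orders.
C1_sh = 0.5*(r2(x1, x2) - r2(x2)) + 0.5*r2(x1)
C2_sh = 0.5*(r2(x1, x2) - r2(x1)) + 0.5*r2(x2)

# Fields decomposition (Equation 14): C_j = b_j * Cov(x_j, y) / Var(y).
X = np.column_stack([np.ones(n), x1, x2])
_, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
C1_f = b1*np.cov(x1, y)[0, 1] / y.var(ddof=1)
C2_f = b2*np.cov(x2, y)[0, 1] / y.var(ddof=1)

assert abs((C1_sh + C2_sh) - r2(x1, x2)) < 1e-9   # both sum to the full R^2
assert abs((C1_f + C2_f) - r2(x1, x2)) < 1e-9
assert abs(C1_sh - C2_sh) < 0.05                  # near-equal Shapley shares
```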

coefficients. Moreover, a small change in the model can result in a large change in the estimated coefficients. In contrast, the Shapley-based decomposition uses the marginal contributions of a variable from all sequences. The value of the contribution will be high or low depending on whether the variables with which the variable in question is correlated are already included in the model. Consequently, two strongly correlated variables will end up with similar contributions.

Israeli (2007) further argues that cases where non-linear effects of a variable are included in the regression model, and models where interacting variables are introduced, can be treated similarly. For Fields' decomposition there is no guidance on how the contribution should be divided in such cases, while this represents no problem for the Shapley decomposition.

Choosing key-drivers

Up to this point, a method that successfully measures the relative importance of the attributes in the model has been established. The following analytical design is proposed to effectively identify the key dissatisfiers (i.e. attributes that need attention). The notation used includes:

P(D): probability of dissatisfaction
P(F): probability of failure on any of the independent attributes
P(D|F): conditional probability of dissatisfaction among the failed
P(D|F'): conditional probability of dissatisfaction among the non-failed
P(F|D): conditional probability of failure among the dissatisfied (the reach value)
P(F|D'): conditional probability of failure among the non-dissatisfied (the noise value)

In general, values on the bottom levels of the ordinal satisfaction scale (less than 5) indicate dissatisfaction (D), and an identified problem corresponds to failure (F). The opposite events, non-dissatisfaction and non-failure, are denoted D' and F' respectively.
To identify the attributes that need attention, it is necessary to find the maximum value of:

Success = Reach - Noise = P(F|D) - P(F|D')

Equation 15: Success
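The reach/noise/success procedure can be sketched in a few lines of Python. The binary indicators below are simulated for illustration (they are not the survey data), and the attribute columns are assumed to be already ordered by their Shapley values:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
# Simulated dissatisfaction flag D and, per attribute, a failure flag F.
# Each attribute fails with a 5% base rate plus an extra rate q among the
# dissatisfied; q falls with rank, so later attributes add mostly noise.
D = rng.random(n) < 0.3
probs = [(0.05, 0.5), (0.05, 0.3), (0.05, 0.1), (0.05, 0.0)]
F = np.column_stack([(rng.random(n) < p) | (D & (rng.random(n) < q))
                     for p, q in probs])

def success(k):
    # Reach = P(failed on any of the first k attributes | dissatisfied),
    # Noise = the same probability among the non-dissatisfied (Equation 15).
    any_fail = F[:, :k].any(axis=1)
    reach = any_fail[D].mean()
    noise = any_fail[~D].mean()
    return reach - noise

scores = [success(k) for k in range(1, F.shape[1] + 1)]
# Cut where the added noise overtakes the added reach: success is maximal.
best_k = 1 + int(np.argmax(scores))
```

The attributes up to `best_k` form the final set of key dissatisfiers; adding any further attribute lowers success.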

This is a measure of the prevalence of failed respondents among those who are dissatisfied, compared with failed respondents among those who are not dissatisfied. Consider a situation where all the attributes are ordered by their Shapley values in descending order and the corresponding reach and noise values are given. According to Conklin and Lipovetsky (Conklin & Lipovetsky, 2004), adding the second-ranked attribute to the model along with the first one will increase the reach function (i.e. failure on either of the two attributes accounts for a larger share of the dissatisfied customers). However, the noise function increases correspondingly (i.e. among the non-dissatisfied ones). Adding more attributes follows the same pattern. In general, reach ensures that a large part of the total number of dissatisfied customers is taken into consideration (and needs to be maximized), while a large noise value would mean focusing on problems that are not actual causes of dissatisfaction (Conklin & Lipovetsky, 2004). Once the added noise overwhelms the added reach when including the next attribute into the model, success begins to decrease. At that point the final set of key dissatisfiers is defined (Conklin & Lipovetsky, 2004).

3.3 Trend Analysis

Using the Shapley value as the measure of importance allows us to track the market over time. The differences between two waves are due to actual changes in the market.

The time consistent Shapley value

The Shapley value is one of the most commonly used sharing mechanisms in static cooperative games with transferable payoffs (Yeung, 2010). The time-consistency property of the Shapley value means that if one renegotiates the agreement at any intermediate instant of time, assuming that cooperation has prevailed from the initial date until that instant, then one obtains the same outcome (Petrosjan & Zaccour, 2001).
Thus, this property allows us to compare the marginal contribution of each satisfaction attribute over time.

3.4 Hierarchical Logistic Regression Modeling

A hierarchical logistic regression model is proposed to examine data with a group structure and a binary response variable. The group structure is usually characterized by two levels, micro and macro. The structure is visually presented in figure 4.

Figure 4: Two-level hierarchical regression

The same predictors are used in each context, but the micro coefficients are allowed to vary over contexts. At the first (micro) level, an ordinary logistic regression model is applied. At the second (macro) level the micro coefficients are treated as functions of macro predictors. A Bayes estimation procedure is used to estimate the micro and macro coefficients. The variance components of the model represent within- and between-macro variance, and an algorithm for finding the maximum likelihood estimates of the covariance components is proposed. Here, the make-model of the car is viewed as the macro level and the individual cars as the micro level. Dai, Li and Rocke (Dai et al., NN) propose the following procedure.

Ordinary logistic regression model

Let y be a binary outcome variable (i.e. the customer is satisfied or dissatisfied) that follows a Bernoulli distribution, y ~ Bin(1, π), and let x be a car-level predictor. Then the model can be written as:

y_ij = π_ij + e_ij
logit(π_ij) = log(π_ij / (1 - π_ij)) = α + β x_ij

Equation 16: Ordinary logistic regression model

where
- i = 1, …, I_j is the car-level indicator,
- j = 1, …, J is the make-model-level indicator, and
- π_ij is the probability of dissatisfaction for car i within make-model j, conditional on x.

The assumptions made in this model are that the micro-level random errors e_ij are independent, with moments E(e_ij) = 0 and Var(e_ij) = π_ij (1 - π_ij).

Hierarchical logistic regression

Extending the ordinary model to account for effects at the second (macro) level may be done by including design variables (dummy variables). Each second-level unit (i.e. each make-model) has its own intercept in the model, and these intercepts are used to measure the differences between make-models:

logit(π_ij) = α_j + β x_ij

where α_j is the make-model intercept and its effect can be either fixed or random (Demidenko, 2004). For simplicity it is possible to treat the effects as random and re-write the model as follows:

logit(π_ij) = α_j + β x_ij, where α_j = α + u_j

Equation 17: Random effects

It is then possible to add second-level predictors. The above equation will therefore be extended to:

logit(π_ij) = α_j + β x_ij, where α_j = α + γ z_j + u_j

Equation 18: Fixed effects

where the added term γ is a fixed effect and z is the second-level predictor. Using the same predictors, the model can be extended further to investigate possible cross-level interactions. The algorithm can be applied using the SAS procedure PROC GLIMMIX.
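The two-level model of Equations 16 to 18 can be simulated to check that the micro coefficient β is recoverable from grouped data. The Python sketch below uses assumed parameter values and is not the PROC GLIMMIX fit; it only illustrates the data-generating model and a crude moment-based estimate of β.

```python
import numpy as np

rng = np.random.default_rng(4)
# Assumed "true" parameters (illustration only, not estimates from the thesis).
alpha, gamma, beta, sigma_u = -1.0, 0.5, 0.8, 0.3
J, n_per = 30, 4000                     # make-models and cars per make-model

z = rng.normal(size=J)                  # macro (make-model level) predictor
u = rng.normal(scale=sigma_u, size=J)   # make-model random effect
a_j = alpha + gamma*z + u               # Equation 18: alpha_j = alpha + gamma*z_j + u_j

x = rng.integers(0, 2, size=(J, n_per))            # binary micro (car level) predictor
p = 1 / (1 + np.exp(-(a_j[:, None] + beta*x)))     # logit(pi_ij) = alpha_j + beta*x_ij
y = rng.random((J, n_per)) < p                     # Bernoulli outcomes

def logodds(flags):
    r = flags.mean()
    return np.log(r / (1 - r))

# Within each make-model, the empirical log-odds difference between x=1 and
# x=0 cars estimates beta; averaging over make-models pools the estimates.
betas = [logodds(y[j][x[j] == 1]) - logodds(y[j][x[j] == 0]) for j in range(J)]
beta_hat = float(np.mean(betas))
```

With 4000 cars per make-model the pooled estimate lands close to the assumed β = 0.8.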

3.5 Canonical Correlation Analysis

Canonical correlation analysis was introduced by Harold Hotelling (Johnson & Wichern, 2001) and is a way of exploring cross-covariance matrices. Consider two sets of variables, x_1, …, x_n and y_1, …, y_m, and assume there are correlations among these variables. Canonical correlation analysis then finds the linear combinations of the x's and the y's which have maximum correlation with each other.

Formulation

Given the vectors X = (x_1, …, x_n) and Y = (y_1, …, y_m), let

Σ_XX = cov(X, X),  Σ_YY = cov(Y, Y)  and  Σ_XY = cov(X, Y)

The quantity to maximize over the weight vectors a and b is

ρ = a'Σ_XY b / sqrt(a'Σ_XX a · b'Σ_YY b)

Equation 19: CCA parameter

The canonical variables are then defined by U = a'X and V = b'Y.

Issues and practical usage

The main benefit of canonical correlation analysis is that it differs from the other (appropriate) multivariate techniques, which impose very rigid restrictions and are generally believed to provide results of higher quality. However, for the purpose of this research, and when dealing with this type of data, the fact that canonical correlation places the fewest restrictions on the data makes it the most appropriate and powerful multivariate technique. It may be seen as a generalization of multiple linear regression.
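A compact way to compute the canonical correlations of Equation 19 is through the singular values of the whitened cross-covariance matrix. A Python sketch on simulated data sharing one latent factor (illustrative only; the thesis's CCA is run in SAS):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
# Two variable sets linked through one shared latent factor f.
f = rng.normal(size=n)
X = np.column_stack([f + rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([f + rng.normal(size=n), rng.normal(size=n),
                     rng.normal(size=n)])

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sxx, Syy = np.cov(Xc.T), np.cov(Yc.T)
Sxy = (Xc.T @ Yc) / (n - 1)

def inv_sqrt(S):
    # Symmetric inverse square root via the eigendecomposition.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w**-0.5) @ V.T

# Canonical correlations = singular values of Sxx^{-1/2} Sxy Syy^{-1/2};
# the corresponding singular vectors give the weight vectors a and b.
rho = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy), compute_uv=False)
```

Here x_1 = f + noise and y_1 = f + noise correlate at about 0.5, and the first canonical correlation can do no worse than that single pair; the second one, driven by pure noise, stays near zero.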

Variables included in the analysis should be on a ratio or interval scale; however, nominal or ordinal variables can be used after converting them to sets of dummy variables. Even though testing the significance of the canonical correlations requires the data to be multivariate normal, the technique performs well for descriptive purposes even if this requirement is not fulfilled. Hair (Hair et al., 1998) discusses the flexibility of canonical correlation and its advantages, particularly when the dependent and explanatory variables can be either metric or non-metric. Hence, the application is broadly consistent with the existing literature.

4 Computations and Results

The very first step of the analysis was to use the SAS statistical software to transform the variables that allowed more than one answer (e.g. problem areas) into binary form by adding dummy variables. 2

4.1 Shapley Value

I used the R statistical language, more specifically the package relaimpo (Relative Importance for Linear Regression in R). This package implements six different metrics for assessing the relative importance of predictors in the linear model. Moreover, it offers exploratory bootstrap confidence intervals (Journal of Statistical Software, 2006). For the purpose of this research, three metrics are particularly useful: lmg, first and last, described in the following.

lmg: these are the Shapley values. The metric is a decomposition of R² into non-negative contributions that automatically sum to the total R². It is the recommended metric for calculating relative importance, since it uses both the direct effect of a predictor and its effects adjusted for the other predictors in the model.

first: these are the univariate R² values from regression models with one predictor only. They express what each predictor individually is able to explain. If the predictors are correlated, the sum of all firsts will be well above the overall R² of the model.

2 See Appendix A for SAS codes

last: these express what each predictor is able to explain in addition to all the other predictors. The values represent the increase in R² when the specific predictor is added to the model. In case of correlation among the predictors, summing the lasts will not add up to the overall R².

A potential drawback is computational difficulty; hence sampling of attributes is necessary. Theil (Theil, 1987) suggests that an information measure may be introduced, and thus an information coefficient was used as a pre-analysis step. The information coefficient is a measure for evaluating the quality and usefulness of attributes. Consequently, 20 vehicle-related attributes were chosen in each dataset.

The following analysis is based on the R output 3 and includes the relative importance of 15 satisfaction attributes regarding the dealer where the vehicle was purchased, followed by 20 attributes regarding the vehicle, both ranging over 4 years.

3 See Appendix B
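The three relaimpo metrics can be mimicked in plain Python to make their differences concrete (simulated data; this is not the relaimpo implementation itself). Note how lmg sums exactly to the full R², while the firsts overshoot and the lasts undershoot it when predictors are correlated.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(6)
n, p = 300, 3
X = rng.normal(size=(n, p))
X[:, 1] += X[:, 0]                      # correlated predictors on purpose
y = X[:, 0] + 0.5*X[:, 1] + rng.normal(size=n)

def r2(cols):
    # R^2 of an OLS fit on the given predictor columns (empty -> 0).
    cols = list(cols)
    if not cols:
        return 0.0
    M = np.column_stack([np.ones(n), X[:, cols]])
    res = y - M @ np.linalg.lstsq(M, y, rcond=None)[0]
    return 1 - res.var() / y.var()

full = r2(range(p))
first = [r2([j]) for j in range(p)]                          # univariate R^2
last = [full - r2([k for k in range(p) if k != j]) for j in range(p)]

def lmg(j):
    # Average the R^2 gained by adding j, over all orderings of predictors.
    gains = []
    for order in permutations(range(p)):
        pre = list(order[:order.index(j)])
        gains.append(r2(pre + [j]) - r2(pre))
    return float(np.mean(gains))

lmgs = [lmg(j) for j in range(p)]
assert abs(sum(lmgs) - full) < 1e-9     # lmg sums to the overall R^2
assert sum(first) > full                # correlated firsts overshoot
assert sum(last) < full                 # lasts undershoot
```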

38 4.1.1 Ranked Satisfiers (related to the satisfaction with the dealer) Figure 5 is illustrating the frequency distribution of the response variable. Figure 5: Satisfaction Attribute V90, Country A, Year 2006 Tables 4 to 6 are displaying the lmg metrics of the attributes regarding the satisfaction with the dealer and are ordered according to their relative importance. Table 4: Dealer Satisfiers, Country A, Years 2006 and 2007 respectively lmg RI % lmg RI % V91 0, ,52% V91 0, ,10% V94 0, ,89% V94 0, ,61% V103 0, ,37% V103 0, ,89% V98 0, ,53% V198 0, ,80% V93 0, ,23% V93 0, ,44% V101 0, ,96% V101 0, ,92% V95 0, ,50% V95 0, ,51% V97 0, ,29% V97 0, ,12% V102 0, ,90% V96 0, ,72% V96 0, ,59% V102 0, ,62% V99 0, ,57% V99 0, ,41% V100 0, ,83% V92 0, ,12% V92 0, ,82% V100 0, ,72% 31

39 Table 5: Dealer Satisfiers, Country A, Years 2008 and 2009 respectively lmg RI % lmg RI % V91 0, ,76% V91 0, ,84% V94 0, ,89% V94 0, ,67% V103 0, ,75% V103 0, ,22% V98 0, ,55% V98 0, ,45% V93 0, ,52% V93 0, ,36% V101 0, ,11% V101 0, ,04% V95 0, ,54% V95 0, ,57% V97 0, ,28% V97 0, ,14% V102 0, ,96% V102 0, ,74% V96 0, ,62% V96 0, ,62% V99 0, ,43% V99 0, ,48% V92 0, ,99% V92 0, ,19% V100 0, ,61% V100 0, ,70% Table 6: Dealer Satisfiers, Country A, Year 2010 Lmg RI% V91 18,78% 18,78% V94 10,01% 10,01% V103 8,75% 8,75% V98 8,35% 8,35% V93 7,49% 7,49% V101 7,21% 7,21% V95 6,77% 6,77% V102 6,12% 6,12% V97 6,01% 6,01% V96 5,50% 5,50% V92 5,21% 5,21% V99 5,18% 5,18% V100 4,63% 4,63% 32

40 4.1.2 Ranked Satisfiers (related to the satisfaction with the vehicle) Tables 7 to 9 are illustrating satisfaction attributes regarding the vehicle and are ordered according to their relative importance. Table 7: Vehicle satisfiers, Country A, Years 2006 and 2007 respectively lmg RI % lmg RI % V1 0, ,53% V1 0, ,12% V2 0, ,93% V2 0, ,53% V3 0, ,65% V21 0, ,43% V4 0, ,24% V3 0, ,81% V5 0, ,02% V7 0, ,82% V6 0, ,98% V11 0, ,69% V7 0, ,90% V6 0, ,58% V8 0, ,52% V9 0, ,52% V9 0, ,37% V5 0, ,43% V10 0, ,24% V8 0, ,33% V11 0, ,17% V13 0, ,09% V12 0, ,99% V10 0, ,01% V13 0, ,87% V12 0, ,96% V14 0, ,84% V25 0, ,87% V15 0, ,75% V15 0, ,74% V16 0, ,45% V17 0, ,71% V17 0, ,42% V14 0, ,59% V18 0, ,30% V16 0, ,53% V19 0, ,10% V22 0, ,35% V20 0, ,74% V20 0, ,89% 33

41 Table 8: Vehicle Satisfiers, Country A, Years 2008 and 2009 respectively lmg RI % lmg RI % V1 0, ,68% V1 0, ,19% V23 0, ,18% V23 0, ,70% V2 0, ,15% V25 0, ,52% V21 0, ,71% V21 0, ,44% V15 0, ,97% V7 0, ,39% V7 0, ,96% V3 0, ,20% V3 0, ,93% V4 0, ,67% V24 0, ,76% V11 0, ,54% V4 0, ,52% V9 0, ,41% V8 0, ,42% V10 0, ,40% V10 0, ,35% V8 0, ,26% V9 0, ,27% V13 0, ,17% V11 0, ,20% V12 0, ,10% V17 0, ,91% V17 0, ,06% V13 0, ,89% V14 0, ,06% V5 0, ,79% V26 0, ,91% V14 0, ,76% V15 0, ,84% V25 0, ,62% V24 0, ,79% V12 0, ,57% V22 0, ,72% V16 0, ,35% V16 0, ,65% 34

42 Table 9: Vehicle Satisfiers, Country A, Year 2010 lmg RI % V27 0, ,28% V1 0, ,91% V2 0, ,71% V21 0, ,69% V7 0, ,80% V4 0, ,62% V8 0, ,51% V3 0, ,41% V9 0, ,02% V11 0, ,88% V13 0, ,74% V17 0, ,74% V10 0, ,71% V12 0, ,63% V15 0, ,52% V14 0, ,48% V25 0, ,39% V22 0, ,37% V16 0, ,36% V19 0, ,21% 35

43 Among customers that did not experience any problems The follow up analysis took a closer look on the customers, who did not experience any problems and compared the obtained relative importances to those obtained in the previous section where all the customers were included in the analysis. Tables 10 to 12 are displaying the satisfaction attributes regarding the vehicle, ordered according to their relative importance. Table 10: Vehicle Satisfiers, Country A, Years 2006 and 2007 respectively (respondents with no problems) lmg RI % lmg RI % V7 0, ,93% V14 0, ,26% V14 0, ,71% V2 0, ,77% V2 0, ,61% V7 0, ,75% V1 0, ,36% V11 0, ,66% V8 0, ,88% V1 0, ,62% V3 0, ,57% V8 0, ,40% V11 0, ,19% V3 0, ,25% V6 0, ,03% V10 0, ,11% V13 0, ,93% V13 0, ,01% V10 0, ,78% V6 0, ,69% V5 0, ,65% V9 0, ,61% V9 0, ,63% V25 0, ,55% V12 0, ,40% V29 0, ,54% V16 0, ,30% V5 0, ,51% V28 0, ,23% V22 0, ,43% V15 0, ,20% V12 0, ,36% V26 0, ,18% V26 0, ,34% V22 0, ,96% V15 0, ,08% V17 0, ,91% V17 0, ,04% V4 0, ,58% V30 0, ,03% 36

44 Table 11: Vehicle Satisfiers, Country A, Years 2008 and 2009 respectively (respondents with no problems) lmg RI % lmg RI % V2 0, ,51% V14 0, ,20% V7 0, ,61% V17 0, ,06% V14 0, ,51% V2 0, ,24% V10 0, ,77% V10 0, ,61% V8 0, ,38% V1 0, ,51% V3 0, ,28% V11 0, ,44% V1 0, ,02% V31 0, ,35% V11 0, ,99% V8 0, ,20% V31 0, ,84% V13 0, ,02% V26 0, ,83% V3 0, ,98% V9 0, ,80% V26 0, ,63% V12 0, ,75% V16 0, ,61% V17 0, ,72% V17 0, ,42% V13 0, ,68% V21 0, ,41% V21 0, ,51% V25 0, ,40% V16 0, ,22% V28 0, ,38% V25 0, ,17% V9 0, ,34% V15 0, ,11% V12 0, ,26% V28 0, ,98% V15 0, ,88% V27 0, ,32% V27 0, ,07% 37

45 Table 12: Vehicle Satisfiers, Country A, Year 2010 (respondents with no problems) lmg RI % V2 0, ,60% V7 0, ,50% V14 0, ,06% V8 0, ,77% V1 0, ,42% V31 0, ,33% V11 0, ,20% V13 0, ,11% V10 0, ,08% V3 0, ,95% V21 0, ,67% V170 0, ,56% V9 0, ,56% V26 0, ,54% V25 0, ,53% V22 0, ,49% V16 0, ,48% V15 0, ,18% V12 0, ,18% V27 0, ,80% From the above tables it is possible to notice that the contributions of the satisfaction attributes are very close in terms of importance. A number of new attributes, which were previously less important, entered the new model (e.g. V28, V29, V30 and V31). The importance of the attribute V14 increased greatly and is appearing on the top three list each year. 38

46 4.1.3 Ranked Dissatisfiers In contrast to the previous section, this part is focusing on the identification of the greatest dissatisfier. The Shapley value was calculated for all experienced problem areas, followed by analysis of problems in each problem area (i.e. sub-categories). Tables 13 to 15 are illustrating the problem areas ranked according to their relative importance. Table 13: Dissatisfiers, Country A, Year 2006 and 2007 respectively lmg RI % lmg RI % Ven 0, ,21% Ven 0, ,28% Vb 0, ,14% Vc 0, ,79% Vc 0, ,02% Vb 0, ,36% Vel 0, ,85% Vel 0, ,00% Vi 0, ,04% Vi 0, ,45% Vo 0, ,22% Vo 0, ,12% Vsw 0, ,07% Vsw 0, ,65% Vbr 0, ,56% Vw 0, ,15% Ve 0, ,22% Vbr 0, ,86% Vp 0, ,98% Vs 0, ,61% Vs 0, ,37% Vp 0, ,27% Vw 0, ,10% Ve 0, ,54% Vex 0, ,36% Vex 0, ,04% Vot 0, ,85% Vot 0, ,88% 39

47 Table 14: Dissatisfiers, Country A, Year 2008 and 2009 respectively lmg RI % lmg RI % Ven 0, ,25% Ven 0, ,43% Vb 0, ,03% Vc 0, ,10% Vc 0, ,33% Vb 0, ,25% Vel 0, ,32% Vel 0, ,86% Vi 0, ,76% Vi 0, ,95% Vo 0, ,99% Vs 0, ,56% Vw 0, ,20% Vbr 0, ,01% Vbr 0, ,85% Vo 0, ,97% Vp 0, ,59% Ve 0, ,89% Vs 0, ,21% Vsw 0, ,75% Vsw 0, ,93% Vp 0, ,73% Ve 0, ,30% Vw 0, ,76% Vex 0, ,58% Vex 0, ,57% Vot 0, ,66% Vot 0, ,17% Table 15: Dissatisfiers, Country A, Year 2010 lmg RI % Ven 0, ,29% Vc 0, ,62% Vb 0, ,94% Vel 0, ,70% Vi 0, ,56% Vo 0, ,88% Vbr 0, ,99% Vs 0, ,93% Vsw 0, ,12% Vp 0, ,93% Ve 0, ,63% Vw 0, ,83% Vex 0, ,49% Vot 0, ,08% 40

48 The analysis was then applied to sub-categories in order to identify the absolute dissatisfier. Table 16: Ven problem area sub-categories, Country A, Year 2006 lmg RI % Ve4 0, ,40% Ve1 0, ,44% Ve7 0, ,99% Ve98 0, ,51% Ve5 0, ,15% Ve8 0, ,19% Ve6 0, ,31% Ve18 0, ,92% Ve15 0, ,86% Ve9 0, ,83% Ve19 0, ,68% Ve2 0, ,63% Ve11 0, ,56% Ve16 0, ,34% Ve10 0, ,87% Ve27 0, ,77% Ve26 0, ,65% Ve22 0, ,36% Ve12 0, ,30% Ve14 0, ,10% Ve17 0, ,09% Ve3 0, ,05% 41

4.1.4 Key attributes identification

Figure 6: Noise-Reach table, Country A, Year 2006

Figure 6 (above) illustrates the key attributes identification. The problem areas are ranked according to their Shapley values, and reach and noise are calculated according to equation 15. Once the added noise overtakes the added reach, the cutting point is found. All problem areas with a corresponding success value less than 0 are unimportant.

4.2 Time Series and Trend Analysis

Time series analysis, aimed at detecting possible trends in relative importance, was applied to those satisfaction attributes (in relation to the vehicle) that recurred in the model over the 5 years. This is illustrated in figure 7.

Figure 7: Time Series Analysis, Country A

According to the above chart, the relative importance of satisfaction attribute V1 shows the most fluctuation over time, while the remaining attributes are fairly stable. Figure 8 shows a fitted linear trend line, which illustrates the changes in satisfaction attribute V1 over the five consecutive years of study. The R² measures how well the trend line fits the data; its value of 0,8153 confirms a fairly good fit of the line to the data. 4

Figure 8: Trend in V1, Country A

4 Trend lines fitted to the remaining variables are displayed in Appendix B.
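Fitting such a linear trend and computing its R² takes only a few lines. The five-wave series below is invented for illustration (it is not the thesis data):

```python
import numpy as np

# Hypothetical relative-importance values over five survey waves.
years = np.arange(2006, 2011)
ri = np.array([0.19, 0.17, 0.16, 0.13, 0.12])

# Least-squares linear trend and its coefficient of determination.
slope, intercept = np.polyfit(years, ri, 1)
fitted = slope*years + intercept
r2 = 1 - ((ri - fitted)**2).sum() / ((ri - ri.mean())**2).sum()
```

For this near-linear decline the fit is tight (R² above 0.9) with a negative slope, which is the kind of pattern the trend lines in figures 8 and 10 summarize.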

Since there were significant differences in the relative importance of the attributes between the analysis including all respondents and the analysis restricted to respondents who did not experience any problems, time series analysis was applied to the latter group as well. Figure 9 shows the changes in relative importance of the attributes that were continuously included over all five years. 5

Figure 9: Time Series Analysis, Country A (respondents with no problems)

Trend analysis was then applied to the same satisfaction attribute (i.e. V1). While in the previous case (where all respondents were included in the analysis) the linear trend provided a good fit, here a better option (with R² = 0,8429) was a polynomial trend.

5 The trend analysis graphs for the remaining attributes are in Appendix B

Figure 10: Trend in V1, Country A (respondents with no problems)

Several attributes appeared in both analyses (i.e. when all respondents were included and when only those who reported no problems were considered). However, there are differences to note when comparing the trends over the five years. This is illustrated in Figure 11.

Figure 11: Trend in V8, Country A, all respondents vs. only those with no problems

While satisfaction attribute V8 still follows a rather similar pattern, a very big difference can be noticed for attribute V10 in the following figure (figure 12).

Figure 12: Trend in V10, Country A, all respondents vs. only those with no problems

Figure 13: Trend in V17, Country A, all respondents vs. only those with no problems

The trend pattern of attribute V17 shows the expected similarities. Since the perception of this particular attribute is directly linked to whether a certain problem occurred (especially Ven, which also has the greatest contribution to the overall dissatisfaction), the slope is steeper when all respondents are included in the model.

As a last step of the time series analysis, the relative contribution of the problem areas to the overall dissatisfaction was inspected.

Figure 14: Time Series Analysis, problem areas, Country A

Figure 14 illustrates an increase in the relative importance of the problem area Ven for overall dissatisfaction, while the remaining problem areas show rather stable patterns or minor decreases.

4.3 Hierarchical Logistic Regression: SAS Modeling

Whether the results depend on the make-model of the car was investigated with hierarchical logistic modeling. Table 17 lists the chosen variables for each level of the regression. The corresponding SAS code can be found in Appendix A.

Table 17: Building the GLIMMIX procedure

Variable | Definition
Satisfaction | Dependent variable measured at the car level, within the j-th make-model
Number of problems | Car (micro) level variable, measuring the number of problems identified
Recommendation | Make-model (macro) level variable, indicating whether the customer would recommend the model in question

Table 18: Country A, Year 2006

Fit Statistics:
-2 Res Log Pseudo-Likelihood
Generalized Chi-Square
Gener. Chi-Square / DF: 0,99

Covariance Parameter Estimates:
Cov Parameter: Intercept (Subject V4), Estimate: 0,02303, Standard Error: 0,01175

Table 18 is part of the SAS output and displays the between make-model variance, which equals 0,02303.

Table 19: Type III Tests of Fixed Effects

Effect: V387, Pr > F: <.0001
Effect: V227, Pr > F: <.0001

The p-value from the Wald chi-square test is <.0001, indicating a statistically significant association between the make-model and the variables V387 and V227.

Table 20: Solutions for Fixed Effects

Effect: Intercept, Pr > |t|: <.0001
Effect: V387, Estimate: 0,2161, Pr > |t|: <.0001
Effect: V227, Pr > |t|: <.0001

The coefficient of V387 is 0,2161 and the corresponding p-value is <.0001, which indicates statistical significance. Hence V387 and V227 have a significant effect on overall satisfaction.

4.4 Canonical Correlation Analysis

Canonical correlation analysis (CCA) was applied to all satisfaction attributes and the a priori specified problem areas (including only problems that are of the annoying concept). Table 21 shows the strongest possible linear combinations between the two sets of variables. In addition, it provides information on how many of the canonical variates are significant (i.e. the first 15). In general the number of canonical variates is equal to the number of variables in the smaller set; however, the number of significant canonical variates is usually smaller. The first F-test corresponds to the hypothesis that all canonical variates are significant, the second to whether the combinations of all remaining variates excluding the first are significant, and so on.

Table 21

Canonical Correlation | Adjusted Canonical Correlation | Approximate Standard Error | Squared Canonical Correlation

Eigenvalue | Difference | Proportion | Cumulative | Likelihood Ratio | Approximate F Value | Num DF | Den DF | Pr > F

Table 22 illustrates several multivariate statistics. The small p-values for these tests imply rejection of the null hypothesis that all the canonical correlations are zero.

Table 22: Multivariate Statistics and F Approximations, S=32 M=15.5 N=

Statistic: Wilks' Lambda, Pr > F: <.0001
Statistic: Pillai's Trace, Pr > F: <.0001
Statistic: Hotelling-Lawley Trace, Pr > F: <.0001
Statistic: Roy's Greatest Root, Pr > F: <.0001

NOTE: The F statistic for Roy's Greatest Root is an upper bound.

The canonical variables, despite being artificial, can be identified in terms of the original variables. The standardized canonical coefficients (table 23) are interpreted in a similar manner as standardized regression coefficients. For example, a one standard deviation increase in the first

variable (V192) leads to a 0,1928 standard deviation increase in the score on the first canonical variate for set 1, ceteris paribus.

Table 23

Satisf1 | Satisf2 | Satisf3 | Satisf4 | Satisf5 | Satisf6

The first step in the interpretation of the CCA is examining the sign and magnitude of the canonical weights. However, canonical weights may be affected by multicollinearity; hence, examining the canonical loadings is considered more reliable. Finally, cross-loadings can be examined. A cross-loading is the correlation between an observed variable from the satisfaction area and a canonical variate from the problem area (and vice versa).

The CCA did provide a good model for the identification of the linkages between satisfaction attributes and problem areas. However, that was not the case when only the a priori specified problem sub-categories (those of the annoying concept type) were taken into consideration. While significant relationships were found, the structural context was perfunctory.

5 Discussion and conclusions

Throughout this paper, several statistical techniques and applied economics methods have been implemented in order to build exploratory and predictive models that lead to accurate outputs. There were three major objectives: exploring the relative importance and marginal contributions of several satisfaction attributes on overall customer satisfaction; evaluating the relationships between experienced problems with the product and the satisfaction attributes; and investigating whether the former depends on the volume mix.

The first major challenge encountered when selecting an appropriate methodology was the nature and the dimensionality of the input data. The very large number of input variables of different types, together with their distributions, presented severe problems for several rigid techniques. In addition, there may be several externalities affecting customer behavior and perceptions that were not measured by the survey and therefore remain unknown.
The second challenge was to

overcome the problem of multicollinearity, which commonly appears in sciences with a predominance of observational data. Finally, the measurements used needed to be consistent over time and allow for trend detection. Given this, the methodology used needed to be very flexible, with as few underlying requirements as possible, yet computationally efficient and accurate.

The first method used was the Shapley value, as it solved the core part of the research. The topic of assigning relative importance to predictors in regression is in general quite old. However, more recent developments in computational capabilities have led to applications of advanced methods and enable different approaches to the decomposition of R². This type of decomposition is often met in sciences that rely on observational data (i.e. psychology, economics and so forth). The lmg metric offered by the R relaimpo package is based on a heuristic approach of averaging over all orders (Grömping, 2007). In many previous studies relative importance is described in a purely descriptive fashion (i.e. no explanation of the statistical behavior of the variance is given). This research takes a step forward and offers a more illustrative example of the R² decomposition, which is important for understanding the Shapley value (which is exactly the lmg metric).

The results obtained were very satisfactory, since the Shapley value is a very robust estimator and can handle very complex datasets, including large portions of missing values and different measurement levels of the input variables. It successfully avoids falling into the multicollinearity trap. The method is also very stable in evaluating the impact of attributes measured over time: the changes in consecutive time periods are in fact due to real changes in the market. Furthermore, the Shapley value is the basis for the key attributes analysis.
It provides a very useful tool that can be applied to numerous data modeling problems in various managerial fields. This research effectively identified the attributes that need managerial attention and that, if improved, would increase sales and profitability. As a result of such analysis, decision makers can implement several strategies for customer acquisition and retention. The results based on the relative importance when all the respondents were included in the model were compared to the results from a dataset limited only to respondents who did not experience

any problems, which can be seen as a type of segmentation technique that groups customers with similar behaviors, and consequently similar attribute preferences, into two distinctive groups. This grouping supports optimization of the targeting process. The latter group (i.e. the group that did not experience any problems) perceives the attributes directly related to the features and characteristics of the new vehicle as much more important than the group that experienced problems. Among those customers, attributes of a broader nature (i.e. connected to performance and overall quality) contribute heavily to the overall satisfaction. Moreover, the gap between the importance of the feature-related attributes and the overall quality attributes is much wider than within the first group. There are also differences to note between the relative importance of the attributes regarding the vehicle and those regarding the dealer. The latter did not show much change over time; moreover, even the ranking of the attributes did not change significantly, and the top of the list always consists of the same attributes. Time series analysis was then applied to the satisfaction attributes (in both previously mentioned groups of respondents) and the problem areas. Because the questionnaire changed over the years, not all satisfaction-related attributes re-appeared in all consecutive years. Therefore, only those that appeared in all five models were used in the trend analysis. The data displayed many fluctuations; therefore a polynomial trend represented the best fit. However, even this was very weak in the majority of cases, and several attributes (V2, V11, V13, V9, V12) did not show any trend pattern whatsoever. There were several differences to note when comparing the trend patterns of the same attribute within the group of all respondents and the group of those who reported experiencing a problem. 
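The trend check described above amounts to fitting a low-order polynomial to a handful of yearly relative-importance values and judging how much of their variation it explains. A hedged Python sketch (the yearly values below are invented for illustration; the thesis fitted its trends to the actual Shapley values, and the choice of a quadratic is an assumption):

```python
# Sketch of a polynomial trend fit with an R^2-style goodness measure.
# The importance values are hypothetical, not the thesis's data.
import numpy as np

def poly_trend_r2(years, values, degree=2):
    """R^2 of a degree-`degree` polynomial trend fitted by least squares."""
    coefs = np.polyfit(years, values, degree)
    fitted = np.polyval(coefs, years)
    ss_res = np.sum((values - fitted) ** 2)
    ss_tot = np.sum((values - np.mean(values)) ** 2)
    return 1.0 - ss_res / ss_tot

years = np.arange(2006, 2011)
importance = np.array([0.12, 0.09, 0.11, 0.14, 0.10])  # hypothetical shares
fit = poly_trend_r2(years, importance)
```

With only five points a quadratic can look deceptively good, which is one reason a weak fit here should be read with caution, as the analysis above does.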
The nature of attribute V1 is such that its relative contribution to overall satisfaction is greater when problems are experienced (i.e. the attribute is perceived as more valuable by customers who had problems). Hence, the trend shows a similar but steeper pattern in the first group. While satisfaction attribute V8 still follows a rather similar shape, a very big difference can be noticed in attribute V10. Since the proportion of customers experiencing problems did not change

significantly over the years, the explanation of these different behaviors lies in the nature of the attributes and in perceptions affected by psychological factors among those who experience a certain problem. The research continued with an investigation combining individual-level and aggregate data, a rather common setting. The method used was hierarchical logistic regression. The advantage of such modeling is that it takes the hierarchical structure of the data into account. It specifies random effects on all levels of the analysis and consequently provides more conservative inference about the aggregate fixed effects. Such aggregate data often include valuable hints about individual behavior. An important variable to take into consideration is the willingness to recommend; it is the key metric relating to customer satisfaction. The results obtained showed a statistically significant association between the make-model and variables V387 and V227. While the main benefit of using CCA is that it provides a good exploratory technique for comparing two sets of variables, an issue of meaningfulness and significance appeared in this case. CCA performed well with manipulated datasets (i.e. limiting the dataset to attributes and problems that are assumed to be correlated) and it was an appropriate method to choose. The problem appeared when it was applied to all the satisfaction attributes and an a priori specified set of sub-problems. While it did couple the attributes, it failed to provide satisfactory results in the sense of meaningfulness (i.e. the results obtained were not logical in the sense of which satisfaction attribute coupled with which experienced problem). In order to automate the methods used, the surveys should not vary in terms of variables and attributes over the years nor between the countries; hence a standardization of the surveys is needed. Moreover, the coding, labels and formats of the variables in question should be synchronized. 
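Canonical correlation analysis, as discussed above, finds pairs of linear combinations of two variable sets that are maximally correlated. A minimal illustrative sketch in Python (the thesis used SAS PROC CANCORR on the survey data; the two-variable sets and the latent link here are synthetic assumptions):

```python
# Minimal CCA sketch: canonical correlations between two variable sets,
# computed from the SVD of the product of their orthonormal (QR) bases.
# Synthetic stand-in data, not the thesis's attributes/problem areas.
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of data matrices X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Whiten each set via a QR decomposition, then take singular values
    # of Qx' Qy; these are the canonical correlations (all in [0, 1]).
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(1)
shared = rng.normal(size=(300, 1))   # latent factor linking the two sets
X = np.column_stack([shared + 0.5 * rng.normal(size=(300, 1)),
                     rng.normal(size=(300, 1))])
Y = np.column_stack([shared + 0.5 * rng.normal(size=(300, 1)),
                     rng.normal(size=(300, 1))])
rho = canonical_correlations(X, Y)
# rho[0] picks up the shared latent factor; rho[1] reflects only noise.
```

This also illustrates the meaningfulness issue noted above: the first canonical pair is driven by whatever common structure exists, which need not correspond to an interpretable attribute-problem coupling.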
Some of the code in Appendix A may be re-applied. The results are consistent with a key element of the company objectives, namely the intention of building on customer satisfaction and retention. The results are also broadly compatible with research done in other sectors. Hence the most important objective of the customer satisfaction analysis is achieved in this research. 

While all the models used performed fairly well, there is room for further investigation and research.

5.1 Proposed further research

Kernel Canonical Correlation Analysis
The issue with classical canonical correlation is that it is limited to linear associations. Using kernel methods as a pre-step in the analysis can enhance the results by extending the classical model to a general nonlinear setting. In addition, it no longer requires the Gaussian distributional assumption for the observations.

Moving Coalition Analysis
Mansor and Ohsato (Mansor & Ohsato, 2010) proposed a method called Moving Coalition Analysis (MCA) to observe the performance trends of a coalition over time. It divides the coalition into several sub-coalitions and determines the characteristic function of all sub-coalitions. Each period is then treated as a player (Mansor & Ohsato, 2010). 

6 Literature and sources
1. Alterman, T., Deddens, A.J., Constella, J.L. (NN). Analysis of Large Hierarchical Data with Multilevel Logistic Modeling Using PROC GLIMMIX. SUGI SAS Users Group International. 151 (32).
2. Dai, J., Li, Z., Rocke, D. (NN). Hierarchical Logistic Regression Modeling with SAS GLIMMIX. University of California: Davis.
3. Chantreuil, F., Trannoy, A. (1999). Inequality decomposition values: the trade-off between marginality and consistency. THEMA Discussion Paper. Université de Cergy-Pontoise: France.
4. Conklin, M., Powaga, K., Lipovetsky, S. (2004). Customer Satisfaction Analysis: Identification of Key Drivers. European Journal of Operational Research, 3 (154).
5. Feldman, B. (December, 1999). The Proportional Value of a Cooperative Game. Accessed 21st September, 2011.
6. Feldman, B. (March, 2007). A Theory of Attribution. Accessed 5th September, 2011.
7. Garavaglia, S., Sharma, A. (NN). A Smart Guide to Dummy Variables: Four Applications and a Macro. New Jersey: Murray Hill.
8. GfK Customer Loyalty (NN). Getting Better Regression Results with Shapley Value Regression. Accessed 28th September, 2011.
9. Grömping, U. (May, 2007). Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. The American Statistician, 2 (61).
10. Grömping, U. (October, 2007). Relative Importance for Linear Regression in R: The Package relaimpo. Journal of Statistical Software, 1 (17).
11. Hair, J.F., Anderson, R.E., Tatham, R.L., Black, W.C. (1998). Multivariate Data Analysis (5th ed.). New Jersey: Prentice Hall, Inc.
12. Hart, S., Mas-Colell, A. (May, 1989). Potential, Value and Consistency. Econometrica, 57 (3).
13. Hausknecht, D.R. (1990). Measurement Scales in Customer Satisfaction/Dissatisfaction. Accessed 16th June, 2011. 

14. Huang, S-Y., Lee, H-M., Hsiao, C.K. (August, 2006). Kernel Canonical Correlation Analysis and its Applications to Nonlinear Measures of Association and Test of Independence. Institute of Statistical Science: Academia Sinica, Taiwan.
15. Israeli, O. (March, 2007). A Shapley-based decomposition of the R-square of a linear regression. The Journal of Economic Inequality, 5.
16. Johnson, R.A., Wichern, D.W. (2001). Applied Multivariate Statistical Analysis (5th ed.). New Jersey: Prentice Hall.
17. Knapp, T. (March/April, 1990). Commentary: Treating Ordinal Scales as Interval Scales: An Attempt to Resolve the Controversy. Psychometrica, 39 (2).
18. Kruskal, W. (1987). Relative Importance by Averaging over Orderings. The American Statistician, 41 (1).
19. Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140 (32).
20. Lipovetsky, S., Conklin, M. (2001). Analysis of Regression in a Game Theory Approach. Applied Stochastic Models in Business and Industry, 17.
21. Mansor, M.A., Ohsato, A. (2010). The Concept of Moving Coalition Analysis and its Transpose. European Journal of Scientific Research, 4 (39).
22. Mikulic, J., Prebežac, D. (2011). A critical review of techniques for classifying quality attributes in the Kano model. Managing Service Quality, 1 (21).
23. Petrosjan, L., Zaccour, G. (June, 2001). Time-consistent Shapley value allocation of pollution cost reduction. Journal of Economic Dynamics & Control (27).
24. Siegel, S. (1967). Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill Book Co.
25. Shapley, L.S. (1953). A value for n-person games. In: Kuhn, H.W. and Tucker, A.W. (eds.) (1953). Contributions to the Theory of Games. Princeton: Princeton University Press.
26. Shorrocks, A.F. (1999). Decomposition Procedures for Distributional Analysis: A Unified Framework Based on the Shapley Value. United Kingdom: University of Essex.
27. Theil, H. (1987). 
How many bits of information does an independent variable yield in a multiple regression? Statistics and Probability Letters, 6 (2). 

28. Von Neumann, J. (1928). On the Theory of Games of Strategy. English translation in: Kuhn, H.W. and Tucker, A.W. (eds.) (1959). Contributions to the Theory of Games. Princeton: Princeton University Press.
29. Weiner, J.L., Tang, J. (NN). Multicollinearity in Customer Satisfaction Research. Ipsos Loyalty.
30. Yeung, D.W.K. (2010). Time Consistent Shapley Value Imputations for Cost-Saving Joint Ventures. Accessed 21st September, 2011.
31. Yeung, D.W.K., Petrosyan, L.A. (2004). Subgame consistent cooperative solutions in stochastic differential games. Journal of Optimization Theory and Applications, 120 (3).

Appendix A: SAS and R codes

Univariate Analysis (SAS graphics)

goptions reset=(axis, legend, pattern, symbol, title, footnote)
  colors=(black blue green red yellow cyan gold)
  norotate hpos=0 vpos=0 htext= ftext= ctext= target= gaccess= gsfmode=;
title1 'Frequency Distribution' color=gold;
title3 underlin=1 'V191' color=red;
footnote color=green 'Dataset: ';
goptions device=win ctext=blue graphrc interpol=join;
pattern1 color=blue value=x1;
pattern2 color=blue value=x1;
pattern3 color=blue value=x1;
pattern4 color=blue value=x1;
pattern5 color=blue value=x1;
pattern6 color=blue value=x1;
pattern7 color=blue value=x1;
pattern8 color=blue value=x1;
pattern9 color=blue value=x1;
pattern10 color=blue value=x1;
axis1 color=blue width=2.0;
axis2 color=blue width=2.0;
axis3 color=blue width=2.0;
proc gchart data=;
  hbar V191 / discrete;
run;

Dummy variables MACRO (SAS)

options nosymbolgen mlogic mprint obs=;
libname ;
filename out ;
data ;
  set ;
run;

/* MACRO PARAMETERS: dsn = input dataset name, var = variable to be categorized,
   prefix = categorical variable prefix, flat = flatfile name with code
   (referenced in the FILENAME statement) */

%macro dmycode (dsn=, var=, prefix=, flat=);
proc summary data=&dsn nway;
  class &var;
  output out=x (keep=&var);
run;
proc print;
run;
data _null_;
  set x nobs=totx end=last;
  if last then call symput('tot', trim(left(put(totx, best.))));
  call symput('z'||trim(left(put(_n_, best.))), trim(left(&var)));
run;
data _null_;
  file &flat;
  %do i=1 %to &tot;
    put "&prefix&&z&i = 0;";
  %end;
  put "SELECT;";
  %do i=1 %to &tot;
    put "when (&var = &&z&i) &prefix&&z&i = 1;";
  %end;
  put "otherwise V_oth = 1;";
  put "end;";
run;
%mend dmycode;

%dmycode (dsn=, var=, prefix=, flat=out);
run; quit;

Relative Importance (R code)

> library(relaimpo)
> linmod <- lm(response_variable ~ ., data = )
> metrics <- calc.relimp(linmod, type = c("lmg", "first", "last"), rela = TRUE)
> metrics

Canonical Correlation (SAS code)

proc cancorr corr data=
  vprefix=problems wprefix=satisfaction
  vname='Problem Areas' wname='Satisfaction Areas';
  var Vp Ve Vw Vb Vo Vi Vel Ven Vcl Vbr Vsw Vs Vex Vot;
  with v191 v192 v193 v194 v195 v196 v197 v198 v199 v200 v201 v202 v203 v204
       v205 v206 v207 v208 v209 v210 v211 v212 v213 v214 v215 v216 v217 v218
       v219 v220 v221 v222 v223 v224;
run;

Hierarchical Logistic Regression (SAS code)

data;
  set;
  if V >= 5 then V_S = 1;
  else V_S = 0;
  keep V_S make-model number_problems recommendation;
run;

proc glimmix;
  class make-model recommendation;
  model V_S = number_problems recommendation / dist=binary link=logit ddfm=bw solution;
  random intercept / subject=make-model;
run;

70 Appendix B: Outputs I. Relative Importance (R-Output) Attributes related to the vehicle Country A: Year 2006 Response variable: V191 Total response variance: Analysis based on observations 20 Predictors: V200 V195 V220 V194 V196 V198 V201 V212 V216 V213 V214 V192 V206 V207 V209 V197 V215 V218 V224 V210 Proportion of variance explained by model: 56.21% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e

71 Country A, Year 2007 Response variable: V243 Total response variance: Analysis based on observations 20 Predictors: V252 V247 V272 V376 V250 V246 V266 V270 V253 V259 V267 V261 V244 V248 V374 V268 V265 V258 V271 V249 Proportion of variance explained by model: 54.55% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e

72 Country A, Year 2008 Response variable: V185 Total response variance: Analysis based on observations 20 Predictors: V358 V350 V189 V193 V188 V186 V201 V348 V191 V204 V345 V198 V199 V190 V200 V192 V346 V205 V356 V357 Proportion of variance explained by model: 57.11% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V V V V V V V V V V V V V V V V V V V V

73 Country A, Year 2009 Response variable: V166 Total response variance: Analysis based on observations 20 Predictors: V174 V200 V172 V169 V192 V170 V185 V182 V198 V167 V190 V186 V179 V187 V188 V173 V180 V171 V191 V189 Proportion of variance explained by model: 56.19% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V V V V V V V V V V V V V V V V V V V V

74 Country A, Year 2010 Response variable: V136 Total response variance: Analysis based on observations 20 Predictors: V169 V144 V142 V140 V162 V156 V158 V157 V139 V160 V161 V155 V150 V153 V149 V152 V167 V141 V170 V137 Proportion of variance explained by model: 57.96% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e

75 II. Attributes related to the vehicle (among those that did not experience any problems) Country A, Year 2006 Response variable: V191 Total response variance: Analysis based on observations 20 Predictors: V200 V194 V195 V209 V220 V192 V214 V201 V215 V196 V198 V207 V218 V217 V206 V213 V197 V219 V221 V212 Proportion of variance explained by model: 54.84% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e

76 Country A, Year 2007 Response variable: V243 Total response variance: Analysis based on observations 20 Predictors: V252 V246 V376 V247 V266 V261 V244 V272 V250 V253 V267 V248 V259 V270 V269 V377 V258 V265 V249 V271 Proportion of variance explained by model: 52.59% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e

77 Country A, Year 2008 Response variable: V185 Total response variance: Analysis based on observations 20 Predictors: V193 V188 V201 V350 V186 V345 V191 V358 V199 V348 V346 V347 V198 V189 V359 V205 V200 V190 V356 V351 Proportion of variance explained by model: 55.29% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e

78 Country A, Year 2009 Response variable: V166 Total response variance: Analysis based on observations 20 Predictors: V174 V169 V182 V172 V187 V192 V167 V200 V190 V189 V180 V188 V170 V179 V201 V181 V186 V198 V171 V193 Proportion of variance explained by model: 55.26% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e

79 Country A, Year 2010 Response variable: V136 Total response variance: Analysis based on observations 20 Predictors: V144 V139 V152 V142 V157 V137 V162 V169 V150 V170 V160 V159 V158 V140 V149 V156 V161 V141 V167 V151 Proportion of variance explained by model: 51.52% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e V e

80 III. Attributes related to the dealer Country A, Year 2006 Response variable: V90 Total response variance: Analysis based on observations 13 Predictors: V91 V92 V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 Proportion of variance explained by model: 83.44% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e

81 Country A, Year 2007 Response variable: V93 Total response variance: Analysis based on observations 13 Predictors: V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 Proportion of variance explained by model: 84.2% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e

82 Country A, Year 2008 Response variable: V51 Total response variance: Analysis based on observations 13 Predictors: V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 Proportion of variance explained by model: 84.63% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e

83 Country A, Year 2009 Response variable: V129 Total response variance: Analysis based on observations 13 Predictors: V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 Proportion of variance explained by model: 84.65% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V e V e V e V e V e V e V e V e V e V e V e V e V e

84 Country A, Year 2010 Response variable: V99 Total response variance: Analysis based on observations 13 Predictors: V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 Proportion of variance explained by model: 85.27% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first V V V V V V V V V V V V V

85 IV. Problem Areas Country A, Year 2006 Response variable: V191 Total response variance: Analysis based on observations 14 Predictors: Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot Proportion of variance explained by model: 14.41% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot

86 Country A, Year 2007 Response variable: V243 Total response variance: Analysis based on observations 14 Predictors: Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot Proportion of variance explained by model: 14.12% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot

87 Country A, Year 2008 Response variable: V185 Total response variance: Analysis based on observations 14 Predictors: Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot Proportion of variance explained by model: 14.39% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot

88 Country A, Year 2009 Response variable: V166 Total response variance: Analysis based on observations 14 Predictors: Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot Proportion of variance explained by model: 12.77% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot

89 Country A, Year 2010 Response variable: V136 Total response variance: Analysis based on observations 14 Predictors: Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot Proportion of variance explained by model: 14.41% Metrics are normalized to sum to 100% (rela=true). Relative importance metrics: lmg last first Vp Ve Vw Vb Vo Vi Vel Ven Vc Vbr Vsw Vs Vex Vot

V. Trend Analysis Trend Analysis (comparison) The above trend line is fitted to the V7 attribute and includes only the respondents who did not experience any problems. The fitted line is quite poor; however, when taking the whole dataset into consideration, there was no pattern at all. The figure below, on the other hand, shows a weak pattern when all the respondents were included in the analysis, but did not display any trend pattern at all among those that did not experience any problem. 

The trend line fitted to the relative importance of attribute V3 (including all the respondents) displayed a good fit, while it did not show any pattern at all when considering only respondents who did not report any problem. To emphasise the differences between the two datasets, V15 provides a clear example of different movements in relative importance over the five years (figures below).


A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

Schools Value-added Information System Technical Manual

Schools Value-added Information System Technical Manual Schools Value-added Information System Technical Manual Quality Assurance & School-based Support Division Education Bureau 2015 Contents Unit 1 Overview... 1 Unit 2 The Concept of VA... 2 Unit 3 Control

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Regression Modeling Strategies

Regression Modeling Strategies Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Introduction to time series analysis

Introduction to time series analysis Introduction to time series analysis Margherita Gerolimetto November 3, 2010 1 What is a time series? A time series is a collection of observations ordered following a parameter that for us is time. Examples

More information

A Log-Robust Optimization Approach to Portfolio Management

A Log-Robust Optimization Approach to Portfolio Management A Log-Robust Optimization Approach to Portfolio Management Dr. Aurélie Thiele Lehigh University Joint work with Ban Kawas Research partially supported by the National Science Foundation Grant CMMI-0757983

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

3.2 How to translate a business problem statement into an analytics problem

3.2 How to translate a business problem statement into an analytics problem Kano s model 1 Kano s model distinguishes between expected requirements so-called must-be requirements normal requirements, and exciting requirements, also called attractive requirements 2 Expected requirements

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Learning outcomes. Knowledge and understanding. Competence and skills

Learning outcomes. Knowledge and understanding. Competence and skills Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

More information

A C T R esearcli R e p o rt S eries 2 0 0 5. Using ACT Assessment Scores to Set Benchmarks for College Readiness. IJeff Allen.

A C T R esearcli R e p o rt S eries 2 0 0 5. Using ACT Assessment Scores to Set Benchmarks for College Readiness. IJeff Allen. A C T R esearcli R e p o rt S eries 2 0 0 5 Using ACT Assessment Scores to Set Benchmarks for College Readiness IJeff Allen Jim Sconing ACT August 2005 For additional copies write: ACT Research Report

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

Multivariate Analysis. Overview

Multivariate Analysis. Overview Multivariate Analysis Overview Introduction Multivariate thinking Body of thought processes that illuminate the interrelatedness between and within sets of variables. The essence of multivariate thinking

More information

Segmentation: Foundation of Marketing Strategy

Segmentation: Foundation of Marketing Strategy Gelb Consulting Group, Inc. 1011 Highway 6 South P + 281.759.3600 Suite 120 F + 281.759.3607 Houston, Texas 77077 www.gelbconsulting.com An Endeavor Management Company Overview One purpose of marketing

More information

Some Essential Statistics The Lure of Statistics

Some Essential Statistics The Lure of Statistics Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived

More information

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

DISCRIMINANT FUNCTION ANALYSIS (DA)

DISCRIMINANT FUNCTION ANALYSIS (DA) DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant

More information

AN ILLUSTRATION OF COMPARATIVE QUANTITATIVE RESULTS USING ALTERNATIVE ANALYTICAL TECHNIQUES

AN ILLUSTRATION OF COMPARATIVE QUANTITATIVE RESULTS USING ALTERNATIVE ANALYTICAL TECHNIQUES CHAPTER 8. AN ILLUSTRATION OF COMPARATIVE QUANTITATIVE RESULTS USING ALTERNATIVE ANALYTICAL TECHNIQUES Based on TCRP B-11 Field Test Results CTA CHICAGO, ILLINOIS RED LINE SERVICE: 8A. CTA Red Line - Computation

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Chapter 6: Multivariate Cointegration Analysis

Chapter 6: Multivariate Cointegration Analysis Chapter 6: Multivariate Cointegration Analysis 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie VI. Multivariate Cointegration

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

Marketing Mix Modelling and Big Data P. M Cain

Marketing Mix Modelling and Big Data P. M Cain 1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

More information

The primary goal of this thesis was to understand how the spatial dependence of

The primary goal of this thesis was to understand how the spatial dependence of 5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

Teaching Multivariate Analysis to Business-Major Students

Teaching Multivariate Analysis to Business-Major Students Teaching Multivariate Analysis to Business-Major Students Wing-Keung Wong and Teck-Wong Soon - Kent Ridge, Singapore 1. Introduction During the last two or three decades, multivariate statistical analysis

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

Multiple Regression: What Is It?

Multiple Regression: What Is It? Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

More information

Understanding Characteristics of Caravan Insurance Policy Buyer

Understanding Characteristics of Caravan Insurance Policy Buyer Understanding Characteristics of Caravan Insurance Policy Buyer May 10, 2007 Group 5 Chih Hau Huang Masami Mabuchi Muthita Songchitruksa Nopakoon Visitrattakul Executive Summary This report is intended

More information

International Statistical Institute, 56th Session, 2007: Phil Everson

International Statistical Institute, 56th Session, 2007: Phil Everson Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: [email protected] 1. Introduction

More information

Paper D10 2009. Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI

Paper D10 2009. Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI Paper D10 2009 Ranking Predictors in Logistic Regression Doug Thompson, Assurant Health, Milwaukee, WI ABSTRACT There is little consensus on how best to rank predictors in logistic regression. This paper

More information

A Comparison of Variable Selection Techniques for Credit Scoring

A Comparison of Variable Selection Techniques for Credit Scoring 1 A Comparison of Variable Selection Techniques for Credit Scoring K. Leung and F. Cheong and C. Cheong School of Business Information Technology, RMIT University, Melbourne, Victoria, Australia E-mail:

More information

FACTOR ANALYSIS NASC

FACTOR ANALYSIS NASC FACTOR ANALYSIS NASC Factor Analysis A data reduction technique designed to represent a wide range of attributes on a smaller number of dimensions. Aim is to identify groups of variables which are relatively

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4

4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4 4. Simple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/4 Outline The simple linear model Least squares estimation Forecasting with regression Non-linear functional forms Regression

More information

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios By: Michael Banasiak & By: Daniel Tantum, Ph.D. What Are Statistical Based Behavior Scoring Models And How Are

More information

Data Mining Introduction

Data Mining Introduction Data Mining Introduction Bob Stine Dept of Statistics, School University of Pennsylvania www-stat.wharton.upenn.edu/~stine What is data mining? An insult? Predictive modeling Large, wide data sets, often

More information

Analysing Questionnaires using Minitab (for SPSS queries contact -) [email protected]

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Analysing Questionnaires using Minitab (for SPSS queries contact -) [email protected] Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Alok Gupta. Dmitry Zhdanov

Alok Gupta. Dmitry Zhdanov RESEARCH ARTICLE GROWTH AND SUSTAINABILITY OF MANAGED SECURITY SERVICES NETWORKS: AN ECONOMIC PERSPECTIVE Alok Gupta Department of Information and Decision Sciences, Carlson School of Management, University

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

How To Understand Multivariate Models

How To Understand Multivariate Models Neil H. Timm Applied Multivariate Analysis With 42 Figures Springer Contents Preface Acknowledgments List of Tables List of Figures vii ix xix xxiii 1 Introduction 1 1.1 Overview 1 1.2 Multivariate Models

More information

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics

More information

CPC/CPA Hybrid Bidding in a Second Price Auction

CPC/CPA Hybrid Bidding in a Second Price Auction CPC/CPA Hybrid Bidding in a Second Price Auction Benjamin Edelman Hoan Soo Lee Working Paper 09-074 Copyright 2008 by Benjamin Edelman and Hoan Soo Lee Working papers are in draft form. This working paper

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Lean Six Sigma Analyze Phase Introduction. TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY

Lean Six Sigma Analyze Phase Introduction. TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY Before we begin: Turn on the sound on your computer. There is audio to accompany this presentation. Audio will accompany most of the online

More information

Easily Identify Your Best Customers

Easily Identify Your Best Customers IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

More information

IBM SPSS Direct Marketing 19

IBM SPSS Direct Marketing 19 IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Overview of Factor Analysis

Overview of Factor Analysis Overview of Factor Analysis Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 August 1,

More information

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy. Blue vs. Orange. Review Jeopardy Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

More information

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.

More information

Published entries to the three competitions on Tricky Stats in The Psychologist

Published entries to the three competitions on Tricky Stats in The Psychologist Published entries to the three competitions on Tricky Stats in The Psychologist Author s manuscript Published entry (within announced maximum of 250 words) to competition on Tricky Stats (no. 1) on confounds,

More information

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

More information