Free Trial - BIRT Analytics - IAAs
11. Predict Customer Gender

Once we log in to the BIRT Analytics Free Trial, we see a set of predefined advanced analyses ready to be used. These saved analyses are what we call Instant Advanced Analyses (IAAs). If we double-click My folders > Demo Retail Customer Analytics, we see a list of eleven (11) saved analyses. They serve as an introduction to what BIRT Analytics can do for Customer Analytics in a retail environment (our demo database is based on a home improvement retailer).
These eleven categories are built to cover distinct areas of Customer Analytics that are of high value to any retailer, and they try to answer questions like:

1. How are my sales performing by product category?
2. An RFM approach: Who are our best RFM customers? What do they look like?
3. Advanced segmentation of your customers in order to focus your marketing efforts.
4. Who are your churn customers? What do they look like? And, more importantly, which of your loyal customers are most likely to become churners in the future?
5. How are your products associated in a basket? What is the best next offer when a customer has products A and B in their basket?
6. Discover new cross-sell opportunities.
7. Will the value of my customers grow or decrease in the coming months?
8. What is the voice of the customer saying about us on social media?
9. Is there any relationship between the data in the Twitter interactions?
10. Can we predict the total audience a tweet could reach?
11. Can we complete empty gender values in our customers' data to target them with the correct marketing campaign?
Now we are going to answer the eleventh question: Can we complete empty gender values in our customers' data to target them with the correct marketing campaign? When we double-click this analysis, BIRT Analytics goes to Analytics > Advanced and shows the Parameters tab, with the starting data of the model.
This logistic regression (Analytics > Advanced > Logistic regression) is defined using the following parameters:

- A Domain: all the Customers that have a Gender defined (values are not null).
- A dependent variable, the one we want to predict: whether a certain customer is Female. This is a new column with a binary response, 1 (yes) or 0 (no).
- A selection of independent variables: the customer's age, whether the customer is an internet customer, whether the customer allowed us to send them emails, and whether the customer is a store customer.
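Outside of BIRT Analytics, the same setup can be sketched in a few lines of Python. This is only an illustration of the three parameters described here. The records and field names (gender, age, internet, mailable, store) are hypothetical and are not the column names of the demo database.

```python
# Hypothetical customer records; the field names are illustrative only.
customers = [
    {"id": 1, "gender": "F", "age": 34, "internet": "Y", "mailable": "N", "store": "Y"},
    {"id": 2, "gender": "M", "age": 51, "internet": "N", "mailable": "Y", "store": "Y"},
    {"id": 3, "gender": None, "age": 27, "internet": "Y", "mailable": "Y", "store": "N"},
]


def to_flag(value):
    """Encode a Y/N field as 1/0."""
    return 1 if value == "Y" else 0


# Domain: only the customers whose gender is defined (not null).
domain = [c for c in customers if c["gender"] is not None]

# Dependent variable: 1 if the customer is female, 0 otherwise.
y = [1 if c["gender"] == "F" else 0 for c in domain]

# Independent variables: age plus three Y/N indicators encoded as 0/1.
X = [
    [c["age"], to_flag(c["internet"]), to_flag(c["mailable"]), to_flag(c["store"])]
    for c in domain
]
```

Customer 3 has no gender, so it is left out of the domain used to build the model. It is exactly the kind of record the finished model would later be applied to.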
We can see what the calculated logistic regression looks like in the Results tab. BIRT Analytics provides the logistic function that defines the predictive model. Given the values of the four independent variables, this equation returns a prediction of the probability that the customer is female. Below the equation there is a 5-star rating of the goodness of fit of the equation compared with the real data of the original Domain. In our case we have 5 stars, which means that this predictive model is really accurate. This rating is based on the p-value of the Chi-squared test that is shown in the Statistics tab.
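As a rough illustration of how such an equation turns four inputs into a probability, the sketch below evaluates the standard logistic function p = 1 / (1 + e^(-z)), where z is the intercept plus each coefficient times its variable. The coefficient values are made up for the example. The real ones are those shown in the Results tab.

```python
import math

# Hypothetical coefficients, for illustration only.
COEFFICIENTS = {
    "intercept": -0.5,
    "age": 0.01,
    "internet": 0.3,
    "mailable": 0.05,
    "store": 0.6,
}


def probability_female(age, internet, mailable, store, coef=COEFFICIENTS):
    """Return the predicted probability that a customer is female.

    internet, mailable and store are 0/1 indicators.
    """
    z = (
        coef["intercept"]
        + coef["age"] * age
        + coef["internet"] * internet
        + coef["mailable"] * mailable
        + coef["store"] * store
    )
    return 1.0 / (1.0 + math.exp(-z))


# Example: a 40-year-old store customer who is neither an internet
# customer nor mailable.
p = probability_female(age=40, internet=0, mailable=0, store=1)
```

The result is always strictly between 0 and 1, so it can be read directly as a probability, or turned into a yes/no prediction by choosing a cut-off such as 0.5.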
This third tab shows all the tests and data used to rate the goodness of fit of the logistic model. The tab is divided into two main parts: the upper part contains the global fitting tests (which evaluate the whole equation), and the lower table shows specific tests for each of the coefficients of the equation (including the intercept). This kind of regression is globally evaluated by two distinct tests:

- Chi-squared test and its p-value
- Log likelihood ratio statistic test

The first one needs to be a high value, or equivalently its p-value needs to be as small as possible (under 0.01 we can assume that the model fits the training sample). The log likelihood is always negative and needs to be close to zero. In our example we have a good first test, with a quite high Chi-squared value and a p-value smaller than 0.01. On the other hand, the log likelihood is a large negative value. That means that this model doesn't fit as well as it seems at first sight.
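One common way to obtain a global chi-squared statistic for a logistic model is the likelihood ratio test, which compares the log likelihood of the fitted model with that of an intercept-only model. The sketch below shows that calculation. Whether BIRT Analytics computes its Chi-squared value in exactly this way is an assumption, and the function names and sample numbers are hypothetical. The p-value uses a closed form that is only valid for an even number of degrees of freedom (here 4, one per independent variable).

```python
import math


def log_likelihood(y, p):
    """Log likelihood of binary outcomes y under predicted probabilities p."""
    return sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(ll_null, ll_model, df):
    """Return (chi-squared statistic, p-value) for nested models."""
    stat = 2.0 * (ll_model - ll_null)
    return stat, chi2_sf_even_df(stat, df)


# Hypothetical log likelihoods: intercept-only model vs. model with 4 predictors.
stat, p_value = likelihood_ratio_test(ll_null=-100.0, ll_model=-80.0, df=4)
```

Note that the log likelihood is a sum over all records, so its absolute size grows with the number of customers in the Domain. The chi-squared statistic only measures how much the predictors improve on the intercept-only model.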
Each coefficient in the logistic equation has its own tests:

- Standard error: a measure of the mean error that we assume when we compare the logistic equation (the predictive model) against the real data.
- Odds ratio: a measure of the influence of the independent variable (related to that coefficient) on the dependent variable. The bigger the ratio, the stronger the relationship between the dependent variable and the independent variable.
- Upper and Lower Confidence levels: defined for each coefficient and related to the Odds ratio.
- p-value of the log likelihood ratio for a certain coefficient: it evaluates only this coefficient, but it has the same interpretation as the global test.
- Significance level: based on the p-value of the log likelihood ratio, on a 0 to 5 scale that rates how relevant a certain independent variable is in the equation.

One of the subtleties of logistic regression is that goodness of fit cannot be judged from a single test. Several tests have to be read together to decide whether the model is good enough or not. In our example, analyzing each coefficient, we can conclude that:

- Intercept: It has a low standard error, but its Odds ratio is quite low, so this coefficient could be zero because of its low relevance in predicting the dependent variable.
- Age: This variable is only slightly relevant according to its Odds ratio, although its significance level is the highest.
- Internet Customer EQ Y: This coefficient is more relevant than Age (a bigger Odds ratio) and it also has a high significance level.
- Mailable EQ Y: This categorical variable is the least relevant of all the predictors, because of its Odds ratio value.
- Store Customer EQ Y: This is the best variable, the most relevant one, and it defines this model. It has the highest Odds ratio and a high significance level.

This model could be applied in a new column to classify the customers that don't have a Gender assigned, predicting their probability of being female.
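To make the relationship between a coefficient, its standard error, its Odds ratio and the confidence levels more concrete, the sketch below uses the usual textbook definitions: the odds ratio is the exponential of the coefficient, and an approximate 95% interval is the exponential of the coefficient plus or minus 1.96 standard errors (a Wald interval). That BIRT Analytics uses exactly these formulas is an assumption, and the estimates below are hypothetical.

```python
import math


def odds_ratio_with_ci(coefficient, standard_error, z=1.96):
    """Return (odds_ratio, lower, upper) using a Wald interval."""
    return (
        math.exp(coefficient),
        math.exp(coefficient - z * standard_error),
        math.exp(coefficient + z * standard_error),
    )


# Hypothetical (coefficient, standard error) pairs, for illustration only.
estimates = {
    "Age": (0.01, 0.002),
    "Internet Customer EQ Y": (0.30, 0.05),
    "Mailable EQ Y": (0.05, 0.04),
    "Store Customer EQ Y": (0.60, 0.06),
}

for name, (coef, se) in estimates.items():
    ratio, lower, upper = odds_ratio_with_ci(coef, se)
    print(f"{name}: odds ratio {ratio:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Under these definitions a coefficient of zero gives an odds ratio of exactly 1, so an interval that contains 1 suggests the variable may have no effect. The odds ratio of a continuous variable such as age refers to a one-unit change (one year), which is why it can look small next to the ratios of the yes/no variables.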
If you want to know more about this data mining technique, you can find more documentation on logistic regression at: http://developer.actuate.com/resources/documentation/birt-analytics/4-4/

Copyright 2014 Actuate Corporation. All rights reserved. Actuate, legodo, BIRT ihub, BIRT ihub F-Type, BIRT Analytics, Actuate Customer Communications Suite, The Actuate Document Accessibility Appliance, BIRT ondemand, BIRT Viewer Toolkit, and the Actuate logo are trademarks or registered trademarks of Actuate Corporation and/or its affiliates in the U.S. and certain other countries. The use of the word partner or partnership does not imply a legal partnership relationship between Actuate and any other company. All other brands, names or trademarks mentioned may be trademarks of their respective owners.

Actuate Corporation
951 Mariners Island Boulevard
San Mateo, CA 94404
Tel: (+1) 888-422-8828
BIRTAnalytics@actuate.com
www.actuate.com/birtanalytics