SAP Predictive Analysis Real Life Use Case Predicting Who Will Buy Additional Insurance


Using SAP Predictive Analysis to predict the customers who will most likely buy additional insurance, based on known customer attributes.

Applies to: Front-end tools: SAP Predictive Analysis SP14 & SAP InfiniteInsight (formerly known as KXEN). Back-end tools: SAP HANA & PAL (Predictive Analysis Library). This document was originally developed with SAP PA SP11, but with the completely new design in SP14 a request was made to update the document. At the same time we added comments from fellow peers working within the SAP Predictive Analysis community.

Summary: This paper provides a step-by-step description, including screenshots, of how SAP Predictive Analysis and SAP InfiniteInsight, which as of this writing are sold as a bundle, can be used to predict the potential customers who will buy additional products based on known attributes and behavior. Using the combined strengths of SAP InfiniteInsight and SAP Predictive Analysis, this article will also demonstrate how these two products can meet the challenge and supplement each other to provide even better prediction models. This article is for educational purposes only and uses actual data from an insurance company. The data comes from the CoIL (Computational Intelligence and Learning) challenge from the year 2000, which had the following goal: Can you predict who would be interested in buying a caravan insurance policy and give an explanation why? After reading this article you will be able to understand the differences between classification algorithms. You will learn how to simplify a dataset by determining which variables are important and which are not, how to score a model, and how to use SAP Predictive Analysis and SAP InfiniteInsight to build models on existing data and run the custom models on new data.
The authors of this article have closely observed the online Predictive Analysis community, where members are constantly looking for two important things: which statistical algorithm to choose for which business case, and real-life business cases with actual data. The latter is certainly difficult to find but immensely valuable for understanding this subject. Of course, data is vital to every company and, in today's competitive market, gives a competitive advantage. Finding a real-life case for educational purposes with actual data is a big challenge. Final 2013 SAP AG 1

Authors' Bio

Kurt Holst is an experienced consultant on SAP Predictive Analysis & BusinessObjects Enterprise architecture, configuration and performance optimization. Kurt has worked with most major BI reporting & data mining tools together with different OLAP databases, SAP NetWeaver BW and relational databases. Before joining SAP, Kurt worked as a BI architect & developer, both internally and as an external consultant. Furthermore, Kurt has been attached to the University of Southern Denmark as an external lecturer on Business Intelligence, Data Warehousing & Data Mining. Kurt helps customers transform data into knowledge using predictive analysis or business intelligence and is part of the Business Analytics Services team at SAP Denmark.

Atul Manga (atul.manga@sap.com) is an experienced consultant with a solid understanding and deep knowledge of the design and implementation of business solutions in SAP BusinessObjects, with focus on reports, universe creation, dashboards and user setup. Atul has experience from projects in small, medium and large enterprises within various industries such as engineering, advertising, auditing, nonprofit organizations and consultancies. Atul has worked in different roles such as lead reporting architect, reporting specialist and business analyst. Atul works with the Business Analytics Services team in SAP Denmark.

Company: SAP
Last update: January

Disclaimer: Please note that the information and views set out in this article are those of the authors and do not necessarily reflect the official roadmap, opinion or strategy of SAP. Neither SAP nor any person acting on their behalf may be held responsible for the use which may be made of the information contained therein.

Contents

1 Document update history
Executive summary
Background
Definitions and expected foundations for the reader
  Terms and definitions
    CRiSP-DM
    SEMMA
    COIL
    Data Mining
    Decision tree
    Neural network
    Naïve Bayes
    Linear regression model
    Logistic regression model
    Ensemble model
    Confusion matrix
    Stepwise regression
    SAP Predictive Analysis
    SAP InfiniteInsight (formerly known as KXEN InfiniteInsight)
Data mining implementation methodologies
  Data mining implementation methodologies in a SAP Predictive Analysis perspective
Implementation approach using CRiSP-DM
Business understanding
  Determine Business Objectives
    Determine Business Objectives for Insurance case
  Assess situation
    Assess situation for Insurance case
  Determine Data Mining Goals
    Determine Data Mining Goals for insurance case
  Produce Project Plan
    Produce Project Plan for insurance case
Data understanding
  Collect initial data
    Collect initial data for insurance case
  Describe Data
    Describe data for insurance case
  Explore Data
    Gaining data understanding
  Verify Data Quality
    Verify data quality for insurance case
Data preparation
  Select data
  Clean data
  Construct data
    Normalizing data
    Construction of new variables

    9.3.3 Binning
  Integrate data
  Format data
    Formatting data in SAP InfiniteInsight
    Format data for insurance case
    Balancing data
Modeling
  Select Modeling Technique
    Describe Data for Insurance case
    Determining correlation coefficients, selection of algorithms & method
  Generate Test Design
    Generate Test Design for our insurance case
  Build Model
    SAP Predictive Analysis R based Decision tree
    SAP Predictive Analysis Decision tree, scoring the model
    SAP PA with Decision Tree algorithm using boosted and binned data
    SAP Predictive Analysis Logistic Regression (HANA PAL)
    SAP Predictive Analysis Logistic Regression (HANA PAL) with balanced dataset
    SAP Predictive Analysis Logistic Regression (HANA PAL), boosted dataset
    SAP Predictive Analysis Logistic Regression (HANA PAL), binning & boosted dataset
    SAP Predictive Analysis Neural network
    SAP Predictive Analysis Neural Network, boosted data
    SAP HANA PAL natively using HANA Studio
    The HANA PAL CHAID based Decision Tree
    Scoring the model
    SAP HANA PAL C4.5 Decision Tree
    SAP Predictive Analysis correlation matrix to determine important variables
    SAP InfiniteInsight building model
    SAP InfiniteInsight using Gain chart and default data description
    SAP InfiniteInsight using Gain chart and actual metadata description
    SAP InfiniteInsight using Decision Mode and default metadata description
    SAP InfiniteInsight Decision mode and actual metadata description
    SAP InfiniteInsight gaining additional insights from built models
    SAP Predictive Analysis custom R scripts, Naïve Bayes
    Re-modeling in SAP Predictive Analysis with the variables detected by SAP InfiniteInsight
    Advanced data mining with ensemble models, SAP PA
    Advanced data mining with ensemble models, SAP PA & SAP InfiniteInsight
    Advanced data mining with ensemble models with voting and sorting
  Model comparison, high-level summary of algorithm performance
Evaluation
  Evaluate Results
  Review Process
  Determine Next Steps
Deployment
Conclusion
Perspectives
Reference Documentation
Evaluation of strength of correlation-coefficient

17 Appendix: Data dictionary
Appendix: Master Data dictionary, Customer type
Appendix: Data dictionary, Customer masterdata
Appendix: Variables to be used in the Naïve Bayes algorithm
Appendix: Using SAP HANA PAL natively

1 Document update history

- Kurt Holst: Initial document
- Kurt Holst: Background and overview
- Kurt Holst: Building data understanding
- Atul Manga & Kurt Holst: SAP PA, exploring possible algorithms to predict
- Kurt Holst: Adding best practices algorithms
- Atul Manga & Kurt Holst: Finalizing first draft to share internally for peer review and quality assurance
- Atul Manga & Kurt Holst: Incorporating feedback from internal peers
- Kurt Holst: Adding KXEN software tool
- Kurt Holst: Finalizing document in preparation to share on SAP Community Network
- Kurt Holst: Sharing on SAP Community Network
- Kurt Holst: Update with SAP PA SP14
- Kurt Holst: Update with feedback from peers, specifically Erik Marcade, CTO KXEN, and John MacGregor, VP and Head of the Centre of Predictive Analytics, CD&SP
- Atul Manga & Kurt Holst: Proof-reading document
- Kurt Holst: Finalizing document to SAP Community Network
- Kurt Holst: Adding ensemble models method
- Kurt Holst: Adding HANA PAL Decision Trees

2 Executive summary

The purpose of this article is to demonstrate how the two data mining tools, SAP Predictive Analysis and SAP InfiniteInsight (formerly known as KXEN InfiniteInsight), could be used for research and investigation to answer a competition created by Computational Intelligence and Learning (CoIL) with real-world business questions and actual customer data. As shown in this research article, using the combined strengths of SAP Predictive Analysis and SAP InfiniteInsight can produce a very decent result in an international data mining challenge with real data and business questions; in fact, the best performing model would come in at a tied 1st place with an American professor of statistics and his team. This article takes the reader through an end-to-end scenario using the framework from a well-established and structured implementation approach specific to data mining, in the context of using SAP Predictive Analysis, SAP InfiniteInsight and SAP HANA Studio to provide a solution to the data mining challenge. The data file features the actual dataset from an insurance company and contains 5822 customer records, of which 348, about 6%, had caravan policies. Each record consists of 86 attributes, containing socio-demographic data and product ownership. The test or validation set contained information on 4000 customers randomly drawn from the same population. For model comparison purposes, each model must identify which 800 of the 4000 customers in the test set would be most likely to buy caravan policies. In other words, when scoring a model, the 20% of customers with the highest predicted probability (800 of the 4000 in the test set) must be presented and evaluated for a true prediction. Selecting the 800 customers randomly would on average only produce 48 customers that would actually buy. The competition had two objectives, the first being to build a model that could predict the most buyers among the 800 most probable customers in the testing dataset.
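The scoring rule described above (rank customers by predicted purchase probability, take the top 800 of 4,000, and count the actual buyers among them) can be sketched in a few lines of plain Python. This is an illustrative reconstruction of the CoIL evaluation, not code from the competition, and the toy data below is hypothetical:

```python
import random

def coil_score(probabilities, actual_buyers, top_n=800):
    """Rank customers by predicted purchase probability and count
    how many of the top_n highest-scored customers actually bought."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return sum(actual_buyers[i] for i in ranked[:top_n])

# Toy illustration: 4,000 customers, ~6% buyers, purely random scores.
random.seed(0)
actual = [1 if random.random() < 0.06 else 0 for _ in range(4000)]
scores = [random.random() for _ in range(4000)]

# A random model should find roughly 6% of 800 = 48 buyers in its top 800.
print(coil_score(scores, actual))
```

A perfect model would place all buyers at the top of the ranking, so its score equals the total number of buyers (capped at 800); the gap between that ceiling and the random baseline is what the competing models fight over.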
Compared to the other contestants of the competition, this would place our predictive models in a shared 1st, 2nd and 3rd place. The end results are summarized in this comparison matrix, where the overall goal of the challenge was to predict as many customers as possible that would be interested in buying a caravan insurance policy. The results from the research showed that the R-based Naïve Bayes model predicted 92 potential customers willing to buy a caravan policy using SAP Predictive Analysis, 116 based on SAP Predictive Analysis using Decision Tree, 117 customers with Logistic Regression with default settings, 110 using SAP InfiniteInsight default settings including no metadata description, and

the best being 118 customers using SAP InfiniteInsight with the actual metadata description. In the public challenge the overall winner predicted 121, with significant effort from a university professor of statistics and his team. Secondly, the competition had another objective, which was to find and explain the driving variables that make customers buy caravan insurance. As shown in the section on SAP InfiniteInsight, it has the capability of visualizing how each variable impacts the target variable and, more interestingly for answering this second objective, the Category Significance functionality. Not all that surprisingly, the variable Contribution car policies showed a significant correlation to the target variable, buying caravan insurance, but it also showed that categories 6 to 8 drove a higher likelihood of a customer buying, and vice versa for categories 0 to 5. Other variables driving the target were, besides the car policies, the fire policies and the customer subtypes. The tools' powerful visualizations, such as the decision tree, support answering the second objective. The research of course analyzes specifically binary classification models, a single, but very important, class of predictive models, and we should be careful extrapolating our success to other classes of models. As demonstrated in this article, our experiments show that both of our front-end predictive analysis tools, SAP Predictive Analysis and SAP InfiniteInsight, provide very good off-the-shelf prediction results in the challenge using the classification algorithms Logistic Regression, Decision Tree or InfiniteInsight's built-in classification. As shown in this article, if necessary, new interaction features or cross products, binning, balancing or boosting (oversampling) can improve accuracy without impairing comprehensibility and discussions with the domain experts.
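The kind of per-category insight described above (e.g. car-policy contribution levels 6 to 8 driving a higher purchase likelihood) amounts to comparing the buy rate within each category level against the overall rate. A minimal sketch of that computation, using a hypothetical miniature of the data rather than the actual CoIL records:

```python
from collections import defaultdict

def buy_rate_by_category(categories, bought):
    """Mean of the binary target within each category level."""
    totals = defaultdict(lambda: [0, 0])  # level -> [buyers, count]
    for cat, b in zip(categories, bought):
        totals[cat][0] += b
        totals[cat][1] += 1
    return {cat: buyers / count for cat, (buyers, count) in totals.items()}

# Hypothetical miniature of the "Contribution car policies" variable.
car_policy_level = [0, 0, 5, 6, 6, 7, 8, 8, 2, 6]
bought_caravan   = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]

rates = buy_rate_by_category(car_policy_level, bought_caravan)
# In this toy data, levels 6-8 show a clearly higher buy rate than 0-5.
print(rates)
```

SAP InfiniteInsight's Category Significance view presents essentially this comparison (with statistical safeguards) for every variable at once.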
Furthermore, the article explains how to prune a dataset, balance or boost a training dataset, score a model, and use SAP Predictive Analysis and SAP InfiniteInsight to build models on known data and run trained models on new (unknown) data to predict the most probable outcomes. Going forward, we could enhance our scoring even further by using some of the other data mining techniques described in this article, such as further experiments with binning, creating new columns based on other columns, deriving variables, etc. As shown in this research article, the end user is not required to have expert programming skills or a PhD degree in statistics to achieve a very decent result in data mining with SAP Predictive Analysis and SAP InfiniteInsight. The focus of this article was to demonstrate that SAP Predictive Analysis, R and SAP InfiniteInsight out-of-the-box can produce best-in-class results. To compete with the winning submission, we will also demonstrate how to use more advanced data mining techniques to squeeze a few extra percentage points out of the predictive models.
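Boosting (oversampling) a rare positive class, as mentioned above, simply means replicating minority-class records in the training set until the classes are better balanced. A minimal sketch, independent of any SAP tool, with hypothetical rows:

```python
import random

def oversample_minority(rows, is_positive, factor):
    """Replicate positive (minority) rows `factor` times so the
    classifier sees a more balanced training set; shuffle afterwards
    so the replicated rows are not clustered together."""
    boosted = []
    for row in rows:
        if is_positive(row):
            boosted.extend([row] * factor)
        else:
            boosted.append(row)
    random.shuffle(boosted)
    return boosted

# Hypothetical training rows: (customer_id, bought_caravan)
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0)]
boosted = oversample_minority(train, lambda r: r[1] == 1, factor=4)
positives = sum(1 for r in boosted if r[1] == 1)
print(len(boosted), positives)  # 8 rows, 4 of them positive
```

Note that oversampling is applied only to the training set; the test set must keep its natural 6% buyer rate, otherwise the scores reported above would not be comparable.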

3 Background

This article is for educational purposes only and uses actual data from an insurance company. The authors have closely observed the online Predictive Analysis community, where members are constantly looking for two important things: which statistical algorithm to choose for which business case, and real-life business cases with actual data. The latter is certainly difficult to find but immensely valuable for understanding this subject. However, in today's competitive market environment such data is kept under wraps due to data security legislation and also because it can be a competitive advantage for the company. Finding a real-life case for educational purposes with actual data is therefore a challenge. In this article we address the practical use of actual data. The data comes from the CoIL (Computational Intelligence and Learning) challenge from the year 2000 (CoIL, 2000), which had the following goal: Can you predict who would be interested in buying a caravan insurance policy and give an explanation why? The overall goal is broken down into two research questions:

- Predict which customers are potentially interested in a caravan insurance policy.
- Describe the actual or potential customers, and possibly explain why these customers buy a caravan policy.

The purpose of this article is to demonstrate how SAP Predictive Analysis and SAP InfiniteInsight could be used for research and investigation to answer these questions.

Figure 1: What determines the set of algorithms that can assist predicting & describing?

Today companies focus heavily on marketing strategies: how to approach customers and which marketing channels to use. One of the business benefits of using predictive analysis could be to determine customer preferences and accordingly develop targeted marketing campaigns; for example, use direct mailing to a company's potential customers as an effective way to market a product or a service, instead of mass-distributed or junk mail. If companies have a better understanding of who their potential customers are, they will know more accurately who to send it to, and thereby reduce waste and expense. Among the other benefits that could be realized from using this methodology together with SAP Predictive Analysis and SAP InfiniteInsight are the following:

- The right segmenting and selection of variables in a data mining solution makes the story-telling with the predictive findings more trustworthy; in essence the organization will know what the overall goal and strategic objectives are, and how they individually can contribute to the company by achieving the specific underlying KPIs or variables.
- Predicting the outcome based on historical variables, and especially determining whether a change in a number of independent variables will have a later impact on other dependent variables.
- Measuring the correlation strength of variables, which can help select the right variables for a classification algorithm. Using statistical methodology to prove the correlation between variables can uncover unknown information.
- Sanitation of variables in a data mining solution: many times a lot of variables are put into an algorithm just to be safe or for organizational reasons. Evaluating whether there is actually a need for all these variables can lead to fewer variables to extract, clean & monitor. This can also lead to higher transparency and less need for maintenance and data preparation.
- Increasing the integrity and evidence-based reasoning of the story told in a predictive case.
This is just one example of how SAP Predictive Analysis, with just one algorithm, can enhance a performance management solution. This research is based on SAP Predictive Analysis version 1 Service Pack 14 and SAP InfiniteInsight. Besides providing statistically proven reasoning for the predictive case, SAP Predictive Analysis can also assist in many other areas, which will be further explored in this article.

"The frontier for using data to make decisions has shifted dramatically. Certain high-performing enterprises are now building their competitive strategies around data-driven insights that in turn generate impressive business results. Their secret weapon? Analytics: sophisticated quantitative and statistical analysis and predictive modeling." Quote from Thomas H. Davenport in Competing on Analytics

4 Definitions and expected foundations for the reader

This article provides a step-by-step description, including screenshots, to evaluate how SAP Predictive Analysis can be used to predict the potential customers who will buy additional products based on their behavior of interest.

4.1 Terms and definitions

Throughout this article different terms and definitions are used and, to provide you with a common understanding of these, a short description is given below. Each of the applied algorithms is discussed in more detail in section 10, Modeling.

4.1.1 CRiSP-DM

The Cross Industry Standard Process for Data Mining is a hierarchical process model which consists of phases and tasks. This model is also used to give an overview of the life cycle of a data mining project. CRiSP-DM was developed during 1996 by three veterans of the young and immature data mining market. More detailed information on the model is given in a following chapter.

4.1.2 SEMMA

SEMMA is a method developed by SAS Institute which contains 5 steps: Sample, Explore, Modify, Model and Assess. These are sequential steps used for the implementation of data mining applications.

4.1.3 COIL

CoIL: Computational Intelligence and Learning.

4.1.4 Data Mining

This is the process of analyzing and summarizing data in order to discover patterns and trends and identify relationships. Data mining uses mathematical algorithms and different techniques such as artificial intelligence, neural networks and statistical tools to segment data.

4.1.5 Decision tree

A decision tree is an analysis diagram used as a support tool that helps decision makers decide between options and outcomes. A decision tree consists of nodes and branches and is used in various contexts such as graph theory, algorithms and probability.

4.1.6 Neural network

A neural network is a mathematical model with the purpose of learning to recognize patterns in the data.
In order to recognize patterns, the neural network must be trained on sample data using different techniques. The goal of a neural network is to perform predictions on future data.
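To make the decision-tree idea above concrete, the smallest possible tree is a single split node (a "stump"): one question about one variable, with a yes/no branch. The sketch below is purely illustrative, with made-up data, and is not how SAP Predictive Analysis or HANA PAL builds trees:

```python
def best_stump(values, labels):
    """Find the single threshold on one numeric variable that best
    separates a binary target: predict 1 when value >= threshold.
    Returns (threshold, accuracy) for the best split found."""
    best = (None, 0.0)
    for t in sorted(set(values)):
        correct = sum(1 for v, y in zip(values, labels)
                      if (1 if v >= t else 0) == y)
        acc = correct / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best

# Hypothetical: number of car policies vs. bought-caravan flag.
policies = [0, 1, 1, 2, 3, 3, 4, 5]
bought   = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_stump(policies, bought))  # splits cleanly at threshold 3
```

Real tree algorithms such as CHAID or C4.5, used later in this article, repeat this kind of split search recursively on each branch and use statistical criteria rather than raw accuracy, but the node-and-branch structure is the same.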

4.1.7 Naïve Bayes

Naïve Bayes is a group of statistical techniques that uses conditional probabilities for predicting and classifying outcomes. It is called naïve because the algorithm does not take into account any correlation between the independent variables.

4.1.8 Linear regression model

This model analyzes the relationship between a scalar dependent variable and one or more independent variables. There exist different regression models and techniques.

4.1.9 Logistic regression model

Logistic regression is specifically designed for Boolean dependent or target variables and is often used in cases where the goal is to predict false or true outcomes, such as insurance fraud, churn etc.

4.1.10 Ensemble model

Ensemble models are considered a more advanced data mining technique where multiple models are combined to produce better predictions and more robust models.

4.1.11 Confusion matrix

A confusion matrix is used for classifying results into actual and predicted information. Usually this matrix contains two rows and two columns with the following four outcomes: false positives, false negatives, true positives, and true negatives. This is commonly used within supervised machine learning, ideally together with a cost structure. Def.: TP = True Positives, FN = False Negatives, FP = False Positives and TN = True Negatives. The classifier confusion matrix should be supplemented with several measures of model quality to determine which model gives the most comprehensible rules.

The accuracy (AC) is the proportion of the total number of predictions that were correct: AC = (TP+TN) / (TP+FN+FP+TN)

Sensitivity (S) measures the model's ability to identify positive results: S = TP / (TP + FN)

Finally, precision (P) is the proportion of the predicted positive cases that were correct.
P = TP / (TP + FP)

Reference: Proposed calculation to score a model (MacGregor, 2013)

4.1.12 Stepwise regression

This is a step-by-step method consisting of a process of adding and removing independent variables from a model in order to find the best statistically significant solution. The main approaches to stepwise regression are usually divided into:

- Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.

- Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) whose removal improves the model the most, and repeating this process until no further improvement is possible.
- Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.

4.1.13 SAP Predictive Analysis

SAP Predictive Analysis is a statistical analysis and data mining solution that enables you to build predictive models to discover hidden insights and relationships in your data, from which you can make predictions about future events. SAP Predictive Analysis is a complete data discovery, visualization, and predictive analytics solution designed to extend your current analytics capability and skillset, regardless of your history with BI. It is an intuitive, drag-and-drop, code-free experience with enough power for data scientists to conduct more sophisticated analysis using Big Data, yet simple enough to allow business analysts to conduct forward-looking analysis using departmental data from Excel. SAP Predictive Analysis is a rich client application that allows you to:

- Intuitively design complex predictive models
- Visualize, discover, and share hidden insights
- Unleash Big Data with the power of SAP HANA
- Embed predictive capabilities into apps and BI environments
- Get real-time answers

4.1.14 SAP InfiniteInsight (formerly known as KXEN InfiniteInsight)

SAP InfiniteInsight is a predictive modeling suite developed by KXEN that assists analytic professionals and business executives in extracting information from data. Among other functions, SAP InfiniteInsight is used for variable importance, classification, regression, segmentation, time series and product recommendation.
One of the driving technologies behind SAP InfiniteInsight is, to the writers' knowledge, Statistical Learning Theory, or Vapnik-Chervonenkis theory, accredited to the mathematician Vladimir Vapnik. The KXEN company was acquired by SAP in Q
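To tie together the confusion-matrix measures defined earlier in this chapter, here is a minimal sketch computing accuracy, sensitivity and precision from the four cells; the example numbers are hypothetical, loosely shaped like the top-800 selection used later in the article:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and precision as defined above:
    AC = (TP+TN)/(TP+FN+FP+TN), S = TP/(TP+FN), P = TP/(TP+FP)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, sensitivity, precision

# Hypothetical scoring of 800 selected customers out of 4,000:
# 118 buyers found, 682 non-buyers contacted, 120 buyers missed.
ac, s, p = confusion_metrics(tp=118, fn=120, fp=682, tn=3080)
print(round(ac, 3), round(s, 3), round(p, 3))
```

Note how accuracy alone can be misleading on a 6% minority class (always predicting "no buyer" scores about 94%), which is why sensitivity and precision, ideally weighted by a cost structure, matter more here.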

5 Data mining implementation methodologies

With data mining projects it is essential to follow a standard process or methodology, because it provides a framework for recording experiences and enables projects to be replicated by anyone. Furthermore, using a methodology encourages best practices and helps to obtain better results. In this research, different methodologies have been considered with regard to the implementation of predictive analysis. To select the appropriate strategy, our decision is partly based on a survey from August 2007, in which users were asked what they considered the main methodology for data mining. As seen in figure 2, CRiSP-DM got the most votes.

Figure 2: Poll about Data Mining Methodology from 2007

Considering the different methodologies in the figure above, CRiSP-DM is the most well-known and generic method within the data mining community. Furthermore, CRiSP-DM was developed in collaboration with a number of companies, while SEMMA, for instance, was developed by one company, namely SAS Institute. The "My own" approach is a very subjective one and would therefore not be useful in this research. A "My organization's" methodology can also vary by company and industry and is therefore not appropriate for this research either. Furthermore, the CRiSP-DM methodology also has a phase about business understanding, which sets it apart from SEMMA. Based on these arguments, the CRiSP-DM process model has been selected as our implementation method. The following chapters will walk through and explain the implementation method.

5.1 Data mining implementation methodologies in a SAP Predictive Analysis perspective

SAP Predictive Analysis was first made available around 2012 and hence has not been around as long as offerings from other big vendors; however, its capabilities and its natural extension of the SAP BusinessObjects BI portfolio seem to fit quite well according to the Forrester Wave, Q1.

Figure 3: Forrester Wave: Big Data Predictive Analytics Solutions, Q1

In addition to supporting statisticians and professional data analysts, the power of SAP Predictive Analysis can be experienced by everyone in the organization by extending predictive functionality into business applications, business intelligence and collaboration environments, and onto mobile devices. SAP offers line-of-business and industry-specific applications powered by SAP HANA and its predictive functionality. Later in this research there will also be a focus on the SAP InfiniteInsight software, as it is also a part of SAP.

6 Implementation approach using CRiSP-DM

Using the CRiSP-DM implementation approach is normally considered a best practice within the data mining community. As explained earlier, the process model consists of different phases, as shown in the figure below.

Figure 4: An overview of the CRiSP-DM model

According to the CRiSP-DM process model (Chapman et al., 1999), the top-level knowledge discovery process consists of business understanding, data understanding, data preparation, modeling, evaluation and deployment. As illustrated in the figure above, there is an iterative flow between the different phases. The structure of this article is to follow the process and methodology quite rigorously, so that it can be re-used in other predictive analysis scenarios. The CRiSP-DM phases of business understanding, data understanding, data preparation, modeling and evaluation are covered in this article; deployment, however, is deemed outside the scope, as it is handled case by case in a custom landscape and is more differentiated than the first five phases. Each phase in the methodology contains a number of second-level generic tasks. These tasks are defined as generic because they can be used in different data mining situations. Furthermore, these tasks are considered complete and stable, meaning that they cover the whole process of data mining and are valid for unforeseen developments.

Figure 5 shows the scope of this article, with the phases and second-level tasks from the CRiSP-DM model.

Figure 5: Overview of implementation approach

Looking deeper into the CRiSP-DM model, there are also third- and fourth-level specialized tasks, which explain how the generic tasks should be performed in certain situations. The third level can be considered as discrete steps or as a sequence of events which should be executed in a certain order. The fourth level can be described as the actions, decisions and results of a data mining engagement. Examples of third- and fourth-level tasks will be given later in each phase.

7 Business understanding

The purpose of business understanding is to focus on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. This is achieved by performing the second- and third-level tasks shown in figure 6 for the business understanding phase. In the following subsections each of these tasks will be addressed.

Figure 6: Tasks for the business understanding phase

7.1 Determine Business Objectives

First of all, background information about the company's business situation is stated in this section. In this case, the research is based on data from a data mining competition from 2000, the CoIL Challenge. The data is gathered from various insurance companies. The second step is to focus on the business objectives: what the customer's primary objective is and what they want to achieve. Furthermore, it is important to identify at the beginning of the project any vital factors which can influence its outcome. The objective is to predict potential customers who want to buy a caravan insurance policy using predictive analysis. The last generic task concerns the business success criteria: what the criteria for a successful outcome of the project are, and how they are measured. In this case the criterion is to predict as many potential customers as possible who are willing to buy a caravan policy.

7.1.1 Determine Business Objectives for the insurance case
The background for this research is to predict and explain policy ownership, a direct marketing case from the insurance sector. The objective is to predict potential customers who want to buy a caravan insurance policy using predictive analysis. The success criterion is to predict as many potential customers as possible who are willing to buy a caravan policy, and possibly to describe these customers and explain why they buy a caravan policy.
7.2 Assess situation
In this task, the following topics will be addressed: resources, requirements, constraints and assumptions. This involves a resource inventory: the resources available for the project, what kind of data is available and which software tools are applicable for the research. Furthermore, risks and contingencies will also be discussed.
7.2.1 Assess situation for the insurance case
In our research, data from the 2000 CoIL Challenge has been used and is a requirement for the further analysis. It is assumed that the data provided is reliable and valid. No potential risks or events are expected that might delay the project or make it fail. The terminology used in this research is explained earlier in section 5.1, terms and definitions. The main benefit for the insurance company is to predict the number of customers and to target them better.
7.3 Determine Data Mining Goals
A business goal states objectives in business terminology; a data mining goal states project objectives in technical terms. For example, the business goal might be "Increase catalog sales to existing customers", while a data mining goal might be "Predict how many widgets a customer will buy, given their purchases over the past three years, relevant demographic information, and the price of the item". Describe the intended outputs of the project that enable the achievement of the business objectives. Note that these are normally technical outputs.
Translate the business questions into data mining goals (e.g., a marketing campaign requires segmentation of customers in order to decide whom to approach in this campaign; the level/size of the segments should be specified). Specify the data mining problem types (e.g., classification, description, prediction, and clustering).
7.3.1 Determine Data Mining Goals for the insurance case
The overall data mining goal for the insurance company is to develop a predictive model that can help determine which of the existing customers are most likely to buy caravan insurance. The success criterion is to identify and predict as many customers as possible who are going to buy caravan insurance.

7.4 Produce Project Plan
In the project plan, the different stages of the project are explained in order to achieve the business goals; this also includes duration, required resources, inputs, outputs and dependencies. The core part of the research is the modeling and evaluation part, where different predictive analysis techniques will be used. The duration of the entire research is a couple of weeks, with a main focus on data preparation and modeling. For the initial assessment we will use algorithms and best-practice data mining techniques together with the SAP Predictive Analysis tool and perhaps also SAP HANA for in-memory processing of the algorithms. For this insurance case, with a sample of around 6000 training transactions and 4000 test/validation transactions, executing the algorithms directly on a laptop with SAP Predictive Analysis installed should perform sufficiently. However, in a scenario with much larger data quantities the optimal solution performance-wise would be the use of SAP HANA; moreover, this would also allow the consumption of the trained models in HANA-facilitated end-user applications.
7.4.1 Produce Project Plan for the insurance case
Here is the project plan for the insurance case; see the figure below for more details. Figure 7: Project plan example for a Predictive Analysis project As shown, this research is a 6-week project where the data preparation and modeling phases are the most time-consuming tasks, each taking approximately 3 weeks. This time may be reduced by the use of automation such as proposed through SAP InfiniteInsight or SAP Rapid Deployment Solutions. In a Proof of Concept (PoC) scenario the deployment phase would not have a start/end time, as it would not be part of the scope. Demos or PoCs should not be planned for production purposes, but could of course be harvested for the knowledge acquired during the workshops: business and data understanding, how each algorithm scored, etc.

The influence of frontend tools such as SAP InfiniteInsight and SAP Predictive Analysis: the CRISP-DM based implementation approach is generic, and in situations where you are fortunate enough to have the data warehouse, data quality and business understanding readily at hand, it can of course cut the implementation time significantly, allowing the data mining team to quickly produce actual results and even present findings within a very few days. SAP InfiniteInsight has capabilities that allow for a higher degree of automation, for instance in selecting modeling techniques. However, it is the experience of the authors that trying to cut out any phase without having gone through the process, such as for instance business or data understanding, can lead to misinterpretations of the results produced by complex algorithms to answer a specific business question. Moreover, if the data quality is not sufficient it can also reduce the gain that was expected. Having said that, some phases can be completed more quickly than others, all depending on factors such as the specific business question, how well the data is prepared and of course the experience that the data mining team brings to the project. Using SAP Rapid Deployment Solutions (RDS) for predictive analysis designed for a specific line of business would of course also allow for a faster implementation than starting from scratch. Moreover, using an SAP RDS approach would allow the reuse of accumulated best practice.

8 Data understanding
In this section the data in the research will be explained. The steps in this phase are first an initial data collection, followed by a description of the data, progressing with exploring the data and finally verifying the data quality; see figure 8. The overall purpose of this phase is to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses about hidden information. Figure 8: Tasks for the data understanding phase
8.1 Collect initial data
In this task the acquired datasets will be explained, together with their locations and the methods used to collect the data in a project. The required datasets should be aligned with the earlier phases and listed in the project resources. This initial collection includes extracting and loading the data. At this stage the data will be loaded into SAP Predictive Analysis (or Lumira) to facilitate the data understanding phase. Describing these details will help others to replicate or perform similar projects in the future. The dataset consists of multiple text files: a master data file and a data file, which were acquired from an insurance company's real database. This data was delivered in the form of the previously described data mining competition from year 2000. The challenge was organized by the Computational Intelligence and Learning (CoIL) cluster, a cooperation between four EU-funded research networks.

8.1.1 Collect initial data for the insurance case
The real data was first published with the CoIL challenge. See more about the data in the appendix data dictionary. You can collect your own copy of the data from the CoIL challenge site.
8.2 Describe Data
This task examines the surface properties of the acquired data. The data type of each individual variable also narrows the number of algorithms that can actually be used; refer to the table "Which Algorithm When", which at a high level illustrates which algorithms can be fed with either categorical or numeric data (or both).
8.2.1 Describe data for the insurance case
As previously mentioned, the data contains two data files. The data file features the actual dataset; this dataset contains 5822 customer records, of which 348, about 6%, had caravan policies. Each record consists of 86 attributes, containing sociodemographic data (attributes 1-43) and product ownership (attributes 44-86). The sociodemographic data is derived from zip codes. All customers living in areas with the same zip code have the same sociodemographic attributes. Attribute 86, "CARAVAN: Number of mobile home policies", is the target variable. All values in each record of the dataset are defined as numeric. The master data file contains a text-based explanation of the 86 different attributes. The test or validation set contains information on 4000 customers randomly drawn from the same population. For model comparison purposes each model must be evaluated on which 800 customers it finds most likely to buy caravan policies. This means that when scoring a model, the highest-scoring 20% (800 customers of the 4000 in the test set) must be presented and evaluated for a true prediction.
8.3 Explore Data
This task focuses on overall querying, visualization and reporting of the data. This includes the distribution of key attributes, simple aggregations, relations between attributes and simple statistical analysis which could suggest further examination.
This comprises an initial investigation of the data. From within SAP Predictive Analysis, create a new document and select the data source appropriate for where the data resides. See figures 9-13 below.

Figure 9: Starting with a new model in SAP Predictive Analysis Figure 10: Creating a new document in SAP Predictive Analysis

Figure 11: Attaching a data source for consumption in SAP Predictive Analysis Figure 12: Data pulled into SAP Predictive Analysis SAP Predictive Analysis allows users to reduce the number of columns coming into the algorithms and also to filter the number of transactions fed into the algorithms. Reducing the number of columns and rows increases the transparency and reduces the processing time while developing predictive workflows; it is good practice to think big but start small.

8.3.1 Gaining data understanding
Figure 13: Filtering Columns and Rows in SAP Predictive Analysis The first step in SAP Predictive Analysis is to create a new model. Next, present the data as shown below in figure 14 for visual interpretation and potentially enhanced data understanding: Figure 14: Data pulled into SAP Predictive Analysis

Furthermore, use SAP Predictive Analysis to visualize attributes pair by pair to determine any correlation. This is shown in various scatter matrix charts in figure 15. Figure 15: Predictive Analysis visualizing data Scatter matrix chart The next chart, in figure 16, is the parallel coordinates chart, a visualization technique for displaying multi-dimensional data and multivariate patterns in the data for analysis. In this chart, by default, the first five attributes are represented as vertically-spaced parallel axes. To select the subset of attributes to be viewed in the chart, use the Settings option in the top right corner of the parallel chart. Each axis is labeled with the attribute name and the minimum and maximum values for the attribute. Each observation is represented as a series of connected points along the parallel axes. You can select the color-by option to filter the data based on a categorical value. Figure 16: SAP Predictive Analysis (SP14) visualizing data - Parallel coordinates

Does this uncover any obvious visual correlation between the attributes in the insurance dataset? Hard to tell, but interacting with SAP Predictive Analysis, moving the attributes back and forth, can potentially reveal previously unknown information. As shown, SAP Predictive Analysis can help illustrate and understand the relationship between the attributes. Using SAP Predictive Analysis with the parallel charting capabilities reveals how to interpret the unknown information as presented in figure 17: high values in attribute x result in high values in attribute y, whereas attribute z has lower impact. Figure 17: Predictive Analysis (SP14) visualizing data - Parallel coordinates Visualizing data interactively with network analysis: Figure 18: Predictive Analysis (SP14) visualizing data - network analysis

In the following section of this article the different algorithms will help support the evidence-based reasoning when selecting and pruning the attributes needed to predict which customers will potentially buy additional insurance. The knowledge gained using SAP Predictive Analysis statistical functionality can be an important supplement to how one manages and selects attributes. The result of the exercise can, for instance, be used to validate, redesign and simplify the number of attributes needed to predict who will buy, click or churn (Siegel, 2013). Evaluating and actually measuring the hypothesis of cause and effect could imply a higher possibility of getting the right measures and, in essence, the ability to be more proactive.
8.4 Verify Data Quality
This task covers questions like: is the data correct, are there missing values in the data and, if so, how often do they occur?
8.4.1 Verify data quality for the insurance case
Using the Facets function to illustrate the data granularity in SAP Predictive Analysis can provide valuable information for data understanding. Figure 19: Screenshot of the data in SAP Predictive Analysis As for our insurance case, in the field Number_of_mobile_home_policy we can see that 348 of the records are indicated as buyers and the remaining 5474 as not buying. The whole exercise can be focused on this very aspect: the predictive model should determine the patterns that lead to Buying, so that we can predict who will be most likely interested in buying a mobile home insurance policy. Figure 20: Results from SAP Predictive Analysis The quality of the data seems to be in good shape, without critical errors or many missing values. This enables us to use the data and proceed with the research to the next section, data preparation.
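The two checks discussed above, counting missing values per attribute and tallying the target classes, can be sketched in plain Python. The records below are invented stand-ins for the insurance rows, not the actual dataset:

```python
# Minimal data-quality sketch: count missing values per attribute and
# tally the target classes (hypothetical sample records).
records = [
    {"Customer_Subtype": 8,    "Number_of_mobile_home_policy": 0},
    {"Customer_Subtype": 33,   "Number_of_mobile_home_policy": 1},
    {"Customer_Subtype": None, "Number_of_mobile_home_policy": 0},
]

missing = {}        # attribute name -> number of missing values
target_counts = {}  # target value   -> number of records
for rec in records:
    for col, val in rec.items():
        if val is None:
            missing[col] = missing.get(col, 0) + 1
    t = rec["Number_of_mobile_home_policy"]
    target_counts[t] = target_counts.get(t, 0) + 1
```

On the real dataset, `target_counts` would reproduce the 348 buyers versus 5474 non-buyers noted above.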

9 Data preparation
The next phase is data preparation, which covers all activities to construct the final dataset from the initial raw data. The underlying tasks in this phase are likely to be performed multiple times and not in any prescribed order. This phase is often considered a time-consuming part and consists of the following subtasks: select data, clean data, construct data, integrate data and format data. This is shown in figure 21 below. Figure 21: Tasks for the data preparation
9.1 Select data
This task includes deciding what data (both attributes and records) to use for the analysis, and also providing the reasons for inclusion and exclusion. Furthermore, provide quality and technical constraints to set as limits on the data volumes and data types. In the case of the insurance example the data is already provided as-is, without the ability to add further data. However, in some cases factors can be concatenated or split up. As described earlier, each record consists of 86 attributes, containing sociodemographic data (attributes 1-43) and product ownership (attributes 44-86).

All the provided data is loaded into SAP HANA using HANA's import functionality; see figure 22. Figure 22: Importing functionality in SAP HANA After loading there will be two tables in SAP HANA, one for the training data (known buyers) and one for the validation data (unknown buyers). The next step is to prepare the data further for consumption in the SAP Predictive Analysis tool (e.g. clean the data and set the proper data type for each column). Currently SAP Predictive Analysis (version ) accepts the data types Integer, Double and String, meaning any numeric values of type Decimal, or anything else, must be converted.
9.2 Clean data
The purpose of the data cleaning report is to raise the data quality. This consists of decisions and actions to solve the data quality issues in the data. Furthermore, this could involve selecting a subset of the data or inserting missing data into the dataset. During import into SAP HANA the data is provided with the proper data type for each factor; in this case all data is either strings or integers. The model that will be built and described in the following chapters of this article could have been even more precise if the data had been less anonymized than is the case. The numerical data is all grouped into ranges instead of having exact values: for example income and contribution, which are aggregated into groups. In figure 23 the different data types for each column are shown.

Figure 23: Data types of the columns in the data set
9.3 Construct data
This subtask involves preparing the data, which includes creating new attributes, generating new records or transforming values in the existing data. An explanation of these data operations is given in this part. In order to visualize results from the data mining trials, the target column Number_of_mobile_home_policy is transformed into a string where the following values apply: 0 = Not Buying and 1 = Buying.
9.3.1 Normalizing data
It can sometimes be beneficial to scale all numeric variables to the range 0 to 1 for equal input to a model, or to calculate a Z-score, or to use decimal scaling. This is in particular the case when the variance of the data is high and the intended algorithms include distance calculations such as the Euclidean distance (for instance k-means). As the Euclidean distance is computed as a sum of variable differences, the outcome is highly influenced by the ranges of the variables. If the spread of values in one variable ranges up to 100 times larger than the others, it will have a significant influence and dictate the value of the distance, effectively ignoring the values of the other variables. If left un-normalized, the smaller-range variables would be nearly useless in any algorithm using Euclidean or similar distance calculations. When normalizing, the effect is that each of the variables can be compared fairly in terms of information content with respect to the target variable. In SAP Predictive Analysis there are 3 built-in normalization methods, besides the possibility of using R functions to enhance the normalization methods even further. In SAP Predictive Analysis SP14 the requirement is to select one column at a time per component. See figure 24.
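The two normalization methods discussed above can be sketched in a few lines of plain Python; this is only an illustration of what the built-in components compute, and the sample column is invented:

```python
def min_max(values):
    """Rescale a numeric column to the range 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center a numeric column on its mean, in units of standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# e.g. a contribution-style attribute whose range would otherwise dominate
# a Euclidean distance computed together with small-range attributes
contribution = [100.0, 250.0, 400.0, 1000.0]
scaled = min_max(contribution)        # all values now fall in [0, 1]
standardized = z_score(contribution)  # mean 0, standard deviation 1
```

After either transformation the attribute contributes on the same footing as the other variables in a k-means style distance.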

Figure 24: SAP Predictive Analysis Normalizing data
9.3.2 Construction of new variables
In SAP Predictive Analysis one can create new variables based on two or more variables, which may be more valuable in an analysis than the numerator and denominator separately, as shown in figure 25 below. Figure 25: SAP Predictive Analysis Construction of new variables
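The idea of deriving a ratio variable from two existing attributes can be sketched as follows; the attribute names are invented for illustration and do not come from the insurance dataset:

```python
def add_ratio(records, numerator, denominator, new_name):
    """Append a derived ratio column to a list of record dicts."""
    for rec in records:
        # Guard against division by zero for customers with no policies
        rec[new_name] = rec[numerator] / rec[denominator] if rec[denominator] else 0.0
    return records

customers = [
    {"premium_paid": 1200.0, "policies": 3},
    {"premium_paid": 500.0,  "policies": 1},
]
add_ratio(customers, "premium_paid", "policies", "premium_per_policy")
```

A single ratio such as premium per policy can carry more signal for a model than its two components taken separately.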

Figure 26: SAP PA (SP11) Construction of new variables (MacGregor, 2013) Binning: When presenting findings to the domain experts, binning of data can enhance the level of knowledge discovered by the algorithms. This is also known as parsimony. A continuous numeric variable can be binned or grouped to lessen the number of possible outcomes. Binning is subjective, so various binning strategies should be considered.
9.4 Integrate data
This subtask involves merging data, with operations such as joining tables together and executing aggregations over multiple records or tables. Integrating further data from other insurance companies (for instance to use affinity analysis to check if the person or family members have a buying history), public records and other big data could enhance the model even more.
9.5 Format data
Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool.
9.5.1 Formatting data in SAP InfiniteInsight
When acquiring a new data set in SAP InfiniteInsight, the tool has an automated detection capability that tries to guess each variable's data type and value (ratio type). After loading the data into SAP InfiniteInsight, the Analyze button is pressed to guess the descriptions, as shown in figure 27.
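The binning step described above can be sketched as a simple mapping from a continuous value to a label using ascending bin edges; the attribute and bin labels below are illustrative, not from the dataset:

```python
def bin_value(value, edges, labels):
    """Map a numeric value to a label using ascending bin edges.
    len(labels) must be len(edges) + 1; the last label catches overflow."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

edges = [30, 45, 60]                       # hypothetical age boundaries
labels = ["young", "middle", "senior", "elder"]
ages = [22, 37, 51, 70]
binned = [bin_value(a, edges, labels) for a in ages]
```

Because binning is subjective, the edges should be varied and the resulting model quality compared, as the text recommends.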

Figure 27: SAP InfiniteInsight display of data types The automated analysis is there simply to help users format data sets with thousands of attributes. No automated process can infer meta-data from data, and this is why users should always carefully describe the data to be used by the modeling process. Here the meta-data descriptions were provided together with the data, and one can present the SAP InfiniteInsight tool with the proper value (ratio type): nominal, ordinal, continuous ratio or textual data type in the insurance dataset. As will be shown in a later section, adding the value property to the insurance dataset increases the model score in SAP InfiniteInsight. This is a one-time operation and can be reused by saving the manually entered metadata. Figure 28: SAP InfiniteInsight saving metadata As a side note, the reader may notice that this is the first operation using SAP InfiniteInsight, since the robust automation techniques used in this modeling tool avoid previous steps such as: Explore the data: Using InfiniteInsight, the descriptive statistics of the entire data set are part of the results of a modeling activity and are not required to be produced beforehand. Furthermore, since InfiniteInsight modeling techniques work in high dimensions, there is no need to perform a-priori variable selection. Again, key influencers are part of the results of the modeling activity and are provided to business users as a way to validate the model findings.

Data Quality: Since InfiniteInsight is based on robust techniques, there is no need to perform data quality checks prior to modeling: no need to impute missing values or to remove obvious outliers. Construct Data: Whereas new business-driven variables can always be useful, there is no need to construct technically derived variables such as normalized variables or binned variables when using InfiniteInsight.
9.5.2 Format data for the insurance case
Preformatting the data can increase productivity when the data is shared in end-user tools such as SAP Predictive Analysis. Formatting at the data source: As for SAP HANA modeling with the intent to consume the data in SAP Predictive Analysis, this is shown below in figure 29. Figure 29: Pre-formatting the data in SAP HANA There might be situations where you need to convert the original data into other data types. The data source table (in HANA or similar) would be built for most scenarios; however, algorithms can take different input variables.

Below in figure 30 is shown an example where a variable might need to be converted from numeric to string, or simply from decimal to integer or double, to fulfil the specific input or output requirements demanded by an algorithm. As an example, some decision tree algorithms only accept integer data types for the target variable. Figure 30: SAP Predictive Analysis applying data to algorithms Algorithms are very individual and require variables with specific data types. The HANA R-CNR Tree accepts integers, doubles and strings.

Formatting in SAP Predictive Analysis: When acquiring a new data set in SAP Predictive Analysis, the tool displays the data types and date format. Figure 31: SAP PA display of data types With respect to reporting purposes, including graphic illustration and statistical analysis, it would be preferable to use the scale types ratio and interval, as these contain a higher degree of information. At a later stage in SAP Predictive Analysis each variable can always be binned into groups of coarser granularity, but not the other way around. The data used for the predictive models and algorithms should include all possible selection requirements, as feeding the algorithms is based on one data source at a time. The data collection report should also define whether some attributes are relatively more important than others. Remember that any assessment of data quality should be made not just of the individual data sources but also of any data that results from merging data sources. Because of inconsistencies between the sources, merged data may present problems that do not exist in the individual data sources.

Figure 32: SAP Predictive Analysis using formulas to convert from string to numeric
9.6 Balancing data
In cases where the data set is skewed, meaning the data is highly unbalanced, algorithms tend to degenerate by assigning all cases to the most common outcome. Balancing the data by discarding cases reduces the training set size, which can lead to degenerate models that fail through omission of cases encountered in the test set. Decision tree algorithms were found to be the most robust with respect to the degree of balancing applied (Olson, 2006). As will be demonstrated in this article, the Logistic Regression (HANA PAL) and Decision Tree algorithms perform very poorly on the training dataset. However, if the positive and negative cases (customers buying and not buying caravan insurance) are balanced, boosted or oversampled, both the Logistic Regression and the decision tree are able to predict a very high number of customers likely to buy additional caravan insurance. Exploring the training dataset reveals that the percentage of customers with positive matches is only around 5%, hence the majority have not bought any caravan insurance. The data can be balanced by downsampling negative cases and/or oversampling positive ones, which might make it easier to build a predictive model and might produce better models (Zhao, 2013). The process of balancing or oversampling the training dataset could be performed in different ways; in this article the authors decided to add copies of the positive matches to the training dataset instead of discarding negative matches.
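The oversampling approach the authors chose, duplicating the positive records rather than discarding negatives, can be sketched as follows (the target column name matches the one used in this article; the sample counts are illustrative):

```python
def oversample(records, target="Number_of_mobile_home_policy"):
    """Return a new list where positive records are duplicated
    until the classes are roughly balanced (~50/50)."""
    positives = [r for r in records if r[target] == 1]
    negatives = [r for r in records if r[target] == 0]
    if not positives:
        return list(records)
    factor = max(1, len(negatives) // len(positives))
    return negatives + positives * factor

# Hypothetical 5% positive rate, as in the training dataset
train = [{"Number_of_mobile_home_policy": 0}] * 95 + \
        [{"Number_of_mobile_home_policy": 1}] * 5
balanced = oversample(train)
```

With 95 negatives and 5 positives, each positive is copied 19 times, yielding a balanced set of 190 records without losing any negative cases.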

10 Modeling
This section describes the process of using statistical algorithms to build predictive models. This is shown in figure 33 below. Many data mining projects, according to various articles (CoIL, 2000a), perform less well than expected due to a lack of business understanding and alignment. Figure 33: Tasks for the modeling A data mining project is more than just trying to apply >4000 algorithms at random. Selecting the right algorithm should be determined by the questions asked by the business people. Depending on what questions and data are at hand, this leads to the selection of a proper range of algorithms that should be tested for the best possible fit. Below in figure 34 the different types of algorithms are shown. Figure 34: SAP Predictive Analysis different types of algorithms

10.1 Select Modeling Technique
In short, there are 3 steps to make sure that the modeling is performed in a controlled manner: 1. Select the modeling technique (based upon the business questions asked) 2. Build the model (choosing algorithm parameter settings if applicable) 3. Assess the model (rank the models) Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Some techniques have specific requirements on the form of the data; therefore, stepping back to the data preparation phase is often necessary. An overview of the different algorithms is illustrated in figure 35.
10.1.1 Describe Data for the insurance case
In this insurance case the data is represented either as data type string (= symbolic in the table above) or data type integer (= numeric in the table above). The objective is to determine customers who could be classified as buyers of caravan insurance. The Multiple Linear Regression algorithm only takes numeric values as input and is therefore not applicable in this case. Logistic Regression only takes symbolic values as input and is therefore not appropriate unless we treat the integer values in our data set as pure text. The reason for discussing the data types in modeling is that they are directly linked to which algorithms can be used in the end. Figure 35: SAP Predictive Analysis different types of algorithms The remaining modeling techniques are Decision Trees and Neural Networks from the SAP Predictive Analysis native algorithm library; on top of that there is also the opportunity to add algorithms based on the statistical language R. In the following, the Naïve Bayes algorithm will also be applied to our insurance data. Why? Well, the Naïve Bayes algorithm has a proven track record of performing and scoring very well in a classification problem case such as this one. Before actually building a model, generate a procedure or mechanism to test the model's quality and validity.
For example, in classification it is common to use error rates as quality measures for data mining models. Therefore, one typically separates the dataset into a train and a test set, builds the model on the train set and estimates its quality on the separate test set.
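The hold-out approach described above, splitting the dataset before modeling, can be sketched in plain Python (the 70/30 split fraction and seed are illustrative choices, not from the article):

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle the records and split them into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))                # stand-in for 100 customer records
train, test = train_test_split(data)
```

In the insurance case this step was already done by the competition organizers, who provided separate training (5822) and test (4000) sets.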

Figure 36: Predictive Analysis Which Algorithm When (MacGregor, 2013)
10.2 Determining correlation coefficients selection of algorithms & method
Kreiner (2007) and Larsen (2006) provide multiple correlation coefficients to identify a possible correlation between two variables: for instance rank correlations, which measure the strength of a monotonic relationship between rank-scale variables (ordinal categorical variables; Kendall's tau); Pearson's r for interval-scaled and normally distributed variables; and Spearman's r for interval-scaled variables where one of the two variables is not normally distributed. Below is a step-by-step method, developed based on Kreiner (2007) and Larsen (2006), to determine the correlation coefficient suitable to describe the data in this insurance example: 1) To determine which method is suitable to evaluate the relationship between factors in the dataset, the data is assessed. It is assessed whether the same groups are observed two or more times (related variables), or whether two or more completely independent groups are observed (independent variables). 2) The scale type is also one of the deciding factors when the statistical method is selected. It should be examined which scale type characterizes the data for each of the two variables to be included in the test, i.e. whether the data can be described on a nominal scale, ordinal scale, interval scale or ratio scale. Nominal Scale: The code values here are category names, and the only arithmetic operation is the equality/inequality relationship. The values cannot be ranked, and are exhaustive and mutually exclusive. Ordinal Scale: This is a scale where the sequence of code values has a specific meaning and any recoding of the variables must maintain the ordering. The key limitation of the ordinal scale is that the distance between two values is not given any concrete meaning.
Sometimes code rules are applied to an ordinal scale where increasing codes reflect a growing tendency, so the results of the analysis can be interpreted as ranked. Example: 0: Never, 1: Sometimes and 2: Often.

Interval Scale: This is a scale where the distance between the code values can be assigned a meaning. Examples of interval scales are measurements of time, distance, economy or speed, where it makes sense to calculate the distance between two code values. This could for instance be temperature in Celsius or Fahrenheit. Ratio Scale: On this scale, differences and ratios can be assigned a specific meaning. To make this possible, the scale must have a fixed zero point, which must not be changed. The difference between the ratio scale and the interval scale is that the ratio scale is based on an unchanging zero. An example could be age. Figure 37 gives an overview of the mentioned scale types.
Nominal - Comparison: equality/inequality. Characteristics: the values cannot be ranked; exhaustive and mutually exclusive. Transcoding: all 1:1 re-codings.
Ordinal - Comparison: equality/inequality, greater than/less than. Characteristics: the values can be ranked, but distances between categories are not applicable. Transcoding: all monotone (High, Medium, Low) re-codings.
Interval - Comparison: equality/inequality, greater than/less than, difference. Characteristics: can be ranked and the distance between categories gives information. Transcoding: change of zero point and scale unit.
Ratio (also known as continuous) - Comparison: equality/inequality, greater than/less than, difference, ratio. Characteristics: can be ranked and the distance between categories gives information; fixed zero point. Transcoding: change of scale unit.
Textual (InfiniteInsight) - Textual variables that can contain phrases, sentences or complete texts.
Figure 37: Reworked scale unit overview from Kreiner (2007, page 19)
10.3 Generate Test Design
The test design should describe the intended plan for training, testing, and evaluating the models. A primary component of the plan is to determine how to divide the available dataset into training, test, and validation datasets.
10.3.1 Generate Test Design for our insurance case
The test design should be linked to the business question.
In our case: how well can we predict which new potential customers will buy mobile home insurance? The goal is to achieve as high a model score as possible. The test design will comprise the use of training and validation data, and each model will be evaluated using a confusion matrix to determine its hit/fail ratio. Concerning the CoIL challenge, the test design was given as the number of correctly identified caravan insurance policy buyers within the 800 highest scores produced by the modeling technique.
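The CoIL scoring rule described above can be sketched in a few lines. This is a minimal, hypothetical Python illustration (not SAP code); the function name and toy data are ours.

```python
# Sketch of the CoIL scoring rule: count actual buyers among the N
# prospects with the highest model scores.
def hits_in_top_n(scores, actuals, n):
    """scores: model score per customer; actuals: 1 = bought, 0 = not."""
    ranked = sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)
    return sum(actual for _, actual in ranked[:n])

# Toy data: 6 customers, pick the top 3 by score.
scores  = [0.9, 0.1, 0.8, 0.4, 0.7, 0.2]
actuals = [1,   0,   0,   1,   1,   0]
top3_hits = hits_in_top_n(scores, actuals, 3)  # customers 0, 2 and 4 -> 2 buyers
```

In the actual challenge, n would be 800 (the top 20% of 4000 test customers).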

This operational metric is widely used in marketing, since the goal is to send the mailings or offers to the smallest possible list of prospects with the highest probability of taking the offer. It must be noted that the test design may impose constraints that discard some algorithms from the list of possible classification algorithms in our case: only classification algorithms able to score and rank the prospects should be considered, and not algorithms providing a simple decision that cannot be assigned a score or a probability.

10.4 Build Model

The development of predictive models is usually an iterative process where different models are evaluated according to the test design and continuously optimized with respect to the modeling objectives:

Modeling -> Evaluation -> interpreting the results with domain experts -> data understanding -> data preparation -> modeling -> ...

Some of the steps might be skipped depending on the findings and the need for enhancements in data preparation, such as combining data fields or creating new calculations based on two or more existing fields. During the modeling, the data scientists and predictive modeling experts should document the process so that it can be re-run with a different set of parameters, and should also document the reasons for choosing those parameter values. Finally, the model should be run in the modeling tool on the prepared dataset to create one or more models. The modeling process should include a description of any characteristics of the current model that may be useful for future enhancements:

Record the parameter settings used to produce the model
Give a detailed description of the model and any special features
For rule-based models, list the rules produced, plus any assessment of per-rule or overall model accuracy and coverage
Describe the model's behavior and the interpretation of the results
State conclusions regarding patterns in the data (if any); sometimes the model reveals important facts about the data without a separate assessment process

SAP Predictive Analysis R-based Decision Tree

A good starting point for our insurance case is the decision tree, which fulfills the requirements for a classification problem and also provides valuable descriptions of the findings. Furthermore, it can be used for identifying driving indicators (factors, variables etc.) and hence for pruning insignificant indicators, which can be leveraged in other algorithms.

Predictive process flow: building the model on training data and running the trained model on validation data.

Figure 38: Predictive process flow in SAP Predictive Analysis

This data mining modeling process is, as discussed earlier, an iterative one where hypotheses to answer the business questions are built into multiple models, and each model is evaluated. The models will have different parameters, numbers of variables, different sets of sample or training data etc.

The results of some algorithms allow great visualizations, such as the decision tree shown below in figure 39. This could then be used for discussions with the business people with domain expertise.

Figure 39: Evaluating the results of a Decision Tree in SAP Predictive Analysis

Using SAP Predictive Analysis' statistical explanation tab to identify the variable importance: SAP Predictive Analysis shows the details of the algorithm built on the training data, see figure 40. Variables are sorted in descending order of importance, a new feature in SAP Predictive Analysis SP14.

Figure 40: SAP Predictive Analysis algorithm summary

Variable importance generated by the decision tree algorithm:

Contribution_boat_policies, Contribution_car_policies, Contribution_fire_policies, Contribution_private_third_party_insurance, Customer_main_type, Customer_Subtype, High_level_education, Home_owners, Income_45_, Income_GT, Lower_level_education, Middle_management, National_Health_Service, No_religion, Number_of_boat_policies, Number_of_car_policies, Number_of_fire_policies, Number_of_private_third_party_insurance, Other_religion, Private_health_insurance, Purchasing_power_class, Rented_house, Roman_catholic, Singles, Skilled_labourers, Social_class_C, Unskilled_labourers, Number_of_houses, Avg_size_household, Avg_age, plus the remaining variables.

Figure 41: Overview of the variable importance

This means that instead of the 68 variables we will use the gathered information to prune the insignificant ones. Moving forward we will use the 27 variables identified as influencing the purchase of caravan (mobile home) insurance.

A confusion matrix is a way of visualizing the performance of an algorithm in supervised learning: each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The confusion matrix is also called the contingency table or the error matrix. See the figure below.

Figure 42: Confusion matrix for Decision Tree in SAP Predictive Analysis

SAP Predictive Analysis - Decision Tree: scoring the model

After having trained the decision tree on known data, the next step is to add a validation dataset and execute the trained model, including its parameters. Saving the model in SAP Predictive Analysis is performed by clicking Save as model on each algorithm.

Figure 43: Saving the model in SAP Predictive Analysis
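The confusion matrix counts described above can be computed directly from actual and predicted labels. A minimal Python sketch (names are illustrative, not from any SAP API):

```python
# Minimal binary confusion matrix, following the convention in the text:
# columns = predicted class, rows = actual class.
def confusion_matrix(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

cm = confusion_matrix([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
# cm == {"TP": 2, "TN": 2, "FP": 1, "FN": 1}
```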

Running the trained model can be done under Models, see figure 44.

Figure 44: Trained models in SAP Predictive Analysis

To run the trained model on a new (and unknown) dataset, go to the Prepare tab and add a new data source in SAP Predictive Analysis. To determine the success rate, or score, of the trained model, apply the saved model to the validation dataset:

Figure 45: Adding a saved model to a dataset in SAP Predictive Analysis

Below, the scoring of the trained model is presented using SAP Predictive Analysis' visualization capabilities. According to the insurance company, there are 238 customers who actually bought the mobile home insurance.

Figure 46: Score of the trained model using SAP Predictive Analysis, table form

Figure 47: Score of the trained model using SAP Predictive Analysis, graphic illustration

Figure 48: Visualizing results in SAP Predictive Analysis

According to the above table, the decision tree only found 38 of the 238 potential buyers, not even taking into account the test design, where only the top 20% most likely customers should be submitted for verification. Moreover, only 9 of these customers were identical to the ones identified by the actual insurance company that delivered the data. Of course, the insurance company could be wrong and have missed some customers who were likely to buy additional insurance for their caravan.

This experiment is left in the article to prove a point: as will be evident later in the article, an algorithm can perform very poorly on an unbalanced dataset, yet with a balanced dataset, binning and boosting (oversampling), the very same algorithm can suddenly be among the top performers. Lesson learned: don't discard an algorithm before it has been tested thoroughly with a broad arsenal of data mining data preparation techniques. Based on this, further optimization, and perhaps other algorithms, is needed in order to get a better score.

SAP PA with the Decision Tree algorithm using boosted and binned data:

To enhance the score of the Decision Tree algorithm, two known data mining techniques are applied. This process requires a bit of extra work on the data preparation side, but it can be rewarded with a higher model prediction score. Two new fields are created as the cross product of Contribution_car_policies and Number_of_car_policies, and of Contribution_fire_policies and Number_of_fire_policies. The fields are still represented as an integer data type.

Car_Policies: new cross product of Contribution_car_policies and Number_of_car_policies.

Another possible positive effect of binning variables is the simplification of the decision tree, shown before and after binning in the original figures.

Boosting of positive matches is performed as described in the section on the Logistic Regression algorithm. Confusion matrix for training data with binning and boosting with the CNR-Decision Tree:
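The derived field can be sketched as follows. This is a hypothetical Python illustration of the "cross product" as used here (element-wise product of two integer columns); in the article the field is created during data preparation, not with this code.

```python
# Derived Car_Policies field: element-wise product of two integer
# columns, kept as an integer.
rows = [
    {"Contribution_car_policies": 6, "Number_of_car_policies": 1},
    {"Contribution_car_policies": 0, "Number_of_car_policies": 0},
    {"Contribution_car_policies": 5, "Number_of_car_policies": 2},
]
for row in rows:
    row["Car_Policies"] = (row["Contribution_car_policies"]
                           * row["Number_of_car_policies"])

products = [r["Car_Policies"] for r in rows]  # [6, 0, 10]
```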

Confusion matrix for testing data, with binning of car and fire policies, and without binning:

The processing of the Decision Tree model showed only very marginal improvement using binning: one extra customer was scored correctly.

Optimizing the Decision Tree algorithm:

SAP Predictive Analysis Logistic Regression (HANA PAL)

Logistic regression is often used for classification problems where the dependent variable is binary (true or false, 0 or 1, buying or not buying etc.). It is therefore suitable in our case of predicting the customers most likely to buy caravan insurance (either they buy or they do not), as we want to predict the binary variable indicating buying.

During the experiments with SAP Predictive Analysis and various classification algorithms, it became evident that our training dataset is quite unbalanced. When using the training data as is, the logistic regression performed quite poorly on both the training and the testing part. After several iterations it became clear that if the positive transactions (customers buying) were balanced with the number of negative ones (customers not buying), the logistic regression performed much better. It should be noted that balancing the data in this manner shouldn't affect the validity of the model, but only improve the predictions through a better training of the model. The balancing of the dataset was described in an earlier section.

SAP Predictive Analysis Logistic Regression (HANA PAL) with balanced dataset

Data preparation: all variables, with balanced positive and negative matches. Training dataset: 348 positive matches (buying) and 348 not buying.

Figure 49: SAP Predictive Analysis balanced dataset

Resulting confusion matrix with SAP PA and the HANA PAL Logistic Regression:

Figure 50: Confusion matrix in SAP PA for Logistic Regression
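The balancing step can be sketched as undersampling the majority class, as with the 348/348 split above. A minimal Python illustration under our own naming; this is not how SAP PA performs it internally.

```python
import random

# Balance a training set: keep all positive records and sample an equal
# number of negatives.
def balance(records, target, seed=0):
    pos = [r for r in records if r[target] == 1]
    neg = [r for r in records if r[target] == 0]
    random.Random(seed).shuffle(neg)  # deterministic sample for the sketch
    return pos + neg[:len(pos)]

data = [{"buys": 1}] * 3 + [{"buys": 0}] * 10
balanced = balance(data, "buys")
counts = (sum(r["buys"] for r in balanced), len(balanced))  # (3, 6)
```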

Testing the trained logistic regression model on the original testing dataset gives the following results.

Figure 51: Results in SAP Predictive Analysis

Sampling the 20% most probable customers gives 117 true predictions.

Figure 52: Results of sample in SAP Predictive Analysis

SAP Predictive Analysis Logistic Regression (HANA PAL), boosted dataset

Data preparation: all variables, with boosting increasing the positive matches by a multiplication factor of 14. Training dataset: the boosted records with positive matches plus the records not buying.

Figure 53: SAP Predictive Analysis Logistic Regression boosted dataset

Confusion matrix for the HANA PAL Logistic Regression using the boosted training dataset.

Figure 54: Confusion matrix in SAP Predictive Analysis
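The boosting step is oversampling: each positive record is replicated, here with the factor of 14 mentioned above. A small hedged Python sketch (our own naming, not SAP code):

```python
# "Boosting" as used here: oversample the positive class by replicating
# each buying record by a fixed factor.
def boost(records, target, factor):
    boosted = []
    for r in records:
        boosted.extend([r] * (factor if r[target] == 1 else 1))
    return boosted

data = [{"buys": 1}, {"buys": 0}, {"buys": 0}]
boosted = boost(data, "buys", 14)
# 1 positive becomes 14 copies; 2 negatives stay -> 16 records in total
```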

Testing the trained logistic regression model on the testing dataset.

Figure 55: Results in SAP Predictive Analysis

Selecting the 800 customers with the highest probability, using the fitted value to sort the most likely predicted customers, reveals that the SAP PA Logistic Regression with HANA PAL can predict 117 of the 238 customers in the testing dataset.

Figure 56: Results of sample in SAP Predictive Analysis

SAP Predictive Analysis Logistic Regression (HANA PAL), binned and boosted dataset

Confusion matrix using both binning and boosting to enhance the score of the predicted customers. The experiments with binning are performed by creating new variables directly in HANA Studio, in this case by creating a cross product of the independent variables that could possibly increase the score of the model. This is currently done manually with a variable-by-variable trial-and-error approach; as a side note, it would be very interesting to have this functionality in the tools.

Figure 57: Confusion matrix in SAP Predictive Analysis Logistic Regression

Testing the trained logistic regression model on the testing dataset:

Figure 58: Results in SAP Predictive Analysis, logistic regression with binning and boosting

Sampling the 20% most probable customers gives 117 true predictions. In this case our binning of the chosen variables did not perform better than without binning.

Figure 59: Results of 20% sample in SAP PA, logistic regression with binning and boosting

SAP Predictive Analysis allows the data scientist or business analyst to create highly visually appealing end-to-end predictive flows. As shown below in figure 60, both the training of the model and the testing are visible on the same panel, and data can even be saved back to HANA for further collaboration etc.

Figure 60: SAP Predictive Analysis (SP11) logistic regression model

SAP Predictive Analysis - Neural Network

Using a neural network to build a supervised model, see figure 61.

Figure 61: Investigating neural networks in SAP Predictive Analysis

Confusion matrix for the Neural Network (R-NNet) using the training dataset. A neural network is frequently described as a black-box data mining method, because the reasoning behind the results is not readily available. Neural networks typically accept numeric data, so non-numeric data must be transformed. Decision trees handle missing values directly, while regression and neural network models ignore all incomplete observations (observations that have a missing value for one or more input variables).

Figure 62: Confusion matrix for Neural Network in SAP Predictive Analysis
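The transformation of non-numeric data mentioned above is typically done with one-hot encoding. A minimal Python sketch with illustrative category values (the helper name is ours):

```python
# One-hot encode a categorical value so a neural network can consume it
# as numeric input: one indicator column per category.
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

religions = ["Roman_catholic", "No_religion", "Other_religion"]
encoded = one_hot("No_religion", religions)  # [0, 1, 0]
```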

SAP Predictive Analysis Neural Network with boosted data

Using the same approach as described earlier with boosting data. Confusion matrix for the Neural Network on training data:

Figure 63: Investigating neural networks in SAP Predictive Analysis

Confusion matrix for testing data:

Figure 64: Investigating neural networks in SAP Predictive Analysis

The neural network is capable of predicting 105 of the customers using a sample of 992 customers. It would have been beneficial for the comparison if the NNet output could be supplemented with a probability factor, so that we could select the top 20% customers most likely to buy.

SAP HANA PAL natively, using HANA Studio

Using HANA Studio to programmatically train and test models requires, in the author's view, a different skillset than using either SAP PA or SAP InfiniteInsight. When selecting the predictive toolbox, it should be evaluated in terms of the tradeoff between the cost of generating the models and the profit gained. For the data scientist or business analyst, productivity, the ability to produce higher scores, and control of the data mining process are all related to the data mining tool.

In the following, the CoIL challenge is addressed using SAP HANA natively, through the HANA Studio interface and the HANA Predictive Analysis Library. An alternative approach to coding could be to use the SAP HANA Application Function Modeler (AFM, as of HANA SP06), which through a visual interface can assist in producing the SQL statements needed to leverage the PAL algorithms. The principle of building predictive models in SAP HANA Studio is similar to the ones described for SAP PA and SAP InfiniteInsight.

SAP HANA Application Function Modeler:

The HANA PAL C4.5 and CHAID Decision Trees: for the full code used, please refer to the appendix on using SAP HANA PAL natively at the end of this article. As shown in the comments within the code, first the model is trained on our CoIL training dataset, and secondly the trained model is run on the testing dataset.

The HANA PAL CHAID-based Decision Tree

To use SAP HANA Studio to interact with SAP HANA and the Predictive Analysis Library, one has to write some Structured Query Language (SQL) based code. The full code is in the appendix of this article. For the purpose of optimizing the model trained on the training dataset, the following algorithm-specific parameters can be leveraged. Please note this is not an exhaustive list; refer to the SAP HANA PAL documentation for a full list of possible parameters per algorithm.

-- #PAL_CONTROL_TBL;
INSERT INTO #PAL_CONTROL_TBL VALUES ('PERCENTAGE',null,1.0,null);

The 'PERCENTAGE' parameter specifies the sample size of the input training data. In our case the training dataset is separated from the testing dataset, so we chose 1.0, equal to 100% of our data.

INSERT INTO #PAL_CONTROL_TBL VALUES ('THREAD_NUMBER',2,null,null);

This sets the number of threads used for optimizing execution time in SAP HANA; as shown later in this section, the execution time is less than a second with the default value.

INSERT INTO #PAL_CONTROL_TBL VALUES ('MIN_NUMS_RECORDS',1,null,null);

The 'MIN_NUMS_RECORDS' parameter specifies the stop condition: if the number of records is less than the parameter value, the algorithm will stop splitting. In the following, the effect on the predicted number of customers is evaluated with different values of the parameter. In theory, the 'MIN_NUMS_RECORDS' parameter should be optimized to reduce overfitting.

Scoring the model with different values for the splitting condition, first with the default of 1, then set to 500:

INSERT INTO #PAL_CONTROL_TBL VALUES ('MIN_NUMS_RECORDS',500,null,null);

When the splitting condition is set to 10000, and when set to 600:

When the splitting condition is set to 800:

The best performing HANA PAL CHAID model is able to correctly predict 108 of the 238 customers in total, based on the 800 customers most likely to buy. We could have gone a bit further here and built a more advanced SQL script that would loop through a range of values for the parameter 'MIN_NUMS_RECORDS' and store the scoring results in a table. The scoring table could then be evaluated, and the best performing value could easily be identified automatically.

Scoring the model with the SAP HANA PAL C4.5 Decision Tree:

This uses the same procedure as described in the section on the SAP HANA PAL CHAID decision tree. In fact, the only change that has to be made to the SAP HANA PAL SQL script is the algorithm name called by the AFL wrapper (application function library).

CHAID Decision tree:
call SYSTEM.afl_wrapper_generator('PAL_CREATEDT', 'AFLPAL', 'CREATEDTWITHCHAID', PAL_CHDDT_PDATA_TBL);

C4.5 Decision tree:
call SYSTEM.afl_wrapper_generator('PAL_CREATEDT', 'AFLPAL', 'CREATEDT', PAL_CHDDT_PDATA_TBL);

The HANA C4.5 model is able to correctly predict 97 of the 238 customers in total, based on the 800 customers most likely to buy. One of the key advantages of the SAP HANA Predictive Analysis Library (PAL) is that the algorithms process directly in-memory, and hence the performance is very impressive; as an example, processing the decision trees takes less than a second. The HANA PAL can currently be utilized either from SAP Predictive Analysis (using HANA online and the PAL-specific algorithms) or, as shown here, directly in SAP HANA Studio.
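The suggested parameter sweep can be sketched as follows, here in Python rather than SQLScript. The train_and_score helper is a hypothetical stand-in (in practice it would call the PAL procedure and count correctly predicted buyers), and the score numbers in it are illustrative only, not the article's measured results.

```python
# Sweep candidate values of a stop condition (a stand-in for
# MIN_NUMS_RECORDS), store each score in a table, pick the best.
def train_and_score(min_records):
    # Toy lookup standing in for a real train/score round-trip.
    illustrative_scores = {1: 101, 500: 106, 600: 108, 800: 104, 10000: 95}
    return illustrative_scores[min_records]

score_table = {v: train_and_score(v) for v in (1, 500, 600, 800, 10000)}
best_value = max(score_table, key=score_table.get)  # 600 in this toy table
```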

SAP Predictive Analysis correlation matrix to determine important variables:

The correlation matrix is another way of identifying which variables drive the target in supervised learning. For a complete walkthrough of the correlation matrix algorithm, see the reference "R Correlation matrix algorithm" at the end of this article.

Figure 65: SAP Predictive Analysis Correlation Matrix

As can be deduced from the correlation matrix, our target Number_of_mobile_home_policies has a relatively high correlation with Contribution_car_policies, One_Car and Social_class_A. This information can be used to prune unnecessary variables in our dataset and focus on the ones driving our target. For a description of correlation, please refer to the appendix "Evaluation of strength of correlation-coefficient". Analyses and visualizations such as the correlation matrix assist the data scientist or business analyst in answering the second part of the CoIL challenge: explaining the cause and effects of the variables.
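The coefficient behind such a matrix is usually Pearson's r. A pure-Python sketch for two numeric columns (our own helper, not the R algorithm used in the tool):

```python
from math import sqrt

# Pearson correlation coefficient between two equal-length numeric columns.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly correlated -> 1.0
```

A full matrix is just this coefficient computed for every pair of columns.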

10.5 SAP InfiniteInsight building the model

Using SAP InfiniteInsight to build a predictive model targeting the caravan insurance variable, Number_of_mobile_home_policies:

Figure 66: Selecting variables in SAP InfiniteInsight

SAP InfiniteInsight is a very intuitive toolset and can quickly produce high-scoring predictions, as shown in the following sections of this article. Furthermore, it has additional functionality that allows calculating which actual customers to choose in a scenario where the cost of contacting a customer is known. In SAP InfiniteInsight, the selection of the default parameters is performed using the built-in RobustRegression algorithm.

Figure 67: Modeling parameters in SAP InfiniteInsight

SAP InfiniteInsight presents a high-level summary of the variables, the number of records, the algorithm area (engine name) etc. before the model is executed. Training the model on the customer dataset takes only 12 seconds using a local CSV data file. See figure 68.

Figure 68: Training the model

SAP InfiniteInsight presents a summary of the findings from training the model. As can be seen, approximately 6% of the customers are buying and 94% are not buying, which is in line with the expected results. More interesting are the Predictive Power (KI), the Prediction Confidence (KR) and the number of variables that SAP InfiniteInsight uses in the trained model. KI is the abbreviation for the KXEN Information Indicator and is the so-called quality indicator of the models. More information on these KPIs can be found in the help area of SAP InfiniteInsight. It must be noted that you never need to balance the datasets for a classification task when using SAP InfiniteInsight.

The next step is to use the trained model on the test dataset. In the following, two possible ways of scoring a model using SAP InfiniteInsight will be used: firstly the built-in capability of producing gain charts, and secondly the use of InfiniteInsight's 'Decision' mode.

SAP InfiniteInsight using a gain chart and the default data description

Using the SAP InfiniteInsight gain chart to quickly provide the actual score aligns very well with the proposed test design, where the goal is to produce a list of the 20% of customers most likely to buy caravan insurance. As shown below in the screenshot, InfiniteInsight can produce a gain chart with quantiles of 10%, which makes it very easy to select the top 20%, or the first 800, of the predicted customers most likely to buy caravan insurance. Quantile 1: 67 customers, plus 43 customers from quantile 2, for a total of 110 correctly predicted customers.

Figure 69: Gain chart in SAP InfiniteInsight, guessed metadata

SAP InfiniteInsight using a gain chart and the actual metadata description

Running the same model, but this time with the actual metadata added. Quantile 1: 70 customers, plus 48 from quantile 2, which in total makes 118 correctly predicted customers.

Figure 70: Gain chart in SAP InfiniteInsight, actual metadata

The data is scored using the confusion matrix methodology to display the data after adding the actual metadata description (value or ratio type):
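The gain chart computation behind figures 69 and 70 can be sketched as: rank customers by score, cut the ranking into quantiles, and count the actual buyers per quantile. A hypothetical Python illustration (not InfiniteInsight code):

```python
# Gain table: buyers per score quantile, best-scored quantile first.
def gain_table(scores, actuals, quantiles=10):
    ranked = [a for _, a in sorted(zip(scores, actuals),
                                   key=lambda p: p[0], reverse=True)]
    size = len(ranked) // quantiles
    return [sum(ranked[i * size:(i + 1) * size]) for i in range(quantiles)]

scores  = [i / 10 for i in range(10)]         # 0.0 .. 0.9
actuals = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]      # buyers get the top scores
table = gain_table(scores, actuals, quantiles=5)  # [2, 2, 0, 0, 0]
```

Summing the first two quantiles gives the top-20% hit count, as done in the text.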

SAP InfiniteInsight using Decision mode and the default metadata description

This section on SAP InfiniteInsight's Decision mode is just a supplement, as the gain chart method has already provided a solution in line with the test design; it is to be regarded as an alternative for situations where the gain chart method cannot fulfill the requirements. To score the trained model on the test dataset, one is directed to the Classification Decision panel, which allows choosing the decision threshold for every binary target before applying the model in 'Decision' mode. This decision will appear for each observation in the model's apply-results file.

Figure 71: Results of the training model

The decision is made through one of the following user-friendly decision criteria:

Score Threshold: indicates the score used to separate positive observations from negative observations.
Expected Percentage of Detected Target: detects x% of the target population.
Expected Percentage of Targeted Population: targets x% of the whole population.
(SAP InfiniteInsight 6.5.2, built-in help description)

Depending on the business question, this is where the data scientist must engage with the business to determine the threshold. In general terms, a low percentage of the population would mean that potential customers are missed, while a high percentage would imply contacting a lot of customers who would not want to buy. The next step in SAP InfiniteInsight is to set the population to 20%, as the CoIL assignment asks for a prediction of the 20% of customers most likely to buy caravan insurance.
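The "Expected Percentage of Targeted Population" criterion amounts to converting a population percentage into a score threshold. A small hedged Python sketch of that conversion (our own helper, not the InfiniteInsight implementation):

```python
# Find the score threshold that selects the top pct of the population.
def threshold_for_population(scores, pct):
    n = max(1, int(len(scores) * pct))
    return sorted(scores, reverse=True)[n - 1]

scores = [0.05, 0.10, 0.20, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
t = threshold_for_population(scores, 0.20)  # top 20% = 2 customers -> 0.90
selected = [s for s in scores if s >= t]    # [0.90, 0.95]
```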

Preparing to run the trained model on the test dataset with the 20% selection is shown in figure 72:

Figure 72: Selection of 20% of population

Scoring the SAP InfiniteInsight generated model, actual buyers versus predicted buyers, with a sample of the 20% most likely customers: after applying the trained model to the test dataset, the SAP InfiniteInsight predicted data is presented and allows for further investigation or export to a text file.

Figure 73: Applying the model

This is the automated approach, using SAP InfiniteInsight to guess the variable (ratio) types and keeping the default 20% population setting in the Classification Decision screen. The data is scored using the confusion matrix methodology to display the data before adding the metadata description:

SAP InfiniteInsight Decision mode and the actual metadata description

Running the same classification model, but this time after adding the actual metadata description. The data is scored using the confusion matrix methodology to display the data after adding the data description (value or ratio type):

The automated approach with the value description added was able to predict 119 customers of the total 238, using the 809 most likely customers. With the default setting in the SAP InfiniteInsight Classification Decision panel (~5% of the population), it was able to predict 36 customers, with a higher accuracy, as shown in the evaluation section of this article.

SAP InfiniteInsight: gaining additional insights from the built models

Among the additional findings produced by SAP InfiniteInsight, Contribution by Variables is a very descriptive and visual presentation of the driving indicators, shown in the following illustration:

Figure 74: Presentation of the contribution of variables

This chart presents the driving variables in order of significance. As shown in the chart, the variable Contribution_car_policies is the single most influential variable for the probability of a customer buying additional caravan insurance. The data scientist and business analyst could then go a step further and analyze the effect of the variable Contribution_car_policies to determine the actual values, using Category Significance.

Figure 75: Presentation of the category significance

As can be seen in figure 76, using the Category Significance it is clear that if Contribution_car_policies is between 6 and 8, it points strongly to a customer with caravan insurance, and vice versa for values between 0 and 5. SAP InfiniteInsight also allows the predictive model developer the freedom of overruling the chosen variables and selecting the ones of interest under Selecting contributory variables.

Figure 76: SAP InfiniteInsight selecting contributory variables

Receiver operating characteristic (ROC) charting:

Figure 77: ROC chart in SAP InfiniteInsight

In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied.

Presenting the findings in an intuitive decision tree:

Figure 78: Decision Tree in SAP InfiniteInsight

The decision tree in SAP InfiniteInsight allows for a visual interpretation of the driving indicators and could be used for discussions with the business people.

Further possibilities with SAP InfiniteInsight: using the Maximize Profit functionality would allow a more detailed identification of the relevant customers to address with a marketing campaign, for instance by weighing the cost of contacting a customer against the probability of acquiring the customer and harvesting the revenue.
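The ROC construction described above, sweeping the discrimination threshold and recording the (false positive rate, true positive rate) pairs, can be sketched in plain Python (our own helper, not the chart's implementation):

```python
# Build ROC curve points by descending score: each record moves the
# curve up (true positive) or right (false positive).
def roc_points(scores, actuals):
    pos = sum(actuals)
    neg = len(actuals) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, a in sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True):
        if a == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
# a perfect ranking walks straight up, then right, ending at (1.0, 1.0)
```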

SAP Predictive Analysis - custom R scripts: Naïve Bayes

First iteration with Naïve Bayes, choosing all 85 variables except the dependent variable Number_of_mobile_home_policies (caravan insurance), using Naïve Bayes on the balanced training dataset: the R script based Naïve Bayes algorithm predicts a significant part of the potential customers in the testing dataset, however at the cost of a lot of false positives.

Second iteration with Naïve Bayes, choosing only 10 of the 85 variables that the decision tree algorithm identified as driving indicators of Number_of_mobile_home_policies (caravan insurance):

1. Customer subtype: Average family
2. Customer subtype: Career Loners
3. Customer subtype: Farmers
4. Other relation
5. High level education
6. Medium level education
7. Social class A
8. 2 cars
9. No car
10. Average income

Figure 79: Naïve Bayes in SAP Predictive Analysis

This now results in the following confusion matrix:

Going further, another variable, such as Purchasing power class, is added manually; in essence a manual stepwise regression approach. Adding all variables to the Naïve Bayes algorithm results in a score where more predictions are matched, but more false positives also exist. The automated approach was able to correctly predict 329 of the 348 buyers in the training dataset. Saving the model and running it on the validation set provides the following results:

To conclude anything, there is a need for further analysis of the Naïve Bayes algorithm against the CoIL 2000 challenge business question of predicting the 20% most probable customers. Specifically, the R script embedded in SAP Predictive Analysis must be enhanced so that it can add a probability per predicted customer/transaction. However, as we will illustrate, our model was clearly overfitting.

One of the major findings during the writing of this article is the value of combining SAP InfiniteInsight and SAP Predictive Analysis. Using SAP InfiniteInsight to filter out (prune) insignificant or noisy variables, and using this optimized dataset together with SAP Predictive Analysis, proved a powerful combination, as will be demonstrated in the following. The pruning of insignificant variables could also be achieved, although not as automated as in SAP InfiniteInsight, using SAP Predictive Analysis and the R-based correlation matrix algorithm with the proper correlation measure, such as Pearson, Kendall or Spearman.
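The missing piece, a probability per predicted record, is exactly what Naïve Bayes can provide. A compact, self-contained Python sketch of a Bernoulli Naïve Bayes with per-record probabilities (a toy stand-in for the embedded R script, with our own names and data):

```python
# Bernoulli Naive Bayes on 0/1 features with add-one smoothing,
# returning a probability per record rather than only a class label.
def train_nb(X, y):
    n = len(y)
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / n
        # P(feature j = 1 | class c), Laplace-smoothed
        likelihood = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                      for j in range(len(X[0]))]
        model[c] = (prior, likelihood)
    return model

def predict_proba(model, x):
    joint = {}
    for c, (prior, like) in model.items():
        p = prior
        for j, v in enumerate(x):
            p *= like[j] if v == 1 else 1 - like[j]
        joint[c] = p
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}

X = [[1, 1], [1, 0], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
model = train_nb(X, y)
proba = predict_proba(model, [1, 1])  # P(buy=1 | x) exceeds 0.5 here
```

With such probabilities, the top 20% can be selected exactly as in the test design.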

Re-modeling in SAP Predictive Analysis with the variables detected by SAP InfiniteInsight

After using SAP InfiniteInsight to prune the number of variables, the Naïve Bayes algorithm is capable of correctly identifying 92 of the 238 potential customers in the validation dataset.

Figure 80: Selecting variables in Naïve Bayes in SAP Predictive Analysis

Figure 81: Results in SAP Predictive Analysis

The automated approach was able to correctly predict 154 of the 348 buyers in the training dataset. Using only 10 variables of the dataset could imply a significant simplification of the needed data acquisition, preparation and quality control, compared to having to maintain all 85 variables.

Figure 82: Results after pruning the dataset in SAP Predictive Analysis

Advanced data mining with ensemble models in SAP PA
Previously in this article our experiments revealed the best scoring algorithms based on the classifier confusion matrix; several algorithms perform equally well. The question is which model gives the most comprehensible rules. There is also the option of combining the predictions of several models and, for a classification variable, using voting, for example choosing the most popular outcome. This is referred to as ensemble modeling and has been shown to produce more robust models. Data mining ensemble models can be implemented in many different ways, such as letting the different algorithms vote on which customers to predict as buying customers. The wisdom of crowds concept motivates ensembles because it illustrates a key principle of ensembling: predictions can be improved by averaging the predictions of many. [Abbott, 2012] The hypothesis behind our approach is that the best predictions of both algorithms can be leveraged using the probability, or expected validity, produced by the algorithms. The intent is to select the best or premium predictions of both the Logistic Regression and the CNR Decision Tree algorithms, customer by customer. As described in the section on Logistic Regression, we trained the model on the train dataset and ran the trained model on the test dataset. The predicted customers were written to a HANA table as shown below. SAP Predictive Analysis creates a table in HANA that includes both the PredictedValue and a FittedValue (probability) that we will use in the following.

As described in the section on the CNR Decision Tree, we trained the model on the train dataset and ran the trained model on the test dataset. The predicted customers were written to a HANA table as shown below. SAP Predictive Analysis is capable of creating tables in HANA regardless of the data source. When using data sources other than SAP HANA the process is a bit longer, as you have to point at the specific SAP HANA JDBC driver as shown below and enter the HANA box name (grayed out), port number, etc.

SAP PA writes back the CNR Decision Tree prediction data, which includes both the PredictedValue and a FittedValue (probability) that we will use in the ensemble model. Looking at the gain chart for the CNR Decision Tree and the calculations supplementing the classification confusion matrix, we see that the accuracy, as expected, decreases when selecting more customers. Focusing only on the actual customers in the test dataset, we see that the curve has a good hit rate up to about 800 selected customers, after which it seems to flatten out. This implies that ever more customers must be selected to find another actual customer interested in buying. In a data mining scenario it is typically interesting to know the price of finding the next customer, and to stop before the cost of finding a customer exceeds the profit generated by that customer. Our selected process is to create two tables with the predictions per algorithm together with the fitted value (probability) and union these two tables as shown below, thereafter selecting the 800 unique customers, removing duplicates where the two algorithms found the same customer, and sorting by probability in descending order.

Figure showing the union of decision tree and logistic regression prediction results with actual target and predicted classification. Statement to select the 800 customers most likely to buy caravan insurance from the ensemble of decision tree and logistic regression prediction results:

select top 800 CUSTOMER_ID, MAX(Probability) as MAX_PROB
from ENSEMBLE
where PREDICTED = 1
group by CUSTOMER_ID
order by MAX_PROB DESC
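The selection logic behind that statement can be sketched in plain Python: for each predicted buyer keep the highest probability seen across the two algorithms, then take the top-N customers. The row layout `(customer_id, predicted, probability)` is illustrative, not the article's exact table schema.

```python
def top_customers(rows, n=800):
    """Deduplicate predicted buyers by max probability, return the top n.

    rows: iterable of (customer_id, predicted, probability) tuples, the
    equivalent of the unioned ENSEMBLE table.
    """
    best = {}
    for customer_id, predicted, probability in rows:
        if predicted == 1:  # only customers predicted as buyers
            best[customer_id] = max(best.get(customer_id, 0.0), probability)
    # Sort descending by probability, as the ORDER BY ... DESC does.
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```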

This new prediction dataset is then compared with the actual validation dataset.

select T0."CUSTOMER_ID", T0."TARGET", T0."PREDICTED", T0."PROBABILITY", T0."ALGORITHM"
from "I070917"."ENSEMBLE_800_CUSTOMERS" T1
inner join "I070917"."ENSEMBLE" T0
  on T1."CUSTOMER_ID" = T0."CUSTOMER_ID"
  and T1."MAX_PROB" = T0."PROBABILITY"

Scoring the ensemble model with the actual target from the test dataset: for purely academic interest we can break the results down by which algorithm predicted how many customers, against the actual target.
Confusion matrix for the ensemble of logistic regression and decision tree models:
Selecting the 800 customers with the highest probability, using the fitted (probability) value to sort the most likely predicted customers, reveals that the combination of SAP Predictive Analysis Logistic Regression with HANA PAL and the SAP Predictive Analysis CNR Decision Tree can predict 119 of the 238 customers in the test dataset.

Note: the different algorithms may produce probability or fitted values in different ways, and comparing these should be investigated further on a case-by-case basis.
Advanced data mining with ensemble models in SAP PA & SAP InfiniteInsight
Using the same approach as described for the ensemble of models, but this time with an ensemble using voting principles and predicted data from SAP InfiniteInsight combined with SAP PA. The applied voting principle is in essence a democratic process where the scoring results of all chosen algorithms are combined and the customers with the most votes are selected. Each model in the ensemble vote is given equal weight. Adding the scoring results from SAP InfiniteInsight to the scoring produced by SAP Predictive Analysis: the results produced by SAP InfiniteInsight are written into a new HANA table. This table is then combined with the previously produced scoring results from SAP PA using UNION as shown below:

--- Adding ENSEMBLE MODELs Dtree, LogReg and InfiniteInsight
DROP VIEW ENSEMBLE_RESULTS_LOG_DTREE_II;
create view ENSEMBLE_RESULTS_LOG_DTREE_II AS (
  select T0."CUSTOMER_ID", T0."TARGET", T0."PREDICTED", T0."PROBABILITY", T0."ALGORITHM"
  from "I070917"."ENSEMBLE_INFINITEINSIGHT" T0
  UNION
  select T0."CUSTOMER_ID", T0."TARGET", T0."PREDICTED", T0."PROBABILITY", T0."ALGORITHM"
  from "I070917"."ENSEMBLE_RESULTS_LOG_DTREE" T0
);

This new combined prediction dataset is then summarized and sorted, keeping the 800 customers with the most votes.
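The equal-weight voting described above can be sketched as follows: each algorithm's set of predicted buyers casts one vote per customer, and customers are ranked by vote count. The input shape (one list of predicted customer IDs per algorithm) is an assumption for illustration.

```python
from collections import Counter

def vote(predictions_by_algorithm, n=800):
    """Rank customers by how many algorithms predicted them as buyers.

    predictions_by_algorithm: one iterable of predicted customer IDs per
    algorithm, each given equal weight.
    """
    votes = Counter()
    for predicted_ids in predictions_by_algorithm:
        votes.update(set(predicted_ids))  # at most one vote per algorithm
    return [cid for cid, _ in votes.most_common(n)]
```

A probability-weighted variant would add each algorithm's fitted value instead of a constant 1, which is essentially what the point-system ensemble later in the article does.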

As can be seen in the table above, all the selected algorithms agree on the first 614 customers, after which the algorithms start to produce different predictions. It should of course be noted that chance plays a part here, in that more customers than just those ranked 614 to 800 have two votes from the algorithms. The ensemble of models using the voting principle is able to correctly predict 121 of the 238 customers in total, based on the 800 customers most likely to buy.
Confusion matrix for the ensemble of logistic regression and decision tree models:
The ensemble of models would be a very interesting capability to have as automated, built-in functionality of the predictive tool-set. This would increase the productivity of data scientists working with more advanced modeling.

Advanced data mining with ensemble models with voting and sorting
As in the previous ensemble model, but this time enhanced with a point system. Each algorithm gives a customer points according to, firstly, whether the customer was predicted as a buyer and, secondly, the probability. The points are given as shown below, sorting by predicted flag and then by probability. This process is performed for all three algorithms, which are then brought together for final scoring. The 800 customers with the best points are selected and scored. The ensemble of models using this voting principle is able to correctly predict 120 of the 238 customers in total, based on the 800 customers most likely to buy. Using the ensemble model approach to select the top 800 customers could of course be disputed, just as a blind chicken will eventually find a corn. However, it should be noted that all three different ensemble model approaches produce better results than the individual algorithms.
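One possible reading of that point system, sketched under assumptions (the article does not spell out the exact scheme): within each algorithm, customers are ranked by predicted flag first and probability second, a customer's rank position becomes its points, and points are summed across algorithms before taking the top N.

```python
def score_points(results_by_algorithm, n=800):
    """Point-system ensemble (illustrative reading of the scheme above).

    results_by_algorithm: per algorithm, a list of
    (customer_id, predicted, probability) tuples.
    """
    totals = {}
    for results in results_by_algorithm:
        # Ascending sort: predicted buyers with high probability land last
        # and therefore earn the most points.
        ranked = sorted(results, key=lambda r: (r[1], r[2]))
        for points, (cid, _, _) in enumerate(ranked, start=1):
            totals[cid] = totals.get(cid, 0) + points
    order = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [cid for cid, _ in order[:n]]
```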

Model comparison: high-level summary of algorithm performance
For the classification problem described in this article multiple other algorithms could have been put to the test, so please note that this list is not complete but reflects the choice of the authors.
Figure 83: Model comparison matrix - summary
How to start interpreting the model comparison matrix: as an example, SAP InfiniteInsight with default metadata produced its prediction based on the 20% of the population with the highest probability, as set in the Classification Decision panel. This translates to the 800 customers that SAP InfiniteInsight selects as having the highest probability of being willing to buy additional caravan insurance. The 800 predicted customers are then compared to the actual result set given by CoIL; the result is as follows: 118 positive matches were predicted as actual customers and the remaining 682 predicted customers are false positives. TP = True Positives, FN = False Negatives, FP = False Positives and TN = True Negatives. The accuracy (AC) is the proportion of the total number of predictions that were correct: AC = (TP + TN) / (TP + FN + FP + TN). Precision (P) is the proportion of the predicted positive cases that were correct: P = TP / (TP + FP). Finally, sensitivity (S) measures the model's ability to identify positive results: S = TP / (TP + FN). The accuracy determined using the accuracy (AC) equation may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases. Suppose there are 1000 cases, 995 of which are negative and 5 positive. If the system classifies them all as negative, the accuracy is 99.5%, even though the classifier missed all positive cases. This is the reason we decided to use precision (P) together with accuracy, to give more balance to the model comparison.
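The three measures above translate directly into code, and the class-imbalance example from the text can be checked with them:

```python
def accuracy(tp, fn, fp, tn):
    """AC = (TP + TN) / (TP + FN + FP + TN)"""
    return (tp + tn) / (tp + fn + fp + tn)

def precision(tp, fp):
    """P = TP / (TP + FP)"""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """S = TP / (TP + FN)"""
    return tp / (tp + fn)

# Imbalance example from the text: 995 negatives, 5 positives, everything
# classified negative -> accuracy(0, 5, 0, 995) is 0.995 although no
# positive case was found, while precision is undefined (no predictions).
```

For the SAP InfiniteInsight example above (118 true positives out of 800 selected), precision works out to 118 / 800 = 0.1475.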
From the results of the actual CoIL challenge this chart was published depicting each participant's actual submission. Added to it are the results from the 3 top-scoring algorithms in this article.

In the figure below our six best scoring approaches and algorithms are shown in different colors:
Number 1 (Blue) is SAP Predictive Analysis with the Naïve Bayes algorithm, 92.
Number 2 (Orange) is SAP InfiniteInsight without optimizations on the Value (Ratio type), 110.
Number 3 (Red) is SAP Predictive Analysis with the R CNR Decision Tree, 116.
Number 4 (Purple) is SAP Predictive Analysis with Logistic Regression (HANA PAL), 117.
Number 5 (Green) is SAP InfiniteInsight with the true meta-data information, 118.
Number 6 (Black) is the SAP Predictive Analysis and InfiniteInsight ensemble of premium models for logistic regression, CNR decision tree and InfiniteInsight classification, 121.
Figure 84: Frequency distribution of prediction scores
However, this result should also be seen in conjunction with the number of false positives, as depicted in the model comparison matrix. Moreover, SAP PA with Naïve Bayes in our data mining experiments used only 706 customers, compared to the 800 customer samples used by the other algorithms, to provide its 92 correct customer matches. The actual results from the CoIL 2000 competition - which algorithm predicted the highest number of potential mobile home insurance buyers (accuracy method), by algorithm:
121 Naive Bayes (with optimized dataset; binning etc.)
115 Ensemble of 20 pruned Naive Bayes
112 GA with numerical and Boolean operators
111 SVM regression
111 SVM regression
96 Subgroup discovery
96 Decision tree
80 Fuzzy rules
74 CART
72 Neural nets
For the complete description please check the reference: (CoIL, 2000a).

11 Evaluation
The evaluation phase investigates how well the model meets the business objectives by analyzing how the model performed on the test data. Furthermore, this phase examines the data mining results, the selection process, and the building and assessment of the modeling techniques. Finally, the next steps to finish the project are determined by listing actions and making decisions for the next phase: deployment. The tasks for this phase are shown in the figure.
Figure 85: Tasks for the evaluation phase
Evaluate Results
In this section the test results are evaluated. This involves determining whether the model is efficient or not. This task also consists of unveiling additional challenges, information or hints for future directions. As presented in the modeling section, SAP InfiniteInsight was used to detect the variables influencing the detection of potential customers for new mobile home insurance. These variables are fed into a Naïve Bayes algorithm inside SAP Predictive Analysis, and without any further optimization 92 customers are detected of the potential 238. In terms of the CoIL competition this would already place the model in a decent position compared to the other submitted models. Using SAP InfiniteInsight with the 20% population it can predict a staggering 110 customers, and using the optimized data description (value or ratio type) increases the prediction to 119, placing it at the very top of the submitted prediction scores.
Review Process
This task contains an in-depth review of the model to check whether any important factors have been overlooked, to examine whether the model has been built correctly, and furthermore to explore the quality of the model.
Determine Next Steps
Here it is decided how to proceed to the next phase (e.g. deployment) and whether the project requires additional iterations to progress further. This also includes analysis of the remaining resources and budget, which has an impact on the project decisions.


More information

Performing a data mining tool evaluation

Performing a data mining tool evaluation Performing a data mining tool evaluation Start with a framework for your evaluation Data mining helps you make better decisions that lead to significant and concrete results, such as increased revenue

More information

Pentaho Data Mining Last Modified on January 22, 2007

Pentaho Data Mining Last Modified on January 22, 2007 Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org

More information

Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010

Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010 Decision Support Optimization through Predictive Analytics - Leuven Statistical Day 2010 Ernst van Waning Senior Sales Engineer May 28, 2010 Agenda SPSS, an IBM Company SPSS Statistics User-driven product

More information

Predictive analytics for the business analyst: your first steps with SAP InfiniteInsight

Predictive analytics for the business analyst: your first steps with SAP InfiniteInsight Predictive analytics for the business analyst: your first steps with SAP InfiniteInsight Pierpaolo Vezzosi, SAP SESSION CODE: 0605 Summary Who said you need a PhD to do sophisticated predictive analysis?

More information

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Nine Common Types of Data Mining Techniques Used in Predictive Analytics 1 Nine Common Types of Data Mining Techniques Used in Predictive Analytics By Laura Patterson, President, VisionEdge Marketing Predictive analytics enable you to develop mathematical models to help better

More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

SAP BusinessObjects BI Clients

SAP BusinessObjects BI Clients SAP BusinessObjects BI Clients April 2015 Customer Use this title slide only with an image BI Use Cases High Level View Agility Data Discovery Analyze and visualize data from multiple sources Data analysis

More information

An Introduction to Advanced Analytics and Data Mining

An Introduction to Advanced Analytics and Data Mining An Introduction to Advanced Analytics and Data Mining Dr Barry Leventhal Henry Stewart Briefing on Marketing Analytics 19 th November 2010 Agenda What are Advanced Analytics and Data Mining? The toolkit

More information

A Knowledge Management Framework Using Business Intelligence Solutions

A Knowledge Management Framework Using Business Intelligence Solutions www.ijcsi.org 102 A Knowledge Management Framework Using Business Intelligence Solutions Marwa Gadu 1 and Prof. Dr. Nashaat El-Khameesy 2 1 Computer and Information Systems Department, Sadat Academy For

More information

SAP HANA Live for SAP Business Suite. David Richert Presales Expert BI & EIM May 29, 2013

SAP HANA Live for SAP Business Suite. David Richert Presales Expert BI & EIM May 29, 2013 SAP HANA Live for SAP Business Suite David Richert Presales Expert BI & EIM May 29, 2013 Agenda Next generation business requirements for Operational Analytics SAP HANA Live - Platform for Real-Time Intelligence

More information

Data Mining for Fun and Profit

Data Mining for Fun and Profit Data Mining for Fun and Profit Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. - Ian H. Witten, Data Mining: Practical Machine Learning Tools

More information

ANALYTICS STRATEGY: creating a roadmap for success

ANALYTICS STRATEGY: creating a roadmap for success ANALYTICS STRATEGY: creating a roadmap for success Companies in the capital and commodity markets are looking at analytics for opportunities to improve revenue and cost savings. Yet, many firms are struggling

More information

Empower Individuals and Teams with Agile Data Visualizations in the Cloud

Empower Individuals and Teams with Agile Data Visualizations in the Cloud SAP Brief SAP BusinessObjects Business Intelligence s SAP Lumira Cloud Objectives Empower Individuals and Teams with Agile Data Visualizations in the Cloud Empower everyone to make data-driven decisions

More information

The Power of Predictive Analytics

The Power of Predictive Analytics The Power of Predictive Analytics Derive real-time insights with accuracy and ease SOLUTION OVERVIEW www.sybase.com KXEN S INFINITEINSIGHT AND SYBASE IQ FEATURES & BENEFITS AT A GLANCE Ensure greater accuracy

More information

SAP S/4HANA Embedded Analytics

SAP S/4HANA Embedded Analytics Frequently Asked Questions November 2015, Version 1 EXTERNAL SAP S/4HANA Embedded Analytics The purpose of this document is to provide an external audience with a selection of frequently asked questions

More information

Data Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved.

Data Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved. Data Mining with SAS Mathias Lanner mathias.lanner@swe.sas.com Copyright 2010 SAS Institute Inc. All rights reserved. Agenda Data mining Introduction Data mining applications Data mining techniques SEMMA

More information

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:

More information

WHITEPAPER. Creating and Deploying Predictive Strategies that Drive Customer Value in Marketing, Sales and Risk

WHITEPAPER. Creating and Deploying Predictive Strategies that Drive Customer Value in Marketing, Sales and Risk WHITEPAPER Creating and Deploying Predictive Strategies that Drive Customer Value in Marketing, Sales and Risk Overview Angoss is helping its clients achieve significant revenue growth and measurable return

More information

COURSE SYLLABUS COURSE TITLE:

COURSE SYLLABUS COURSE TITLE: 1 COURSE SYLLABUS COURSE TITLE: FORMAT: CERTIFICATION EXAMS: 55043AC Microsoft End to End Business Intelligence Boot Camp Instructor-led None This course syllabus should be used to determine whether the

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Customer Analytics. Turn Big Data into Big Value

Customer Analytics. Turn Big Data into Big Value Turn Big Data into Big Value All Your Data Integrated in Just One Place BIRT Analytics lets you capture the value of Big Data that speeds right by most enterprises. It analyzes massive volumes of data

More information

Easily Identify Your Best Customers

Easily Identify Your Best Customers IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

Selecting the Right SAP BusinessObjects BI Client Product based on your business requirements for SAP BW Customers

Selecting the Right SAP BusinessObjects BI Client Product based on your business requirements for SAP BW Customers Selecting the Right SAP BusinessObjects BI Client Product based on your business requirements for SAP BW Customers Ingo Hilgefort Director Solution Management Disclaimer This presentation outlines our

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Outlines. Business Intelligence. What Is Business Intelligence? Data mining life cycle

Outlines. Business Intelligence. What Is Business Intelligence? Data mining life cycle Outlines Business Intelligence Lecture 15 Why integrate BI into your smart client application? Integrating Mining into your application Integrating into your application What Is Business Intelligence?

More information

Winning with an Intuitive Business Intelligence Solution for Midsize Companies

Winning with an Intuitive Business Intelligence Solution for Midsize Companies SAP Product Brief SAP s for Small Businesses and Midsize Companies SAP BusinessObjects Business Intelligence, Edge Edition Objectives Winning with an Intuitive Business Intelligence for Midsize Companies

More information

Trends, Strategy, Roadmaps and Product Direction for SAP BI tools in SAP HANA environment

Trends, Strategy, Roadmaps and Product Direction for SAP BI tools in SAP HANA environment September 9 11, 2013 Anaheim, California Trends, Strategy, Roadmaps and Product Direction for SAP BI tools in SAP HANA environment Surya K Dutta surdutta@deloitte.com Analytics New possibilities In-Memory

More information

Enterprise Information Management Services Managing Your Company Data Along Its Lifecycle

Enterprise Information Management Services Managing Your Company Data Along Its Lifecycle SAP Solution in Detail SAP Services Enterprise Information Management Enterprise Information Management Services Managing Your Company Data Along Its Lifecycle Table of Contents 3 Quick Facts 4 Key Services

More information

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING

More information

Predictive Analytics for Procurement Lead Time Forecasting at Lockheed Martin Space Systems

Predictive Analytics for Procurement Lead Time Forecasting at Lockheed Martin Space Systems Orange County Convention Center Orlando, Florida June 3-5, 2014 Session Code: 0204 Predictive Analytics for Procurement Lead Time Forecasting at Lockheed Martin Space Systems Using SAP HANA, R, and the

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Using Data Mining to Detect Insurance Fraud

Using Data Mining to Detect Insurance Fraud IBM SPSS Modeler Using Data Mining to Detect Insurance Fraud Improve accuracy and minimize loss Highlights: Combine powerful analytical techniques with existing fraud detection and prevention efforts Build

More information

SAP Predictive Analysis: Strategy, Value Proposition

SAP Predictive Analysis: Strategy, Value Proposition September 10-13, 2012 Orlando, Florida SAP Predictive Analysis: Strategy, Value Proposition Charles Gadalla, Solution Management, SAP Business Intelligence Manavendra Misra, Chief Knowledge Officer, Cognilytics

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Data Project Extract Big Data Analytics course. Toulouse Business School London 2015

Data Project Extract Big Data Analytics course. Toulouse Business School London 2015 Data Project Extract Big Data Analytics course Toulouse Business School London 2015 How do you analyse data? Project are often a flop: Need a problem, a business problem to solve. Start with a small well-defined

More information

Data Mining Techniques

Data Mining Techniques 15.564 Information Technology I Business Intelligence Outline Operational vs. Decision Support Systems What is Data Mining? Overview of Data Mining Techniques Overview of Data Mining Process Data Warehouses

More information

Qlik s Associative Model

Qlik s Associative Model White Paper Qlik s Associative Model See the Whole Story that Lives Within Your Data August, 2015 qlik.com Table of Contents Introduction 3 Qlik s associative model 3 Query-based visualization tools only

More information

Banking Analytics Training Program

Banking Analytics Training Program Training (BAT) is a set of courses and workshops developed by Cognitro Analytics team designed to assist banks in making smarter lending, marketing and credit decisions. Analyze Data, Discover Information,

More information

Numerical Algorithms Group

Numerical Algorithms Group Title: Summary: Using the Component Approach to Craft Customized Data Mining Solutions One definition of data mining is the non-trivial extraction of implicit, previously unknown and potentially useful

More information