Using Data Mining for Rate Making in the Insurance Industry
Written by SAS Institute
June 2003
1. Introduction
   1.1 Trends in Insurance and Rate Making
   1.2 Introduction to Rate Making
   1.3 Rate Making: Data, Concepts, Terminology
2. Data Mining for Rate Making
   2.1 Introduction to Data Mining
   2.2 Introduction to Enterprise Miner
   2.3 Rate Making Using Data Mining
   2.4 Rate Making Using Traditional Statistics
3. Finding a New Rate Structure
4. Refining an Existing Rate Structure
5. Conclusions
6. References
Appendix
   A. An Introduction to Statistical Rate Making
   B. Code for Generalized Linear Models Estimating Accident Frequency
   C. Enterprise Miner - SAS Institute's Solution for Data Mining
Abstract

Given the current state of the insurance landscape, with carriers competing for customers while dealing with rising claim rates and high loss ratios, the need has never been greater for insurers to establish and maintain a complete view of their current and prospective policyholders. A key part of this complete view is the specific claims risk that each customer poses to the company. Evaluating claims propensity gives insurance providers an objective way to adjust premium rates based on relevant, customer-specific information. Accurately predicting claims propensity throughout an organization can dramatically improve loss ratios and enable providers to charge lower premiums to low-risk customers. To improve loss ratios through more effective pricing and marketing, insurance providers need a solution that gathers relevant claims data from every corner of the enterprise, as well as from third-party data sources, and provides both claims propensity predictions and claim size predictions.
1. Introduction

1.1 Trends in Insurance and Rate Making

The insurance industry is undergoing profound changes as a result of globalization, deregulation, and technology. In essence, it is moving from a product-oriented industry to a customer-oriented service provider. Deregulation has allowed non-insurance companies, such as banks and even retailers, to enter the insurance market and compete against the established firms. At the same time, deregulation has also allowed insurance companies to enter other financial markets, such as investment banking, requiring the development and marketing of new products and the identification of potential customer groups. Globalization has accentuated this change: insurance companies now face competitors from many parts of the world, many of which move aggressively into new markets, trying to attract customers with low rates and improved services. Technology has further added to the momentum of change, mostly through different aspects of computerization, such as general access to computers, the Internet, and software to support business operations. General access to computers allowed a decentralization of computing resources within a company and enabled different business units and departments to develop many more business models, product variants, and customer segments than was possible before. As a result, more and more emphasis has shifted from the traditional product portfolio to innovative products targeting selected customer segments. The Internet provides customers with direct access to the company, now requiring the company itself, not just its agents, to deal directly with the customer, with many individual information requests, and with the need to manage prospects and customers effectively. The Internet is also enabling competition from a new breed of insurance organizations built on e-business: much leaner, with few agents, if any, and few branch offices, if any, but quite able to compete worldwide for customers. The result, again, is an increased focus on the customer, his needs, and the tailoring of insurance policies to fit those needs on a basis that is profitable for all concerned. This trend toward an individualization of services and policies is supported and implemented by increasingly sophisticated business software. Relevant here are the storage and retrieval of customer, policy, and business data in databases and data warehouses and the use of these data in data mining systems. Customer data no longer consist simply of the data provided by the customer when filling in an application form, but are enhanced with any data considered relevant, limited by ingenuity, availability, and legislation. Common enhancements are data from other business transactions carried out within the organization, such as possible investment activities or other life/non-life insurance contracts, possibly including other companies; demographic information; professional and personal information; and, where permitted, general financial information including credit information, travel information, and conceivably anything left behind as an electronic trail (such as credit card transactions). This potentially vast collection of data on customers and prospects can be analyzed with data mining software for a variety of business concerns, among them target marketing, cross-selling, fraud detection, customer relationship management, and rate making. In target marketing, the purpose is to identify the group of prospects that are most likely to be interested in a particular product.
A common approach uses so-called "predictive data mining". Based on available data about customers who have or have not bought the product in question, the most important characteristics distinguishing these groups are determined using one of several modeling tools. These characteristics are then used to score the likelihood that a prospect will acquire the product. Often the data are not already available but need to be obtained in one or several test campaigns. Care also needs to be taken that only those data are used that will be available for the prospects to be scored. In cross-selling, the purpose is similar, except that we are attempting to sell an additional product to already established customers. In the insurance industry this turns out to be important for bonding the customer: a customer is less likely to let a policy lapse if he owns more than one product from the company. In addition to predictive data mining, there are other techniques that can be applied here, for example association analysis. In this type of analysis we could determine which types of insurance products "go together", i.e. occur frequently together in our customer database. In addition to frequency, there are other measures available that assess the affinity among such products. In any case, we can look for affinity product groups that include the product we want to cross-sell.
We then focus on all customers who have all of the affinity products except for our cross-selling product. These customers will then be the targets of our cross-selling effort. The techniques for fraud detection depend very much on the application concerned. In areas such as mortgage fraud or health insurance fraud the emphasis might be on detecting suspicious relationships between the participants involved. For example, it might turn out that many accident victims are diagnosed by a small set of doctors, or that high-cost referrals involve a small set of doctors, or that a collection of properties has frequently changed hands recently, involving the same people and increasing in price each time. Customer relationship management focuses on the "lifetime value" of a customer and addresses various issues such as customer acquisition, retention, loyalty, and cross-selling. It not only involves various data mining business questions and techniques, but also goes beyond data mining and may involve data management, warehousing, and customer interfacing issues at a minimum, and often the entire organizational structure and business processes as well. Finally, rate making is the topic of immediate concern in times of increased competitive pressure. The price of a policy is usually the most direct lever that a competitor can use in order to increase his market share. The same forces that were discussed above are at work here: a movement away from standard policies with broadly defined risk classes, toward individualized risk assessment and almost individualized pricing. As will be detailed in the next section, the methods used to set policy rates are undergoing change, and it is data mining again that can contribute innovative solutions to this task. In essence, data mining allows many more factors to be considered in the rate making process than was possible before. Due to the specialized nature of this topic, the following discussion may become more technical at times. It may also require some additional knowledge of or experience with data mining and/or the SAS Enterprise Miner. The SAS Best Practice Paper "Data Mining in the Insurance Industry" provides an excellent coverage of these topics. It also addresses rate making, using an approach similar to one of the presentations in chapter 3 below.

1.2 Introduction to Rate Making

Traditionally, the process of rate making involves the identification of risk factors and the computation of premiums based on expected claims and payments for the various combinations of predefined risks. Typical of this are one-page rate sheets showing automobile insurance rates based on automobile power and region, potentially modified by discounts for certain occupations or by age surcharges. This basic rate would be subject to penalties or reductions depending on accident frequency. By and large, the scheme would remain stable over many years. Only lately have insurers begun to construct more complex risk classes by considering additional factors, such as yearly mileage driven, number of other drivers in the household, or use of a garage. Even with this extended scheme, the basic model is to lump customers sharing the same risk features together into one tariff, and to separate them out on the basis of their accident behavior.
One major implication of this approach is the fact that, among new customers without driving history, the good drivers will subsidize the bad drivers for a substantial time, until the premium reduction of the good drivers takes effect, which may be a slow process extending over years. Similarly, the rate increase for new bad drivers will lag their actual loss costs. Compounding factors are the common regulatory requirements on the industry, which attempt to make sure that insurance remains affordable. These limit the maximum amount of premium that can be asked of customers, irrespective of their accident behavior. Therefore, even when considering long-standing customers, it is unavoidable that good drivers subsidize bad drivers to some degree. In fact, entire risk groups, such as young male drivers or drivers from inner-city locations, may be charged a premium that reflects the higher insurance costs for the entire risk group rather than individual accident behavior, since regulatory requirements will not allow premiums to be set high enough to offset the costs of bad drivers. Again, good drivers subsidize bad drivers. While this may be socially desirable, the recent flurry of deregulation and international competition has made it possible for insurance providers to offer innovative rate structures geared toward attracting and rewarding good customers while avoiding bad customers as much as possible. These rate structures
differ from the traditional ones by considering many more risk factors than was previously done. As an example, one analysis found that drivers who spend time working on their own cars tend to incur lower accident costs. Such personal hobbies have traditionally not been used in rate setting. Doing so greatly expands the number of potential descriptors that could be used in defining risk groups. These descriptors might include anything from demographic information to personal preferences, habits, and stress factors, limited only by the availability of data. Data privacy and regulatory requirements may, for the time being, prevent widespread use of some of these data, but many more descriptors will be used in rate setting than has traditionally been the case. Data mining is the method of choice for managing the complexity introduced by using the additional variables. Using predictive modeling, the major determinants of accident behavior can be found, producing much smaller and more homogeneous subgroups of drivers, or of insurance customers in general. Rate setting will involve the determination of many niches of good and bad customers and result in rules that characterize these various groups. This shifts the emphasis of the industry from the problem of determining the optimal premium to the issue of improving the customer relationship: identify low-risk customers, adjust their premiums in order to win their loyalty, improve customer retention, and increase market share. Data mining is also the method of choice for monitoring the effectiveness of these goals, as well as the performance of the various rules developed for setting the insurance premiums.

1.3 Rate Making: Data, Concepts, Terminology

As for any other data mining task, the most fundamental requirement is the availability of appropriate data. In the context of insurance policies, this requires information on the policyholder, on the risks covered, on cumulative risk behavior, and optionally on a variety of additional factors that may be relevant. The remainder of this paper will focus on automobile policies. A sample set of variables used in this context is provided in Table 1. This variable set originates from an actual insurance provider and illustrates the kind of data used for rate setting. Most of the variables are self-explanatory and unsurprising; note, however, the inclusion of data on marital status and number of children. The use of other personal data may soon become commonplace. The insurance data also contain information on various types of non-accident claims, such as glass, storm, theft, and vandalism. The primary business question to be addressed with these data is the setting of rates for new and existing customers. Several of the variables may not be available for new customers, in particular data on premiums, costs, and profitability. Data on previous accident behavior may be incomplete or entirely missing, e.g. when dealing with a newly licensed driver. In order to predict a rate in these cases, care must be taken to exclude these variables from any rate-predicting model. In addition to rate setting, there is also the issue of "rate monitoring", i.e. dynamically managing the rate structure for existing customers. This involves analyzing the profitability of the various risk classes, and may result in corrective action in case of under-performance or in the face of competitive pressures. It may also involve a change in the "bonus-malus" system, i.e.
the penalizing or rewarding of customers based on their yearly accident claims. This latter aspect will, however, not be addressed in this paper. The basic concepts used in rate setting and rate monitoring are the "pure premium" and the "loss ratio". The pure premium is the premium that would be just sufficient to offset all accident claim costs. For any given customer, it can be estimated as the product of (1) the likelihood of submitting a claim and (2) the expected size of the claim. The pure premium is the "rock bottom" of any premium structure: at that level an insurance company could not generate a profit, but, ideally, would not lose any money either. In practice, of course, there are many operating expenses in a business, including buildings, equipment, and personnel, that need to be covered irrespective of how many claims occur. None of these costs are included in the pure premium. Neither are underwriting costs, such as commissions to agents and brokers. The actual premium is therefore always substantially higher than the pure premium.

Variable           Description
Contract_id        Unique contract identifier
Period_begin       Beginning date of the period covered
Period_end         Ending date of the period covered
Premium            Basic premium paid in the period
Damage_collision   Premium for additional collision damage coverage paid in the period
Total_damage       Damage of all accidents, broken down into:
Vandalism          Vandalism damage
Storm              Storm damage
Glas               Glass damage
Fire               Damage caused by fire
Theft              Damage caused by automobile theft
Theft_parts        Damage caused by theft of accessories
Usage              Type of use for the vehicle
Group              Vehicle group for insurance purposes
Z_tarif            Tariff zone
Z_damage           Damage zone
Z_theft            Theft zone
Z_glas             Glass damage zone
Previous           Duration of contract for the previous vehicle
License_age        Time since the license was obtained
Insurance_age      Length of time with this insurance
Driver_age         Age of the driver
Vehicle_age        Age of the vehicle
Possession_age     Time since the vehicle was obtained
New                Newly insured person?
Garage             Uses a garage?
Num_vehicles       Number of vehicles
Licenses           Number of license classes
Married            Marital status
Children           Number of children
Gender             Male or female driver
Bonus              Fraction of rate reduction or increase
Niveau             Tariff class
Diesel             Diesel engine?
Region             Region of vehicle location

Table 1: Typical Variables Used for Automobile Insurance

The loss ratio is another basic concept used in rate making and rate monitoring. It is defined as the ratio of claim costs to premiums. A realistic figure for automobile policies is on the order of 70%-80%, i.e. more than 70% of the premiums collected will be needed to pay for accident claims. Again, the loss ratio does not include the additional costs of operating the insurance business. The loss ratio is only meaningful when calculated over several years, in order to take into account the varying propensity of customers to have accidents and thus generate claim costs. When calculated in this way, it is, however, a very telling figure for the profitability of a company. Lowering the loss ratio is a major goal of insurance companies. This lowering can be accomplished in two ways: adjusting the rate structure in order to increase premiums, or adjusting the customer population toward producing fewer claims. In practice both ways are realized through rate adjustments: penalize bad drivers and thereby discourage them from continued coverage, and reward good drivers to turn them into loyal customers. This behavior will extend toward new customers as well, by predicting their similarity to good or bad drivers, ideally setting up conditions to attract good drivers and discourage poor ones. The method to accomplish this is data mining, which will be introduced in the next section.

2. Data Mining for Rate Making

In this chapter we will briefly introduce Data Mining, cover some key characteristics of the SAS Data Mining solution (Enterprise Miner), and present a first application of Data Mining to Rate Making. Many of the underlying Data Mining issues as well as technical details of the Enterprise Miner are
treated very briefly, or not at all, due to space limitations. The discussion is also narrowed by focusing on the needs of the insurance industry in general and on the specific approach taken in this paper. For a more extended discussion, the interested reader is referred to the best practice paper [SAS Institute 2000].

2.1 Introduction to Data Mining

Data mining attempts to extract knowledge and relationships from data. More formally, SAS Institute defines data mining as "the process of selecting, exploring and modelling large amounts of data to uncover previously unknown patterns for business advantage". The knowledge obtained from data mining can take a variety of forms, some of which have been discussed in the last chapter, e.g. target marketing, cross-selling, fraud detection, customer relationship management, and rate making. The most prominent general technique to accomplish many of these tasks is called "predictive modeling". In this technique the business issue of concern (e.g. assessing customer risk) is investigated using historical data, which allow an identification of the relevant business classes (e.g., in risk analysis, the "good" and the "poor", i.e. costly, customers) and their characteristics. These characteristics involve the various properties (e.g. customer data) and their relationships and are cast in model form: any new observation (e.g. a new customer) will be assigned a business class (e.g. a risk rating of good, bad, or somewhere in between) on the basis of its data characteristics. The results are statistically tractable with quantifiable reliability. It is the essence of predictive modeling to develop the model on known data and to apply the model to new data. Care must be taken that the model only includes those characteristics that are available for the new data. In the context of risk analysis this means that characteristics such as number of accidents, costs incurred, and profitability cannot be used for model building, since they will not be available for new customers. While model development as in predictive modeling is the major goal of data mining, it is neither the start nor the conclusion of data mining. Indeed, an entire methodology has been devised to cover the entire data mining process [The SAS Institute Data Mining Project Methodology]. This process covers issues such as the definition of the business question, organizational and information management requirements, data requirements, model building techniques, and implementation, deployment, and maintenance requirements. These issues are of paramount importance as they help to ensure the success of a data mining project, including the use of the developed solution in an operational system. Focusing on the model building techniques, another methodology (SEMMA) has been developed and is detailed in the appendix. It is a process description that starts with actual data and concludes with an optimal model. The acronym SEMMA stands for Sample, Explore, Modify, Model, and Assess. These phases will be summarized very briefly; for a more extended discussion see [From Data to Business Advantage: Data Mining, the SEMMA Methodology and the SAS System, SAS Institute 1998]. The main issues in the sampling phase concern efficiency and representation. Efficiency is particularly needed with large data sets, as it would be uneconomical and prohibitively time-consuming to develop diverse models, including their comparisons, on the entire data set.
Instead, statistically representative samples can be constructed in different ways, for different kinds of representation. For example, it is often useful to enrich the data with rare events (such as fraud) so that a sample contains an equal number of fraud and non-fraud cases. Another representation is required for data with an inherent structure across observations, such as the different items bought by customers. A sample here should select a subset of the customers, but for each customer include all the items that were bought. An extended treatment of the techniques and advantages of sampling is available in [Data Mining and the Case for Sampling, SAS Institute 1999]. The next two phases in the SEMMA process are Explore and Modify. Exploration refers to uncovering key relationships in the data. These can be uncovered using, e.g., descriptive statistics, plots and graphs, and statistical methods (such as clustering and dimension reduction techniques). Understanding the data greatly facilitates subsequent model building, especially when dealing with large numbers of variables. Modification is mostly driven by model building needs and focuses on issues such as the definition of needed derived variables (such as financial ratios computed from available primary variables), transformation of variables and values (e.g. standardization, logarithmic transforms, discretizing continuous variables into fixed ranges), treatment of missing values (computing a likely value when one is missing in the data, using different techniques), and excluding impossible, unlikely, or unwanted values (so-called "outliers"). These data manipulation techniques are essential for the
flexibility of building alternative models, as well as for generating models that use variables and values that are meaningful in the specific business setting. The Modeling phase is the heart of data mining and uses specific statistical techniques in order to build relationships between variables. In predictive modeling we have one or more variables (so-called "targets") whose outcome we are trying to predict for each data record. An example would be the prediction of the number of accidents claimed by new customers during their first insurance year. Model building involves a training set, e.g. historical data of customers with known accident behavior, and a statistical procedure that is capable of finding the particular relationships between variables that enable such a prediction. The most common such procedures are decision trees, regression, and neural nets; they are described in the appendix, and in more detail in [SAS Institute 2000]. Briefly, decision trees present a model in a tree-like structure showing how the most important variables and values affect the decision criterion (i.e. the target). Their primary virtue is the clarity and understandability of the model for users and experts alike. Linear regression is a traditional statistical technique which attempts to find coefficients so that the target value for each customer can be predicted by a linear combination of variables weighted with these coefficients. Its use and interpretation require more statistical sophistication. Neural nets are statistically similar to regressions (a form of nonlinear, rather than linear, regression) but present themselves with a user interface that shows the relationships between elements in a net-like fashion. Some of these elements correspond to variables (input and target variables), but other elements refer to as yet unknown interaction components and are usually uninterpretable. Neural nets are therefore powerful and deceptively easy to use, yet require considerable experience and time for successful results. The final phase of SEMMA is Assessment, which has the task of comparing different models and projecting the benefit of each model in statistical or business profit/cost terms. The most prominent tool to achieve this is the "lift chart". It shows how much better the model performs than a random chance decision, where performance again can be measured either statistically or in profit/cost terms. Assessment will be treated only briefly in this paper; in general the reader is referred again to [SAS Institute 2000].

2.2 Introduction to Enterprise Miner

The Enterprise Miner is the SAS solution for data mining and is organized along the lines of SEMMA as discussed above. In the following, a brief survey of the functionality needed for the purposes of this paper is provided. Some more information is contained in the appendix. For a much more complete treatment see [Getting Started with Enterprise Miner Software]. Programming with the Enterprise Miner consists of constructing a flow diagram by "dragging and dropping" the available icons onto a workspace and connecting the icons with directed arcs, reflecting the flow of information. Most of the icons come with default settings that can be complemented or changed as necessary. Typically, at the beginning of a flow diagram an Input node defines the data set to be worked with. The user will need to specify the roles of variables, such as input and target roles.
For a predictive modeling task we need to have at least one target variable and several (typically many more) input variables. Proceeding along the lines of SEMMA, the Input node may be connected to a Sampling node, which in turn may be connected to a Data Partition node. For relatively small data sets, the Sampling node may be omitted. The Data Partition node will subdivide the data into different subsets, with the intent that the model is built using one of the sets and the quality of the emerging model is evaluated using another set. Details of the model, such as the extent of granularity, the number of variables used, and parameter estimates, are chosen by monitoring the performance of the model on the evaluation data set (called the "validation set"). Sometimes another subset of the data is also created in order to be able to compare completely different models on a neutral data set (the so-called "test set"). From the Data Partition node, the flow may proceed to one of several Modify nodes, or may proceed directly to modeling nodes. Two frequently used Modify nodes are the Transform Variable node and the Filter Outlier node. In the Transform Variable node we can define new variables or transform (e.g. normalize) existing variables. In many of the subsequent examples we will use the Transform Variable node in order to generate new variables such as "pure premium" and "loss ratio". In the Filter Outlier node we can specify constraints on variable values that we wish to enforce. In our examples, we may wish to consider only losses up to a certain amount, we may want to guard against errors by insisting
that paid insurance premiums are positive numbers, or we may want to specify parameters that allow us to focus on selected parts of an analysis. Another special type of Modify node is the Data Set Attributes node, which allows us, for example, to change the role of a variable (such as from input to target) or its measurement scale. For example, if different occupations of customers have been coded as numbers, then it would be appropriate to rescale this variable as nominal, rather than leaving it on an interval scale. Since the Enterprise Miner makes use of such information about variables (so-called "metadata"), it may become necessary to specify such changes in a flow diagram. The flow will finally lead to Model nodes, i.e. Decision Tree, Regression, or Neural Net. These nodes will attempt to construct a relationship between the target variables and the available input variables. For example, a target variable may be whether or not an accident claim is filed for a customer, and the model may determine that variables such as the age of the driver, the age of the car driven, and the power of the car driven are the most important variables for determining the target. Depending on the model, details of the relationship may take a more mathematical or a more structural, descriptive form.

Figure 1: Risk Groups for Pure Premium

As an example of a model generated by a decision tree we can inspect Figure 1, which shows the result of a decision tree analysis attempting to predict a target (pure premium) from input variables. The three most important variables accounting for the target values are driving license date, period of driving, and zip code. Initially (at the top of the tree) the analysis covered 4019 customers, resulting in an average pure premium of 3109. This premium was much higher (96754) for a small group (43 customers) with driving license dates since 1991. Following the other branches we arrive at four terminal segments with different premiums, containing different numbers of customers. Taken together, the customers add up to the original 4019. One of the terminal segments still contains a large set of customers (3079) and could be subject to a more detailed subsequent analysis. In order to facilitate this, all segments are identified by a "node number" (not shown in Figure 1), which can, for example, be selected for further analysis with a Filter Outlier node. It is also important to realize that variables higher up in the tree (such as driving license date in the current example) have been determined to be "more important" than variables further down in the tree (such as zip code). This importance of variables can be quantified by different measures (such as the so-called "logworth"). In automatic mode, the system will always select the next variable on the basis of the importance measure. In interactive mode, a user might select a different variable, e.g. if it has a more customary business use and its importance measure is only marginally less than the top value. Sometimes it may improve results if models are built separately for different parts of the data. For example, if driving behavior differs between genders or countries, it may be useful to build a model for each of these cases. In order to predict the driving risk for a new customer, we would first have to determine the nationality and whether he/she is male or female, and then use the appropriate model.
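Outside the Enterprise Miner flow diagram, this kind of stratified modeling can be sketched in plain SAS code with BY-group processing. The following is only a minimal illustration, not part of the actual flow; the data set and variable names (work.train with gender, a binary claim indicator claim_flag, and the inputs driver_age and vehicle_age) are assumptions:

   /* Illustrative sketch: fit one claim model per gender group.        */
   proc sort data=work.train;
      by gender;
   run;

   proc logistic data=work.train;
      by gender;                               /* one model per stratum */
      model claim_flag(event='1') = driver_age vehicle_age;
      output out=work.scored p=p_claim;        /* predicted claim prob. */
   run;

Within Enterprise Miner the same effect is achieved graphically, as described next, by cycling a modeling node over the groups.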
Figure 2: Performance of Tree, Regression, and Neural Net

Such models will make the model specification more complex and require the use of a Group node and an Ensemble node. The Group node supports automatic cycling through all relevant customer groups (such as gender and nationality) and will execute a subsequent modeling node as often as is necessary to cover all groups. An Ensemble node after the modeling node will combine the various models into a single program code. Finally, models are compared in an Assessment node. There are several different measures available to compute the accuracy and benefit of a model. Perhaps the most popular of these measures is the "lift chart", which basically shows how much better the model is than a random chance decision. Applied to the task of predicting accident claims for insurance customers, the lift chart orders the customers by their projected likelihood of filing a claim, with each customer falling into one of ten decile groups of this ranking. For each such group, the ratio of actual claimants found in that group to the number of claimants expected in a random group of the same size is computed; the resulting lift values are plotted by decile, starting with the first (highest-risk) decile. From Figure 2 it can be seen that the top lift value is approximately 3.30 for the best model and 2.90 for the worst model. This gives an indication of the quality of the models (approximately three times better than chance in the top decile) as well as of their relative performance. When data on actual costs and profits are available, the lift chart can display dollar figures rather than lift values.
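As an illustration of how such a lift value could be computed outside the Assessment node, the sketch below ranks scored customers into deciles and compares the claim rate in each decile with the overall claim rate. The data set and variable names (work.scored, p_claim, claim_flag) are assumptions carried over from the previous sketch, not produced by the Enterprise Miner flow:

   /* Rank customers into ten groups by predicted claim probability.    */
   proc rank data=work.scored out=work.ranked groups=10 descending;
      var p_claim;                 /* model's predicted claim probability */
      ranks decile;                /* 0 = highest predicted risk decile   */
   run;

   proc sql;
      /* lift = claim rate within a decile / overall claim rate          */
      select decile,
             mean(claim_flag) / (select mean(claim_flag) from work.ranked)
                as lift format=6.2
      from work.ranked
      group by decile
      order by decile;
   quit;

A lift of about 3.3 in the top decile, as in Figure 2, would mean that the model finds roughly three times as many claimants there as a random selection of the same size.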
2.3 Rate Making Using Data Mining

Our first approach to rate making will use the concept of the pure premium, which was introduced in chapter 1. It is the amount that would offset the accident costs caused by the insurance clients, counting only the actual accident costs and no other business or overhead costs. It can be estimated as the product of accident likelihood and estimated size of a claim. Figure 3a shows the data mining flow for this.

Figure 3a: Predicting Number and Costs of Accidents

Three different models are created in order to predict claim frequency: a neural net (labeled "Neural Network"), a logistic regression (labeled "Regression"), and a decision tree (labeled "Number of Accidents"), from left to right. The top part of the diagram also shows an Input node at the start, a Data Partition node, where the data are split into a training, a validation, and a test set, and an Assessment node comparing the three models. From the assessment chart in Figure 2 it can be seen that the decision tree performs best and the neural net performs worst. The decision tree is then used for the remainder of the analysis and feeds into another decision tree estimating the expected amount of a claim (labeled "Costs" in Figure 3a). From these estimations, the pure premium can now be calculated. Before going into these details, the general strategy for using pure premiums for rate making will be described. The pure premium will be used in two ways for rate setting: determining new risk groups, and identifying premiums in need of adjustment. For this analysis the test data set generated by the Data Partition node at the top of the diagram in Figure 3a is used. In the first approach the risk variables and their structure in defining risk classes are determined, and could be the basis for a major overhaul of the currently used risk structure. In the second approach, the pure premium is simply used to pinpoint problematic parts of the currently used rate structure, leaving most of it intact. These two approaches can be implemented in a single flow structure as shown in Figure 3b. The top part of the figure is simply a copy of Figure 3a, just discussed. Its flow feeds into the lower part of Figure 3b by calculating the pure premium as the product of accident likelihood and estimated accident claim size. This is done using a Transform Variable node (labeled "Calculate pure premium"). The relationships between key variables, such as premium, pure premium, profitability, and costs can then be explored with an Insight node; alternatively, the analysis proceeds using a Data Set Attributes node to declare pure premium as a target, and a Data Partition node in order to split the test data set into a training and a validation component. Using these partitions, the most important risk variables predicting premium as well as pure premium are determined in two decision tree analyses.
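In plain SAS terms, the Transform Variable node labeled "Calculate pure premium" corresponds to a simple derivation like the sketch below. The variable names p_claim, expected_claim_size, and premium, and the data set work.scored_costs, are illustrative assumptions standing in for the scores produced by the two decision trees and the premium actually paid:

   data work.pure_premium;
      set work.scored_costs;                    /* assumed scored data    */
      pure_premium = p_claim * expected_claim_size;
      diff         = premium - pure_premium;    /* used later in Table 2  */
   run;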
Figure 3b: Rate Making Using Pure Premium

The decision tree labeled "Pure Premium" implements the first approach mentioned above, the determination of new risk classes. The decision tree labeled "Premium" starts off the alternative analysis of disclosing problems in the current rate structure. The first approach is a fairly radical one, because pure premium considers only costs due to accidents, and only accidents in the periods covered by the data at hand. As a result, the risk groups are likely to deviate substantially from the currently used risk groups based on premium paid. This turns out to be the case in this example as well; the risk groups disclosed by a decision tree analysis of pure premium are shown in Figure 1 (above), and the risk groups based on premium paid, also disclosed by decision tree analysis, are shown in Figure 4 (below). Whereas the main variables characterizing pure premium are driver related (license dates, location/zip code), the main variables characterizing the current premium structure are car related (power, age of car), reflecting traditional risk structures. Nevertheless, inspection of the pure premium risk groups has its own merits by disclosing niches of losses or gains. Inspecting Figure 1, one finds one small segment with recent license dates and an extremely high pure premium (upper right node). Apparently, some major accident costs were incurred in that segment. This may be a result of the fact that in this analysis accident losses were not "capped" at (i.e. limited to) a certain amount, allowing a few large losses to bias the analysis.
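The profitability check carried out in the lower left branch of Figure 3b, and described in the next paragraph, amounts to summarizing the difference between premium and pure premium within each tree leaf (Table 2 below). A minimal sketch of that SAS Code node step, assuming the leaf identifier exported by the tree node is called leaf_id (the actual exported name depends on the Enterprise Miner version) and building on the work.pure_premium data set sketched above:

   proc means data=work.pure_premium mean std n;
      class leaf_id;               /* risk segment from the decision tree */
      var diff;                    /* premium minus pure premium          */
      output out=work.segment_diff mean=mean_diff std=std_diff;
   run;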
Figure 4: Risk Classes for Premium Paid

The second approach (identifying premiums in need of adjustment) is something that should be monitored on an ongoing basis, as part of a profitability analysis: in all risk groups, the premium paid should always be well above the pure premium, otherwise the company is sure to lose money in that particular risk segment. This analysis is carried out in the lower left branch of the diagram of Figure 3b. The risk categories are determined through decision tree analysis; a variable identifying the leaves of the tree is exported to a Transform Variable node, where for all leaves the difference between premium paid and pure premium is calculated. The resulting data set is sorted and statistically summarized in a SAS Code node, and displayed using the Insight node, as shown in Table 2.

GROUP   _FREQ_   _STAT_   DIFF
8       931      MEAN     -558,472
8       931      STD      66225,63
9       1065     MEAN     6505,428
9       1065     STD      4645,161
10      1237     MEAN     768,6775
10      1237     STD      3951,908
11      546      MEAN     3445,336
11      546      STD      4333,881
12      104      MEAN     14644,7
12      104      STD      6220,951
13      47       MEAN     6257,001
13      47       STD      5401,436
14      73       MEAN     18199,08
14      73       STD      8326,08
15      16       MEAN     27253,4
15      16       STD      8735,049

Table 2: Differences between Premium and Pure Premium in Each Risk Segment

The results identify one segment (Group 8) where the difference between premium paid and pure premium is negative. Inspection of Figure 4 reveals that segment 8 (lower left node) is fairly sizeable (N=931) and consists of customers with cars in power categories between 6.5 and 10.5 and an age of the car of less than 10.5 years. This segment should now be the focus of intense scrutiny by actuaries and underwriters. The results of this section have shown how pure premium can be predicted using data mining technology and how this prediction can be used to assess and improve the current rate structure.
The improvements may take the form of quantitative rate adjustments in the case of segments that identify over- or under-performing customer groupings, or may take the form of qualitative adjustments by taking into account different risk variables or classes.

2.4 Rate Making Using Traditional Statistics

Traditionally, statistical techniques have been used for rate making. A detailed discussion of such an approach is beyond the scope of this paper. The interested reader is referred, however, to the appendix, where one such sample analysis, including a comparison to a corresponding data mining analysis, is carried out. The statistical approach involves as a first step the definition of the risk variables and their value structure. In principle, the significance of every potential risk variable for accident likelihood and claim size can be computed; in practice, however, there are severe limitations on the number of variables that can be used. A "full blown" analysis using all variables and all interactions between them is generally not feasible. Just using 10 variables with 10 values each (such as 10 regions, or 10 age groups) would generate 10 billion potential risk groups to be analyzed. Extreme care therefore needs to be taken in deciding which variables to include in the analysis and how to structure or break up their values. This is a task performed by the actuaries in an insurance department. Once this has been done, the pure premium can be calculated, using for example variants of statistical regression. Knowledge of the pure premium for the risk combinations will then lead to the formulation of "underwriting rules", which specify which risks will be insured at which prices. The underwriting rules will not only be based on the pure premium, but also on other business cost and profitability considerations, on competition and marketing goals, as well as on historically grown business mandates and customer orientations ("the corporate culture"). Changing the rating structure therefore becomes a major endeavor and is likely to provoke massive resistance within the insurance organization. In times of rapid market changes, including economic and competitive challenges, this resistance to change is becoming a liability, even a threat to corporate survival. New methods of rate making are needed that can take into account many more and different types of variables, and whose effects on rates can be analyzed reliably and in short time spans. In particular, the increasing use of "lifestyle" variables such as hobbies, travel and vacation habits, residence style, place, and neighborhood, including its demographic characteristics, complicates a traditional statistical analysis beyond repair. Methods that replace standard, exhaustive approaches with more heuristically oriented, interactive approaches are needed for more flexibility, for handling larger variable sets, and for automatic or interactive determination of optimal value structures. Most of the tools needed are still statistically based, but now require much greater diversity and time commitment from potential users: decisions on variable selection and value structure that used to be worked out by actuaries over years would now have to be made for many more variable groups in a matter of months or even weeks. The solution that makes this task feasible is data mining.
Data mining provides several major productivity enhancements to the rate making process by automating large portions of the rate making process in a flow diagram (such as the ones discussed in Figures 3a and 3b above):
- The effects of changes (using different variables, different value groups, different models) can be quickly explored.
- The effects of changing the statistically optimal rate structure to an alternate rate structure that takes into account market pressures and strategic visions can also be quickly explored in the same diagram, using e.g. cost/profit based lift charts as part of the assessment process.
- Technical details, such as determining optimal value structures and important variables and interactions, can be handled semi-automatically; this does not require quite as much statistical sophistication as was needed before and therefore relieves some bottleneck resources.
- Important logistic tasks, such as accessing current versions of the data from databases or data warehouses, validating and preprocessing the data, supporting the application of resulting models in operational systems, and extracting rules for inspection in the underwriting department, are supported by the data mining environment.
This greatly improves flexibility and market response time. Most importantly, the analysis can include non-traditional variables for innovative rate structures and devise rates specifically for high-profit customers (with so-called high "lifetime value"), in fact making strides toward an almost individualized rate scheme. This gives the company the chance to take a lead in creating new market opportunities and to focus on market niches that so far had been considered too small to be included in the standard rate making process.
For the traditional statistician these new opportunities come at a price: the optimality of the rate structure can no longer be guaranteed when many variables are involved. The trade-off involves giving up an exhaustive analysis of a manageable set of variables and values in favor of a more approximate analysis that uses all information that might potentially be relevant. Such an approach exhibits a much higher flexibility to target the relevant risk classes.

3. Data Mining Application: Finding a New Rate Structure

In this chapter we address the question of how to build a completely new rating structure. Again, there is not a single "correct" perspective on how to approach this, but different practices ranging from manual efforts by actuaries and underwriters over statistical procedures to special purpose software provided by insurance vendors. One of the recent prominent ways to devise a new rating structure is to focus on the "loss ratio", i.e. the ratio between claim losses and premiums. This requires the use of historical data covering multiple years, as losses are random events and cannot be modeled realistically by focusing on only a single rating period. Provisions also need to be made so that a rare single large loss does not bias the analysis unduly; therefore losses are usually "capped" at a certain amount.

Figure 5: Generation of New Rate Structure Using Loss Ratio

The rate making process consists of predicting the loss ratio from the risk variables and then focusing on those rate groups that show a relatively high variance. This variance would indicate that there are subgroups with different business implications in a given rate class, which then need to be subdivided in order to achieve a more homogeneous behavior. This process is illustrated in the diagram of Figure 5: First the loss ratio is predicted using the flow leading to a decision tree (labeled "Loss Ratio", top two rows). In the course of it, a Transform Variable node is used to define the loss ratio, a Filter Outlier node to cap the loss ratio, and a Data Set Attributes node to define loss ratio as a target. Then, using the Reporter node we can inspect the decision rules produced by the decision tree analysis (labeled "Loss Ratio"), with the purpose of focusing on those nodes with a high standard deviation in loss ratio. For example, the following rule is found:

IF Car brand IS ONE OF: CITROEN NISSAN
AND Age of the car < 13.5
AND 1982.5 <= Driving license date
THEN NODE: 13, N: 107, AVE: 4.12953, SD: 6.28759

This can be summarized as follows: there is a high loss ratio of about 413% (see below), coupled with a high standard deviation, if the conditions listed after the "IF" are met. This is a small segment with 107 customers; its internal node number is 13. When interpreting the loss ratios, we need to recall that the loss ratio is defined as the ratio of costs to premium. A loss ratio of 1 then means that 100% of the premiums are needed to offset costs. In the rule above, the ratio is approximately 4.13, hence more than four times the collected premiums (i.e. 413%) are needed to offset costs. Using the Filter Outlier node, the terminal nodes in question (here only node 13 from the example above will be considered) can be further analyzed, either automatically by a subsequent tree analysis or interactively (lower left part of Figure 5). The purpose of such an analysis is to further reduce the variance of the loss ratio by subdividing the customers into separate risk groups. This is illustrated in Figures 6a and 6b.

Figure 6a: Interactive Exploration of Risk Classes: Using "Age of the car"

Pursuing the interactive analysis of node 13 from the example above in more detail, we generate the potential variables for the first split of the decision tree, and may select "age of the car" as the splitting variable, as shown in Figure 6a. As a result, we find a subset with a very high loss ratio (789%) and a high standard deviation for the customer group with "age of the car >= 9.5". Attempting an alternative split using the car location ("French ZIP code") we get a similar result, as shown in Figure 6b.

Figure 6b: Interactive Exploration of Risk Classes: Using "French ZIP code"

Again there is a subgroup ("French ZIP code >= 68,340") with high loss ratios and high standard deviations. Insurance expertise is needed to decide among these and other alternative rate groupings. Assuming we use the first split on "age of the car" (as in Figure 6a), a potential strategy of rate modification might be to increase the rates for these clients, or, more drastically (as done here), to exclude the clients with "age of the car >= 9.5" from coverage, assuming this is legally possible. As shown in the lower right part of Figure 5, these "high losers" are excluded using a SAS Code node, and the resulting modified loss ratio is calculated in a Transform Variable node. Even though only a few cases are affected (34 in the training set), the loss ratio improves from 82.7% in the training set to 79.0% (node "New Loss Ratio" in Figure 5). Note that more sophisticated strategies (such as differentiated rate adjustments based on profitability analysis) can be used in situations where the "high cost groups" cannot simply be excluded.
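The exclusion step performed in the SAS Code node and the recalculation of the overall loss ratio can be sketched in plain SAS as follows. The data set and variable names (work.train, node_id, age_of_car, costs, premium) are illustrative assumptions rather than the names used in the actual flow:

   data work.kept;
      set work.train;
      /* drop the "high losers": node 13 customers with older cars       */
      if node_id = 13 and age_of_car >= 9.5 then delete;
   run;

   proc sql;
      /* overall loss ratio = total claim costs / total premiums         */
      select sum(costs) / sum(premium) format=percent8.1 as loss_ratio
      from work.kept;
   quit;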
The example in this chapter shows how data mining can be used to improve the loss ratio, which has a direct implication for the profitability of an insurance company. While much of this process can be automated, many detailed decisions (such as which variables from a suggested list to include in the final rate structure) require human intervention, and in particular well-founded business expertise. The analysis shown can be used in different ways, from deploying an actual new rating system to simply analyzing current profitability.

4. Data Mining Application: Refining an Existing Rate Structure

Typically, rates can be redefined only occasionally and need to remain within a stable framework for some time. So a more realistic task involves adapting and refining a rate structure that already exists. This process is very similar to the one just outlined: we start by analyzing the current rate structure using PREMIUM as the target variable. The resulting segments are examined for high variation, either of the premium itself or, if that is stable, of the loss ratio. The segments in question are then adjusted for rate increases or discounts. The complete flow is shown in Figure 7 below. First we have the flow analyzing the premium using a decision tree (top row of the diagram). In the course of it, a loss ratio variable is defined in a Transform Variable node and training and validation data are generated in a Data Partition node. The first several levels of the resulting decision tree are shown in Figure 8. The second part of the analysis builds a so-called "stratified model" for the segments to be modified (middle part of Figure 7), and the last part of the analysis defines the new, refined rating structure and calculates the gains in loss ratio achieved (bottom part of Figure 7). These last two parts of the analysis can be repeated iteratively until satisfactory improvements are realized. The first part of the analysis generated the risk factors for the currently used premium structure, as shown in Figure 8. From this tree display, or alternatively from the rule set produced by the Reporter node, or using predefined criteria (such as standard deviations or premiums exceeding a certain amount), we arrive at criteria for selecting tree segments for further refinement. The most common of these criteria is a high amount of variation within a tree node. For the example in Figure 8 below, various segments are shown, each identifying the number of customers (N: 1295 at the top segment), the average rate for customers in this segment (approximately 6274 in the top segment), and a segment identification number (node 1 in the top segment). We will focus on the two segments with identification numbers 3 and 6: node 3 is in the top right part of the diagram ("Car value >= 17.5"; the node id itself is not shown in the figure), and node 6 is at the lower left of the diagram.
Figure 7: Refining an Existing Rate Structure

In the second part of the analysis, these two segments are selected and then fed into another decision tree (in the middle row of the diagram in Figure 7). For this purpose a filter variable is generated in a Transform Variable node and the data contained in nodes 3 and 6 are selected in a SAS Code node. The node identification variable is used as a group variable, which is needed for building a stratified model for these two segments. In data mining, we speak of a "stratified model" when we build separate models for separate parts of the data. A common case might involve separate models based on gender, or, for an internationally operating insurance company, based on country. Computationally, this requires in Enterprise Miner the use of a Group node, which cycles through the values of the critical variable (two cycles for gender, multiple cycles for a set of different countries). In each cycle a separate model is built, and the resulting models are then combined using an Ensemble node. This is shown in the middle part of Figure 7: we have a Group node and an Ensemble node before and after the decision tree. Since the segments entering the decision tree are subdivided by the decision tree analysis, the premiums and profitability will need to be reconsidered for each of the resulting subsegments. This is done in a Transform Variable node in the lower left part of Figure 7 (labeled "New Premiums"), where the premiums for the various sub-segments are
newly calculated in the following way: if the profitability is positive in a subsegment, then an appropriate premium reduction is offered to the clients (here 5%); otherwise a rate increase is in order (here 10%). Finally, in the SAS Code node toward the bottom right of Figure 7 (labeled "Put data sets back together"), the refined data are merged back into the remaining, unrefined data, and the resulting new premiums and loss ratios are computed to conclude the analysis.

Figure 8: Result of Predicting Premiums

Node 3:
IF 17.5 <= Car value
THEN NODE: 3, N: 32, AVE: 23202.4, SD: 9753.32

Node 6:
IF Age of the car < 8.5
AND Car category based on power < 6.5
AND Car value < 17.5
THEN NODE: 6, N: 213, AVE: 6352.95, SD: 3473.55

Table 3a: Risk Definitions of the 2 Segments Before Refinement

IF 17.5 <= Car value
AND IF Tariff zone IS ONE OF: 8 9 10
THEN NODE: 3, N: 10, AVE: 30293.9, SD: 10478

IF 17.5 <= Car value
AND IF Age of the car < 8.5
AND Tariff zone IS ONE OF: 2 3 4 5 6 7
THEN NODE: 4, N: 14, AVE: 22460.2, SD: 6779.28

IF 17.5 <= Car value
AND IF 8.5 <= Age of the car
AND Tariff zone IS ONE OF: 2 3 4 5 6 7
THEN NODE: 5, N: 8, AVE: 15637, SD: 6464

Table 3b: Risk Definition of Segment 3 After Refinement

The effects of the rate refinement can be assessed qualitatively by comparing the definitions of the risk classes before and after the refinement, and quantitatively by comparing the loss ratios before and after refinement. Tables 3a, 3b, and 3c show the various risk definitions encountered.
The rules shown in Tables 3b and 3c are not yet in optimized form, as they originated from two separate decision trees. For example, the first rule from Table 3c contains the clause "Car value < 17.5" from decision tree 1 and the clause "Car value < 4.5" from decision tree 2. Obviously, the first clause should be dropped. Such code optimization, however, is beyond the scope of this paper.

IF Age of the car < 8.5
AND Car category based on power < 6.5
AND Car value < 17.5
AND IF Car value < 4.5
AND 6.5 <= Age of the car
THEN NODE: 6, N: 50, AVE: 5807.4, SD: 3243.92

IF Age of the car < 8.5
AND Car category based on power < 6.5
AND Car value < 17.5
AND IF Age < 36
AND Tariff zone IS ONE OF: 1 2 3 4
AND Age of the car < 6.5
THEN NODE: 8, N: 14, AVE: 8372.93, SD: 3081.97

IF Age of the car < 8.5
AND Car category based on power < 6.5
AND Car value < 17.5
AND IF 36 <= Age
AND Tariff zone IS ONE OF: 1 2 3 4
AND Age of the car < 6.5
THEN NODE: 9, N: 51, AVE: 4440.06, SD: 1458.21

IF Age of the car < 8.5
AND Car category based on power < 6.5
AND Car value < 17.5
AND IF Age < 40.5
AND Tariff zone IS ONE OF: 5 6 7 8 9 10
AND Age of the car < 6.5
THEN NODE: 10, N: 15, AVE: 8539.47, SD: 3664.78

IF Age of the car < 8.5
AND Car category based on power < 6.5
AND Car value < 17.5
AND IF 40.5 <= Age
AND Tariff zone IS ONE OF: 5 6 7 8 9 10
AND Age of the car < 6.5
THEN NODE: 11, N: 35, AVE: 6562.66, SD: 3375.42

IF Age of the car < 8.5
AND Car category based on power < 6.5
AND Car value < 17.5
AND IF Tariff zone IS ONE OF: 1 2 3 4
AND 4.5 <= Car value
AND 6.5 <= Age of the car
THEN NODE: 12, N: 28, AVE: 6715, SD: 4768.79

IF Age of the car < 8.5
AND Car category based on power < 6.5
AND Car value < 17.5
AND IF Tariff zone IS ONE OF: 5 6 7 8 9 10
AND 4.5 <= Car value
AND 6.5 <= Age of the car
THEN NODE: 13, N: 20, AVE: 8667, SD: 2477.46

Table 3c: Risk Definitions for Segment 6 After Refinement

The resulting rules split the segments into several risk subcategories whose merits and benefits need to be assessed from a business and marketing point of view. Data mining provides "data driven" suggestions whose implications can then be investigated. As a result of the refinement in this example, the loss ratio improved from the original 87.0% to 80.2%. In this final example, a practical use of data mining for incrementally adjusting the current rate structure was given. This allows insurance companies to maintain continuity in their current rating system, while at the same time focusing on segments which show high losses or high variability, or simply are under competitive pressure. The segments can then be redefined, potentially using additional variables that so far had not been included in the analysis. Data mining will suggest the most significant such variables but allow the user to make his own final selections. Data mining will also support the potentially complex process of building different models for different customer segments, putting the results back together, and iterating the process until a satisfactory result is obtained.
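The premium adjustment carried out in the Transform Variable node labeled "New Premiums" amounts to a simple rule. The following is a minimal sketch in plain SAS; the paper applies the rule per subsegment, while the sketch applies it per record for brevity, and the data set and variable names (work.refined_segments, premium, costs) are illustrative assumptions:

   data work.new_premiums;
      set work.refined_segments;          /* assumed subsegment data      */
      profitability = premium - costs;    /* simple per-record proxy      */
      if profitability > 0 then
         new_premium = premium * 0.95;    /* 5% reduction for good risks  */
      else
         new_premium = premium * 1.10;    /* 10% increase otherwise       */
   run;

In practice the adjustment percentages, and whether profitability is judged per record or per segment, would be set by the underwriting department.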
In addition to standard statistical procedures, data mining can support the process of rate making. Its strength lies in identifying small niches of profitability and losses that might otherwise not be found. Another strength is the use of all available variables in optimizing the prediction of premiums, profitability, or loss ratios, rather than using only those variables that, in the opinion of the expert, are the most natural predictors. Additionally, optimal value ranges for high-dimensional variables (ages, zip codes) and value groupings for categorical variables (professions, car models) are found as an automatic by-product. Finally, the resulting models can be compared to each other and to currently used practices.

It has also been shown that rate making is not a "one-dimensional process": there are several legitimate approaches that can be taken, each under different circumstances, disclosing different aspects of business value. Pure premium, premium analysis, and loss ratio all have their value in this process.

There are also practical considerations to be taken care of, for example making sure that "similar clients" pay similar amounts of premium. This cannot be guaranteed by data mining: clients that end up in different tree segments may pay entirely different rates, however similar they may appear to be. This turns out to be a difficult question, because there are a number of cases, such as young male drivers or drivers living in high-accident environments, where certain characteristics (age, location) may override any other similarities that may be present. In current practice, smoothing the transitions between neighboring risk classes is usually done by hand, again as part of a post-processing business review of the results.

6. References

Rate making is a topic of particular competitive sensitivity; therefore there is very little, if any, public material on rate making on the market. The material in this paper is based on verbal communication from SAS consultants and on the author's own consulting experience. More general references on data mining for the insurance industry and the SAS Best Practice Papers mentioned above are as follows:

Data Mining Solutions for the Insurance Industry: http://www.kdnuggets.com/solutions/insurance.html

General Insurance Links on the WWW:
http://www.barryklein.com/publishers.htm
http://www.indm.com/resource.htm

SAS Best Practice Papers, General Information: http://sww.sas.com/mkt/smw/news/isi/aug99/stories/02.html

Data Mining and the Case for Sampling, SAS Institute, 1999
Data Mining in the Insurance Industry, SAS Institute, 2000
Introduction to the Enterprise Miner: Getting Started with Enterprise Miner Software, SAS Institute, 1999
Enterprise Miner Software, A SAS Institute Product Overview, 1998
Appendix A. An Introduction to Statistical Rate Making

A.1 Introduction

Traditionally, rate making has been accomplished using statistical procedures. One of the most commonly used techniques is linear regression. Standard linear regression models assume that the response variable is normal or can be transformed to a normal one. This assumption is usually violated for categorical variables and also for bounded variables, such as positive counts. An extension of linear regression that handles these situations are "generalized linear models". The main extension involves a so-called "link function" L, so that instead of the regression equation

Y = X * Coefficient_Vector

we have

L(Y) = X * Coefficient_Vector.

If Y represents an amount (therefore > 0), then it would be difficult to enforce constraints in the standard regression equation that would ensure that Y remains positive for arbitrary values of X. A better approach would be the use of generalized linear regression with a link function such as L(y) = log(y). In this case the response term L(y) can range freely through positive and negative numbers, and the regression does not need to model any constraints. When recomputing Y from log(y) using the formula y = exp(L(y)), Y will always turn out to be positive, as required. The constraint has thus been enforced through the LOG function and its inverse (here the EXP function) and need not be handled by the regression coefficients. Similar link functions can be derived for other situations; for example, if 0 < y < 1, a link function of the type log(y/(1-y)) might be used. More formally, in generalized linear models the relationship between the expected value of the response variable (e.g. claim count, claim cost) and the linear predictor is given by a monotonic, differentiable link function.

The situation discussed above applies when using traditional linear models for rate making: it needs to be taken into account that the model will be used to predict counts of accidents and costs of claims, both of which are non-negative, thereby violating some assumptions of the linear model. Counts also do not follow the normal distribution: most clients will (hopefully for the company) have no accidents, some will have a single accident, and fewer clients will have two or even more accidents. When estimating the average cost of a claim, the result will be restricted to a positive number.

Another assumption of linear regression is that the variance of the data is constant for all observations. This may not be reasonable in the context of rate making; the variance may increase with the mean of the costs or vary with particular risk factors. It is usually the case with bounded variables that the variance depends on the mean of the variable: when analyzing a risk group whose mean of the response variable approaches either end of the interval, the variance approaches zero. A generalized linear model handles this situation by assuming that the variance is an appropriate function of the mean:

Variance(y) = F(mean_y)

During the estimation of the regression equation, the observations are then weighted inversely according to the variance function F. For this to lead to statistically correct results, it is required that the observations have a probability distribution from an exponential family. The most common situations involve the Poisson distribution (for counts) and the Gamma distribution (for positive interval variables, such as amounts).
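Written more compactly in standard generalized linear model notation (the symbols g, beta, phi, V and mu are generic textbook notation, not taken from this paper), these relations read:

\[
  g\bigl(\mathrm{E}[Y]\bigr) = X\beta ,
  \qquad
  \operatorname{Var}(Y) = \phi \, V\bigl(\mathrm{E}[Y]\bigr)
\]

with, for example,

\[
  \text{counts (Poisson): } g(\mu)=\log\mu,\ V(\mu)=\mu ; \qquad
  \text{amounts (Gamma): } g(\mu)=\log\mu,\ V(\mu)=\mu^{2} ; \qquad
  \text{proportions (binomial): } g(\mu)=\log\tfrac{\mu}{1-\mu},\ V(\mu)=\mu(1-\mu).
\]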
This situation is summarized in Table A-1, below.
                    Linear Model   Logistic Regression   Poisson Regression (Log Linear Model)   Gamma Model with Log Link
Response Variable   Continuous     A proportion          A count                                 Positive continuous
Distribution        Normal         Binomial              Poisson                                 Gamma
Link Function       Identity       Logit                 Log                                     Log

Table A-1: Model Choices and Parameters For Different Types Of Response Variables

A.2 Analysis using Generalized Linear Models

The data set in Table A-2 below originated from France and represents data on car insurance. There are 2160 observations; 5 independent risk variables (car power, car value, tariff zone, age of driver, and car age) are used in the analysis. Each of these variables is discretized into the following risk categories, with the specific values chosen to enable a comparison with the data mining solution explained below:

power:   < 5.5, 5.5-6.5, >= 6.5
value:   < 13.5, 13.5-17.5, >= 17.5
t_zone:  < 5, >= 5
age:     < 32.5, >= 32.5
car_age: < 7.5, 7.5-8.5, 8.5-11.5, 11.5-14.5, >= 14.5

The number of potential risk classes is 3*3*2*2*5 = 180. Using the actual data, there are many empty cells, so that only 87 categories are actually filled.

Variable   Role      Description
NO_ACC     target    Number of accidents
C_LICENS   rejected  Period of driving license
BRAND      rejected  Car brand
USAGE      rejected  Car usage
PROF       rejected  Profession
POWER      input     Car category based on power
VALUE      input     Car value
ZIPCODE    rejected  French ZIP code
T_ZONE     input     Tariff zone
M_STATUS   rejected  Marital status
LICENSE    rejected  Driving license date
CLOSED     rejected  Closed garage?
PREMIUM    rejected  Premium paid
CONT       rejected  Contract ID
NOACC      rejected  Number of accidents
COST       target    Total accident costs
BONUS      rejected  No claims bonus
AGE        input     Age
CAR_AGE    input     Age of the car
CONT_AGE   rejected  Age of the contract
PROFIT     rejected  Profitability (= premium - cost)
GOOD_BAD   rejected  25% most profitable are 1
CRASH      rejected  Had accident?

Table A-2: Variables Used In Generalized Linear Model Example

The models for accident frequency and average cost are computed using appropriate procedures of the SAS System; the code is shown in Appendix B.
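For illustration only, the discretization above could be implemented with a simple data step that builds the risk-class variables used in the GENMOD analysis of Appendix B (POWER_ID, VALUE_ID, TARIF_ID, AGE_ID, CAR_AGES); the name of the raw input data set, insur_raw, is an assumption:

/* Hypothetical sketch: derive the discretized risk classes from the     */
/* raw variables of Table A-2, using the cut points listed above.        */
data insur4;
   set insur_raw;
   length power_id value_id tarif_id age_id car_ages $10;

   if      power < 5.5 then power_id = '<5.5';
   else if power < 6.5 then power_id = '5.5-6.5';
   else                     power_id = '>=6.5';

   if      value < 13.5 then value_id = '<13.5';
   else if value < 17.5 then value_id = '13.5-17.5';
   else                      value_id = '>=17.5';

   if t_zone < 5 then tarif_id = '<5';
   else               tarif_id = '>=5';

   if age < 32.5 then age_id = '<32.5';
   else               age_id = '>=32.5';

   if      car_age <  7.5 then car_ages = '<7.5';
   else if car_age <  8.5 then car_ages = '7.5-8.5';
   else if car_age < 11.5 then car_ages = '8.5-11.5';
   else if car_age < 14.5 then car_ages = '11.5-14.5';
   else                        car_ages = '>=14.5';
run;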
Results are shown in Table A-3 below and indicate that all variables except car value are significant in estimating the claim frequency.

   Source      Deviance   NDF   DDF   F          PrF      ChiSquare   PrChi
1  INTERCEPT   185.0353   0     624   .          .        .           .
2  TARIF_ID    180.6882   1     624   24.8358    0.0001   24.8358     0.0001
3  AGE_ID      166.7628   1     624   79.5583    0.0001   79.5583     0.0001
4  VALUE_ID    166.7138   2     624   0.1399     0.8695   0.2798      0.8694
5  POWER_ID    129.0387   2     624   107.6219   0.0001   215.2437    0.0001
6  CAR_AGES    109.2216   4     624   28.3045    0.0001   113.2180    0.0001

Table A-3: Results from Generalized Linear Models

When checking the significance of the various risk categories, we see from Table A-4 below that there are insignificant groups, such as power < 5.5 and car_ages 7.5-8.5 and 8.5-11.5. At this point the analysis could be re-run after collapsing insignificant groupings with their neighbors (a sketch of this step follows at the end of this section).

    Parameter   LEVEL1      DF   Estimate   Std Err   ChiSquare   PrChi
 1  INTERCEPT               1    -1.5857    0.3868    16.8052     0.0001
 2  TARIF_ID    <=5         1    -0.2932    0.1038    7.9746      0.0047
 3  TARIF_ID    >5          0     0.0000    0.0000    .           .
 4  AGE_ID      <32.5       1     1.2162    0.0925    172.8556    0.0001
 5  AGE_ID      >=32.5      0     0.0000    0.0000    .           .
 6  VALUE_ID    13.5-17.5   1     0.1566    0.4260    0.1351      0.7132
 7  VALUE_ID    <13.5       1     0.3159    0.3732    0.7165      0.3973
 8  VALUE_ID    >=17.5      0     0.0000    0.0000    .           .
 9  POWER_ID    5.5-6.5     1    -0.2191    0.1340    2.6746      0.1020
10  POWER_ID    <5.5        1    -0.0378    0.1077    0.1233      0.7254
11  POWER_ID    >=6.5       0     0.0000    0.0000    .           .
12  CAR_AGES    11.5-14.5   1     0.1975    0.1380    2.0475      0.1525
13  CAR_AGES    7.5-8.5     1    -0.0457    0.1769    0.0666      0.7963
14  CAR_AGES    8.5-11.5    1    -0.0526    0.1436    0.1341      0.7142
15  CAR_AGES    <7.5        1    -0.3637    0.1407    6.6820      0.0097
16  CAR_AGES    >=14.5      0     0.0000    0.0000    .           .
17  SCALE                   0     1.1479    0.0000    .           .

Table A-4: Tests Of Significance For The Different Value Ranges

The discussion above is only intended to convey a general idea of how this statistical process proceeds. In general, care needs to be taken to define appropriate risk categories in the first place. The break-up of the various variable values into ranges would be done on the basis of experience; collapsing of value ranges may need to be tried out in different ways, as some such aggregations may prove not to be significant either. Data need to be present for every cell combination (which is not the case above).
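As a sketch of the collapsing step mentioned above, the two insignificant CAR_AGES ranges could, for example, be merged and the frequency model re-fitted; the data set names are assumptions, and the MODEL statement follows the code in Appendix B (including the target accids and the offset Ln):

/* Hypothetical sketch: merge the insignificant ranges 7.5-8.5 and       */
/* 8.5-11.5 of CAR_AGES into one class and re-fit the Poisson model.     */
data insur4b;
   set insur4;
   if car_ages in ('7.5-8.5', '8.5-11.5') then car_ages = '7.5-11.5';
run;

proc genmod data=insur4b;
   class tarif_id age_id value_id power_id car_ages;
   model accids = tarif_id age_id value_id power_id car_ages
         / dist=poisson link=log offset=Ln type1 type3;
run;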
A.3 Analysis using Data Mining

In Figure A-1 the same data set as above (Table A-2) is analyzed using data mining. The discussion of this flow diagram proceeds in two parts: part one involves the automatic determination of value ranges and is shown in the right part of Figure A-1 (the rightmost two icons of the top and middle rows). Part two involves the estimation of the accident counts and amounts, using the first four icons in the top row of Figure A-1 and the leftmost three icons (Losses, Accidents, Reporter) from the lower part of the diagram. An Assessment icon at the bottom adds a comparison between these two parts.

Figure A-1: Comparative Data Mining Analysis

Part one: Determination of Value Ranges: The discretization of the continuous variables can also be achieved using the Variable Selection node, but this leads to quite different results, since there the continuous variables are broken up into 16 different ranges. In this case the results of the Variable Selection node do not help to improve the subsequent tree analysis; interestingly, when using Variable Selection the neural net performs slightly better than the tree (but worse than the tree without Variable Selection).

Part two: Estimation of Accident Counts and Amounts: In the Transform Variables node, a copy of the variable CRASH (see Table A-2) is created and declared as a target in the subsequent Data Attributes node. After the partitioning of the data, a tree analysis predicting accident likelihood is performed for this target, followed by the prediction of the amounts of the accident losses (the nodes labeled "Accidents" and "Losses"). The subsequent Reporter node allows the inspection of the risk categories created in the model for this prediction, which is shown in semi-English rule form in Figure A-2 below. The last rule reads: if the car is at least 11.5 years old, the tariff zone is 6-10, and the age of the driver is between 35.5 and 50, then there is a 66.7% chance of having one or more accidents; there are 21 such people in the current data, and node 15 of the decision tree identifies this situation.

Compared with the 87 risk classes generated by the statistical model, the result is quite different: there are now only 8 different risk groups. Also, the risk categories are quite heterogeneous, both in terms of structure and size. Structurally, there are risk classes consisting of only a single variable (age < 32.5) as well as classes made up of 2 or 3 variables. The main advantages are:
- the automatic discretization of the continuous variables, based not on experience but on optimization using the actual input data
- the creation of an optimal set of rate classes, based on the use of validation data
- the explicit representation of the interactions between the risk variables; e.g. the second rule in Figure A-2 specifies the particular values of car power and driver's age that form an interaction
IF Age < 32.5
THEN NODE: 4, N: 165, 1+ ACCIDENTS: 70.9%, 0 ACCIDENTS: 29.1%

IF Car category based on power < 7.5 AND 32.5 <= Age < 35.5
THEN NODE: 8, N: 52, 1+ ACCIDENTS: 40.4%, 0 ACCIDENTS: 59.6%

IF 7.5 <= Car category based on power AND 32.5 <= Age < 35.5
THEN NODE: 9, N: 12, 1+ ACCIDENTS: 91.7%, 0 ACCIDENTS: 8.3%

IF Tariff zone IS ONE OF: 1 2 3 4 5 AND 35.5 <= Age < 50.5
THEN NODE: 10, N: 306, 1+ ACCIDENTS: 19.3%, 0 ACCIDENTS: 80.7%

IF Tariff zone IS ONE OF: 1 2 3 4 5 AND 50.5 <= Age
THEN NODE: 12, N: 371, 1+ ACCIDENTS: 5.1%, 0 ACCIDENTS: 94.9%

IF Tariff zone IS ONE OF: 6 7 8 9 10 AND 50.5 <= Age
THEN NODE: 13, N: 106, 1+ ACCIDENTS: 19.8%, 0 ACCIDENTS: 80.2%

IF Age of the car < 11.5 AND Tariff zone IS ONE OF: 6 7 8 9 10 AND 35.5 <= Age < 50.5
THEN NODE: 14, N: 45, 1+ ACCIDENTS: 22.2%, 0 ACCIDENTS: 77.8%

IF 11.5 <= Age of the car AND Tariff zone IS ONE OF: 6 7 8 9 10 AND 35.5 <= Age < 50.5
THEN NODE: 15, N: 21, 1+ ACCIDENTS: 66.7%, 0 ACCIDENTS: 33.3%

Figure A-2: Decision Rules Resulting From Data Mining Analysis

Summarizing the results, the same set of variables is significant in both the statistical and the data mining analysis: age, car age, car power, and tariff zone. The structure of the risk classes has turned out to be quite different, however. Whereas in the statistical model all interactions between the risk variables are computed at first (and need to be reduced in subsequent steps), the data mining approach limits itself to generating only the significant interactions from the start. The latter approach scales well when many more variables are used.

B. Code for Generalized Linear Models Estimating Accident Frequency (Counts)

/* Poisson regression with log link for the accident counts.             */
/* Ln is the offset variable (presumably the log of the exposure);       */
/* the MAKE statements save the listed output tables as data sets.       */
proc genmod data=insur4;
   class tarif_id age_id value_id power_id car_ages;
   model accids = tarif_id age_id value_id power_id car_ages
         / DIST=POISSON LINK=LOG OBSTATS OFFSET=Ln DSCALE
           TYPE1 TYPE3 MAXIT=500 CORRB;
   make parmest out=parmest1;
   make obstats out=obstat1;
   make modfit  out=modfit1;
   make type1   out=a_type1;
   make type3lr out=a_type3;
run;
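As a usage note, not part of the original program: the OBSTATS table saved above in obstat1 contains, among other statistics, the predicted mean for each observation in the variable Pred. A sketch of attaching these expected claim frequencies back to the policy records could look as follows; the one-to-one merge by observation order and the new variable name expected_claims are assumptions.

/* Hypothetical sketch: attach the predicted claim frequencies (Pred)    */
/* from the OBSTATS output back to the policies, by observation order.   */
data insur4_pred;
   merge insur4
         obstat1(keep=Pred rename=(Pred=expected_claims));
run;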
C. Enterprise Miner - SAS Institute's Solution for Data Mining

Data mining involves an interactive, iterative procedure to generate new information from data. SAS Institute defines data mining as "the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns of data for business advantage." What is required to structure the data mining process is a framework of data mining tasks and the sequence of these tasks. SAS Institute defines this framework as the SEMMA methodology. SEMMA stands for Sample, Explore, Modify, Model, and Assess and describes a sequence of steps that may be followed during a data mining analysis. This logical superstructure provides users with a scientific, structured way of conceptualizing, creating, and evaluating data mining projects. The graphical user interface (GUI) and functionality of Enterprise Miner are constructed to support this methodology.

As Figure A-3 below shows, the Tools window on the left contains all the analysis options, organized according to the SEMMA process. Users can choose the tool nodes either from the Tools window or by customizing their own tool bar at the top of the window. By dragging and dropping the tool nodes onto the diagram editor, the user can construct a process flow diagram (PFD) of his own data mining project. The features of Enterprise Miner, ordered by the SEMMA methodology, are as follows:

Data Access and Sampling:

Input Data Source Node: enables the user to specify the data sources to be used in the analysis, whether they be SAS data sets or views to any of the over 50 different data structures SAS software can read, including ORACLE, DB2, Excel, and others. If SAS Institute's data warehousing solution, SAS/Warehouse Administrator software, is installed, then Enterprise Miner can directly read the relevant metadata from there; if not, Enterprise Miner can create this metadata as a first step.

Sampling Node: analysis can be performed on the full data store or on representative samples of the data to increase efficiency. The Sampling node performs simple random sampling, nth-observation sampling, stratified sampling, first-n sampling, or cluster sampling of the input data.

Data Partition Node: enables the user to partition the input data source into training, validation, and test data.

Figure A-3: Enterprise Miner's GUI facilitates the SEMMA Methodology

Data Exploration:

Distribution Explorer Node: allows the user to create multi-dimensional histograms, frequency distributions, and descriptive statistics.
Multiplot Node: automatically generates distribution plots for each variable as well as plots illustrating the relationship of each variable to the target.

Insight Node: designed for the exploration of data through graphs and analyses linked across multiple windows. The user can analyze univariate and multivariate distributions (including principal components) and create scatter plots, box plots, mosaic charts, and so on.

Association Node: computes item associations and sequences, e.g. items frequently purchased together by customers.

Variable Selection Node: assists the user in reducing the number of inputs by dropping those variables that are unrelated to the target, have a high percentage of missing values, and/or have determinative relationships in hierarchies.

Data Modification:

Data Set Attributes Node: changes the role of data sets and variables (e.g. from training to scoring, from target to input).

Transform Variables Node: helps the user create new variables and modify existing variables with a number of simple transformations, binning transformations, or best power transformations.

Filter Outliers Node: provides the capability of automatically eliminating rare or extreme values based on a wide range of criteria.

Data Replacement Node: provides replacement of missing values, but also of selected non-missing values, with a wide range of options: mean, median, distribution-based statistics, or imputation using tree-based methods.

Cluster Analysis Node: provides k-means clustering to partition the data into clusters suggested by the data. An optimal number of clusters can be determined.

SOM/Kohonen Node: enables the user to create Kohonen networks, Self-Organizing Maps (SOM), and VQ networks.

Data Modeling:

Regression Node: enables the user to fit linear and logistic regression models to a predecessor data set in an Enterprise Miner process flow. Data interactions and forward, backward, and stepwise selection methods are supported.

Decision Tree Node: enables the user to create decision trees that classify records based on the value of a target variable, predict outcomes for interval targets, or predict the appropriate decision when decision alternatives are specified.

Neural Network Node: uses flexible modeling techniques to detect sophisticated non-linear relationships in data. The Neural Network node supports generalized linear models, multi-layer perceptrons, and a range of radial basis function architectures.

User Defined Model Node: enables the user to incorporate models built outside of Enterprise Miner and compare them to models generated by Enterprise Miner.

Ensemble Node: combines predictions from any of the models above in an optimal way.

Assessment:

Assessment statistics are automatically computed when a model is trained in one of the modeling nodes.

The Assessment Node: provides a common framework to compare the models generated by the modeling nodes. The common criterion for all modeling nodes is a comparison of the expected to the actual profits obtained from the model results. Visualization of results is given by lift charts, profit charts, ROI charts, and interactive profit/loss assessment charts.

The Score Node: scoring is the process of applying the chosen model to produce predictions for each case in a new data set that may not contain a target. These can be new customers, transactions to be checked for fraud, etc.

The Reporter Node: assembles the results from an Enterprise Miner process flow analysis into an HTML report that can be viewed with a web browser.
Utilities:
In order to make the data mining process even easier to manage, Enterprise Miner provides several utility nodes that add extra functionality. These are:

The SAS Code Node: extends the functionality of Enterprise Miner to allow the inclusion of any other functionality available in the SAS System.

The Data Mining Database Node: creates a data mining database (DMDB), which contains metadata and enhances performance by reducing the number of data passes necessary for data mining.

The Group Processing Node: enables the user to define group variables, such as gender, to obtain separate analyses for each level of the grouping variable(s). It also enables the user to perform bootstrapping, bagging, and boosting.

Control Point Node and Subdiagram Node: used to simplify complex process flow diagrams.

Contact Information:
SAS Institute (Canada)
BCE Place, 181 Bay Street, 22nd Floor
Toronto, Ontario M5J 2T3
(416) 363-4424
www.sas.com