Investigative Data Mining in Fraud Detection


Investigative Data Mining in Fraud Detection

Chun Wei Clifton Phua, BBusSys

A thesis submitted in partial fulfilment of the requirements for the Degree of Bachelor of Business Systems (Honours)

School of Business Systems
Monash University
November 2003

Chun Wei Clifton Phua 2003

Declaration

I, Chun Wei Clifton Phua, declare that this thesis contains no material that has been accepted for the award of any other degree or diploma in any university or other institution. To the best of my knowledge and belief, this thesis contains no material previously published or written by another person, except where due reference is made in the text.

3rd November 2003

Acknowledgements

My supervisor, Dr. Damminda Alahakoon, has been very encouraging and helpful. I believe that he has more patience than most people I know. My deepest appreciation to him for ensuring that resources were available for me, and for consistently giving me sound, valuable advice. The School of Business Systems, the Faculty of Information Technology and Monash University have given me excellent financial aid, opportunities and resources. In particular, I want to express my utmost gratitude to Dr. Dineli Mather. Through her recommendation, I obtained special departmental help before commencing my honours year. Without her, this thesis would never have started. Other departmental staff members, especially Assoc. Prof. Kate Smith, Dr. Leonid Churilov, Dr. Ai Cheo Yeo, and Ms. Fay Maglen, have extended important help in my honours year. My appreciation goes to both the Automated Learning Group (ALG) at the National Center for Supercomputing Applications (NCSA) and the Angoss Software Corporation for providing their Data to Knowledge™ (D2K) and KnowledgeSeeker™ IV tools for free. I also extend my appreciation to my reliable friends, Ellyne, Edwin, Sandy, and others who gave careful and useful comments on the thesis drafts. To my excellent study partner and confidante, Sheila, I offer my heartfelt thanks for always sharing my ups and downs since my first day at Monash. To the Monash International Student Navigators, Kean, Kar Wai, Hanny and many others, I am grateful for all their cherished friendship and kindness. To close friends from Monash College, Kenneth, Lynnette, Kelly and others, I often looked forward to our occasional feasts. To my fellow tutors, Mary, Nic, Prasanna and others, I had the pleasure of working with and learning from all of them. To my honours gang, Ronen, Cyrus, Nelson, and others, I truly enjoyed our interesting discussions and nights of work together. I wish every person mentioned here happiness.

To my brother, Yat, I am delighted to be the beneficiary of his wonderful gifts over the years and have to commend him for spending more than five hours trying to decipher this dissertation. My uncle, Long, and aunties, Patricia and Elsie, welcomed me into their homes, where I stayed for about three years. I am indebted to them. Without them, living in Melbourne would have been financially impossible. My parents made many personal sacrifices for me to get an overseas education but have never put any kind of pressure on me. At this point, I hope that enough has been done in my studies to make them feel proud, not only of me, but also of themselves.

Abstract

The purpose of this dissertation is to determine the most appropriate data mining methodology, methods, techniques and tools to extract knowledge or insights from enormous amounts of data to detect white-collar crime. Fraud detection in automobile insurance is used as the application domain. The focus is on overcoming the technical problems and alleviating the practical problems of data mining in fraud detection. The technical obstacles are due to imperfect and highly skewed data, and hard-to-interpret predictions. The practical barriers are caused by the dearth of domain knowledge, many evolving fraud patterns, and the weaknesses in some evaluation metrics. The problem-solving approach is to integrate database, machine learning, neural networks, data visualisation, statistics, and distributed data mining techniques and tools into the crime detection system, which is based on the CRoss-Industry Standard Process for Data Mining (CRISP-DM) methodology. The crime detection method utilised the naive Bayesian, C4.5, and backpropagation learning algorithms and the Self Organising Map; classification and clustering visualisations; t-tests and hypothesis tests; and bagging and stacking to construct predictions and descriptions of fraud. The results on the automobile insurance data set confirm the effectiveness of the crime detection system. Specifically, the stacking-bagging experiment achieved higher cost savings than the other nine experiments, and the most statistically significant insight is the discovery of 21 to 25 year old fraudsters who use sports cars as their crime tool. This thesis demonstrates that the crime detection system has the potential to significantly reduce losses from illegitimate behaviour.

Table of Contents

Acknowledgements
Abstract
Table of Contents
List of Tables
List of Figures

CHAPTER 1 INTRODUCTION
1.1 INVESTIGATIVE DATA MINING
1.2 FRAUD DETECTION PROBLEMS
1.3 OBJECTIVES
1.4 SCOPE
1.5 CONTRIBUTIONS
1.6 OUTLINE

CHAPTER 2 BACKGROUND
2.1 EXISTING FRAUD DETECTION METHODS
2.1.1 Insurance Fraud
2.1.2 Credit Card Fraud
2.1.3 Telecommunications Fraud
2.1.4 Analysis of Methods
2.2 THE NEW INVESTIGATIVE DETECTION METHOD
2.2.1 Precogs
2.2.2 Integration Mechanisms
2.2.3 Analytical Machinery
2.2.4 Visual Symbols
2.2.5 Analysis of New Method
2.3 SUPPORTING CRIMINAL DETECTION TECHNIQUES
2.3.1 Bayesian Belief Networks
2.3.2 Decision Trees
2.3.3 Artificial Neural Networks
2.3.4 Analysis of Techniques
2.4 EVALUATION AND CONFIDENCE

CHAPTER 3 THE CRIME DETECTION METHOD
3.1 STEP ONE: CLASSIFIERS AS PRECOGS
3.1.1 Naive Bayesian Classifiers
3.1.2 C4.5 Classifiers
3.1.3 Backpropagation Classifiers
3.1.4 Analysis of Algorithms
3.2 STEP TWO: COMBINING OUTPUT AS INTEGRATION MECHANISMS
3.2.1 Cross Validation
3.2.2 Bagging
3.2.3 Stacking
3.3 STEP THREE: CLUSTER DETECTION AS ANALYTICAL MACHINERY
3.3.1 Self Organising Maps
3.3.2 Analysis of Self-Organisation
3.4 STEP FOUR: VISUALISATION TECHNIQUES AS VISUAL SYMBOLS
3.4.1 Classification Visualisation
3.4.2 Clustering Visualisation
3.5 SUMMARY OF NEW METHOD

CHAPTER 4 IMPLEMENTING THE CRIME DETECTION SYSTEM: PREPARATION
4.1 PHASE ONE: PROBLEM UNDERSTANDING
4.1.1 Determine the Investigation Objectives
4.1.2 Assess the Situation
4.1.3 Determine the Data Mining Objectives
4.1.4 Produce the Project Plan
4.2 PHASE TWO: DATA UNDERSTANDING
4.2.1 Describe the Data
4.2.2 Explore the Data
4.2.3 Verify the Data Quality
4.3 PHASE THREE: DATA PREPARATION
4.3.1 Select the Data
4.3.2 Clean the Data
4.3.3 Format the Data
4.3.4 Construct the Data
4.3.5 Partition the Data

4.4 KEY ISSUES IN PREPARATION

CHAPTER 5 IMPLEMENTING THE CRIME DETECTION SYSTEM: ACTION
5.1 PHASE FOUR: MODELLING
5.1.1 Generate the Experiment Design
5.1.2 Build the Models
5.1.3 Assess the Models
5.2 PHASE FIVE: EVALUATION
5.2.1 Evaluate the Results
5.2.2 Review the Process
5.3 PHASE SIX: DEPLOYMENT
5.3.1 Plan the Deployment
5.3.2 Plan the Monitoring and Maintenance

CHAPTER 6 CONCLUSION
6.1 RECOMMENDATIONS
6.1.1 Overcome the Technical Problems
6.1.2 Alleviate the Practical Problems
6.2 FUTURE RESEARCH DIRECTIONS
6.2.1 Credit Application Fraud
6.2.2 Mass Casualty Disaster Management

APPENDICES
REFERENCES

List of Tables

Table 2.1: Features in Fraud Detection Methods
Table 2.2: Strengths and Weaknesses of Data Mining Techniques
Table 3.1: Extracting Rules from a Decision Tree
Table 3.2: Net Input and Output Calculations
Table 3.3: Calculation of the Error at each Neuron
Table 3.4: Calculations for Weight Updating
Table 3.5: Bagging Predictions from Different Algorithms
Table 3.6: Evaluation of Fraud Detection Capability of Algorithms within Clusters
Table 4.1: Costs of Predictions
Table 5.1: The Experiments Plan
Table 5.2: The Tests Plan
Table 5.3: Bagged Success Rates versus Averaged Success Rates
Table 5.4: Best Determining Training Set Distribution for Data Partitions
Table 5.5: Bagging versus Stacking versus Conventional Backpropagation
Table 5.6: Claim Handling Actions and Thresholds
Table 5.7: Rule Induction on Training Data Set
Table 5.8: Ranking of Experiments using Cost Model
Table 5.9: Descriptions of Existing Fraud
Table 5.10: Statistically Strong Indicators of Fraud

List of Figures

Figure 2.1: Predictions using Precogs, Analytical Machinery, and Visual Symbols
Figure 3.1: Steps in the Crime Detection Method
Figure 3.2: Fully Grown Decision Tree
Figure 3.3: Pruned Decision Tree
Figure 3.4: Multilayer Feedforward Neural Network
Figure 3.5: Building and Applying One Classifier using Data Partitions
Figure 3.6: Stacking Classifiers from Examples
Figure 3.7: Stacking Classifiers from Instances
Figure 3.8: The Self Organising Map
Figure 3.9: The Confusion Matrix
Figure 4.1: Phases of the Crime Detection System
Figure 4.2: Claim Trends by Month
Figure 4.3: Proportion of Fraud within the Vehicle Age Category
Figure 4.4: Proportion of Fraud within the Policy Holder Age Category
Figure 4.5: Dividing Data into 50/50 Distribution for Each Partition
Figure 5.1: Prioritisation of Fraud Data Instances in Organisations
Figure 5.2: Attribute Input Strength from the Backpropagation Algorithm
Figure 5.3: Claim Trends by Month for
Figure 5.4: Proportion of Fraud within a Vehicle Age Category for

Life is change, that is how it differs from the rocks, change is its very nature.
- John Wyndham, 1955, The Chrysalids

For my parents Chye Twee and Siok Moy

CHAPTER 1 INTRODUCTION

Ask me when fraud will stop and the answer is never. Fraud has become one of the constants of life.
- Frank Abagnale, 2001, The Art of the Steal: How to Protect Yourself and Your Business from Fraud

The world is overwhelmed with millions of inexpensive gigabyte disks containing terabytes of data. It is estimated that the data stored in all corporate and government databases worldwide doubles every twenty months. The types of data available, in addition to the size, are also growing at an alarming rate. Some relevant examples can put this situation into perspective: In the United States (US), DataBase Technologies (DBT) Online Incorporated holds four billion records used by law enforcement agencies. The Insurance Services Office Incorporated (ISO) claim search database contains over nine billion US claim records, with over two billion claim records added annually (Converium, 2002). In Australia, the Insurance Reference Service (IRS) industry database consists of over thirteen million individual insurance claim records (Converium, 2002). The Law Enforcement Assistance Program (LEAP) database for the Victorian police in Australia details at least fourteen million vehicle, property, address and offender records, with at least half a million criminal offences and incidents added annually (Mickelburough and Kelly, 2003). Private firms sell all types of data on individuals and companies, often in the form of demographic, real estate, utility usage, telecom usage, automobile, credit, criminal, government, and Internet data (Infoglide Software Corporation, 2002). This results in a data rich but information poor situation: there is a widening gap between the explosive growth of data and its types, and the ability to analyse and interpret it. Hence there is a need for a new generation of automated and intelligent tools and techniques (Fayyad et al, 1996), known as

investigative data mining, to look for patterns in data. These patterns can lead to new insights, competitive advantages for business, and tangible benefits for society. This chapter introduces investigative data mining in Section 1.1 and outlines the research problems in Section 1.2. While Section 1.3 covers the objectives, Section 1.4 defines the scope of the research. A summary of the research contributions is provided in Section 1.5. Section 1.6 concludes with the outline of the other chapters in this thesis.

1.1 INVESTIGATIVE DATA MINING

Data mining is the process of discovering, extracting and analysing meaningful patterns, structure, models, and rules from large quantities of data (Berry and Linoff, 2000). The process (see Appendix B) is automatic or semi-automatic, with interactive and iterative steps such as problem and data understanding, data selection, data preprocessing and cleaning, data transformation, incorporation of appropriate domain knowledge to select the data mining task and algorithm, application of data mining algorithm(s), and knowledge interpretation and evaluation. The last step is either refinement by modifications or consolidation of the discovered knowledge (Fayyad et al, 1996). Discovered patterns, or insights, should be statistically reliable, not previously known, and actionable (Elkan, 2001). The data mining field spans several research areas (Cabena et al, 1998) with stunning progress over the last decade. Database theories and tools provide the necessary infrastructure to store, access and manipulate data. Artificial intelligence research such as machine learning and neural networks is concerned with inferring models and extracting patterns from data. Data visualisation examines methods to easily convey a summary and interpretation of the information gathered. Statistics is used to support or refute hypotheses on collected data and to control the chances and risks that must be considered upon making generalisations. Distributed data mining deals with the problem of learning useful new information from large and inherently distributed databases where multiple models have to be combined.

The most common goal of business data mining applications is to predict customer behaviour. However, this can easily be tailored to meet the objective of detecting and preventing criminal activity. It is almost impossible for perpetrators to exist in this modern era without leaving behind a trail of digital transactions in databases and networks (Mena, 2003b). Therefore, investigative data mining is about systematically examining, in detail, hundreds of possible data attributes from such diverse sources as law enforcement, industry, government, and private data provider databases. It is also about building upon the findings, results and solutions provided by the database, machine learning, neural networks, data visualisation, statistics, and distributed data mining communities, to predict and deter illegitimate activity.

1.2 FRAUD DETECTION PROBLEMS

The Oxford English Dictionary defines fraud as criminal deception, the use of false representations to obtain an unjust advantage, and to injure the rights and interests of another. Abagnale (2001) explains that technological advancements like the Internet make it easier to commit fraud. The same author also states that the situation is exacerbated by little legal deterrence (as courts rarely put fraudsters in jail) and by societal indifference to white-collar crime. Fraud takes many diverse forms and is extremely costly to society. It can be classified into three main types, namely, against organisations, government and individuals (see Appendix C). This thesis focuses on fraud against organisations. KPMG's (2002a; 2002b) fraud surveys of Australian and Singaporean organisations detail the severity of scams committed by external parties, internal management, and non-management employees. In the Australian fraud survey, about one third of fraud losses are caused by external parties like customers, service providers and suppliers. Data mining can minimise some of these losses because organisations hold massive collections of customer data. Insurance fraud is shown to be the most critical area, followed by credit card fraud, cheque forgery,

and telecommunications fraud (see Appendix C). Within insurance fraud, automobile, travel, household contents and workers compensation insurance fraud are more common (Baldock, 1997; O'Donnell, 2002). Recently, automobile insurance fraud alone cost about AUD$32 million for nine insurance companies in Australia (KPMG, 2002a) and SGD$20 million for the industry in Singapore (Tan, 2003). Fraud detection poses some technical and practical problems for data mining; the most significant technical problem is due to limitations, or poor quality, in the data itself. The data is usually collected as a by-product of other tasks rather than for the purpose of fraud detection. Although one form of data collection standard for fraud detection has recently been introduced (National Association of Insurance Commissioners, 2003), not all data attributes are relevant for producing accurate predictions and some attribute values are likely to contain data errors. Another crucial technical dilemma is the highly skewed data in fraud detection. Typically there are many more legitimate than fraudulent examples. This means that by predicting all examples to be legal, a very high success rate is achieved without detecting any fraud. Another negative consequence of skewed data is the higher chance of overfitting the data. Overfitting occurs when a model's high accuracy arises from fitting patterns in the training set that are not statistically reliable and not present in the score set (Elkan, 2001). Another major technical problem involves finding the best ways to make predictions more understandable to data analysts. The most important practical problem is a lack of domain knowledge, or prior knowledge, which reveals information such as the important attributes, the likely relationships and the known patterns. With some of the domain knowledge described in this and the following paragraph, the search time for the data mining process can be reduced.
Basically, fraud detection involves discovering three

profiles of fraud offenders (see Appendix C), each with constantly evolving modus operandi (Baldock, 1997). Average offenders can be of any gender or socio-economic group and they commit fraud when there is opportunity, sudden temptation, or when suffering from financial hardship. Criminal offenders are usually males with criminal records. Organised crime offenders are career criminals who are part of organised groups which are prepared to contribute a considerable amount of time, effort and resources to perpetrate major and complex fraud. Using learning algorithms in data mining to recognise a great variety of fraud scenarios over time is a difficult undertaking. Fraud committed by average offenders is known as soft fraud, which is the hardest to mitigate because the investigative cost for each suspected incident is usually greater than the cost of the fraud (National White Collar Crime Center, 2003). Fraud perpetrated by criminal and organised crime offenders is termed hard fraud; it circumvents anti-fraud measures and approximates many legal forms (Sparrow, 2002). The next practical problem is assessing the potential for significant impact of using data mining in fraud detection. Success cannot be defined in terms of predictive accuracy because of the skewed data.

1.3 OBJECTIVES

The main objective of this study is to survey and evaluate methods and techniques to solve the six fraud detection problems outlined in Section 1.2. Classification, cluster detection, and visualisation techniques are carefully examined and selected. These techniques are then hybridised into a method, envisioned in the science fiction book Minority Report (Dick, 1956). It has the potential to allow data analysts to better understand the classifiers and their predictions. The method is incorporated, together with a cost model, statistics, and domain knowledge, into a comprehensive data mining system to eliminate the fraud detection problems.
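The evaluation pitfalls noted in Section 1.2 (a misleadingly high success rate on skewed data, and the inadequacy of predictive accuracy as a success measure) can be made concrete with a small sketch. The class ratio and dollar figures below are invented for illustration and are not taken from the thesis data:

```python
# Hypothetical skewed data set: 6 fraudulent and 94 legitimate claims.
labels = ["fraud"] * 6 + ["legal"] * 94

# A trivial classifier that predicts every claim to be legitimate.
predictions = ["legal"] * len(labels)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.94 -- looks impressive, yet no fraud is detected

# A simple cost model gives a more honest picture (both costs are assumed):
AVG_FRAUD_PAYOUT = 2500    # average payout per detected fraudulent claim
INVESTIGATION_COST = 200   # cost of investigating one flagged claim

hits = sum(p == "fraud" and t == "fraud" for p, t in zip(predictions, labels))
flagged = sum(p == "fraud" for p in predictions)
savings = hits * AVG_FRAUD_PAYOUT - flagged * INVESTIGATION_COST
print(savings)  # 0 -- the 94% "accurate" classifier saves nothing
```

This is why the thesis prefers a cost-based measure over raw predictive accuracy for skewed fraud data.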

1.4 SCOPE

This thesis builds upon some well-known methods and techniques chosen from insurance, credit card, and telecommunications fraud detection. It concentrates solely on the naive Bayesian, C4.5, and backpropagation algorithms to generate classifiers which make predictions. To simplify the use of ensemble mechanisms on fraud detection classifiers, only the bagging and stacking strategies are applied to the classifier predictions. The Self Organising Map (SOM) and two-dimensional visualisation techniques, such as the confusion matrix and the naive Bayesian visualisation, are chosen to analyse and interpret the data and classifier predictions. The other important outputs include scores and rules. A simple cost model is preferred over a cost-sensitive algorithm. This research demonstrates the crime detection system's capability on an automobile insurance data set. In reality, data mining is only part of the solution to fraud detection. Process re-engineering (FairIsaac, 2003b), manual reviews by fraud specialists, interactive rule-based expert systems (Magnify, 2002b), and link analysis (Mena, 2003b) are also essential but beyond the scope of this thesis. Although this research uses the crime detection system for fraud, most of the system and its techniques are applicable to non-crime and other types of crime data.

1.5 CONTRIBUTIONS

The main thesis contribution is the creation of the new crime detection method to predict and describe criminal patterns from data: The innovative use of naive Bayesian, C4.5, and backpropagation classifiers to process the same partitioned numerical data increases the chances of getting better predictions. The selection of the best classifiers of different algorithms using stacking and the merger of their predictions using bagging produces better predictions than the single algorithm approach.
The SOM is introduced as a descriptive tool to understand particular characteristics of fraud within the clusters, and also as an evaluation tool to assess the algorithms' ability to cope with

evolving fraud. The visualisation of model predictions and results is expressed with the confusion matrix, naive Bayesian visualisation, column graph, decision tree visualisation, cumulative lift charts, radar graphs, scores, and rules. The other significant thesis contributions include: The introduction of the crime detection system, through modifying the CRISP-DM methodology, which provides a comprehensive and step-by-step approach to preparing for and carrying out data mining. The adoption of a cost model which allows a realistic measure of monetary cost and benefit within a data mining project. The demonstration of visualisations as an important part of attribute exploration before and after model building. The strong emphasis on statistics to objectively evaluate algorithms, classifiers, and cluster descriptions, which prevents inaccurate predictive accuracy estimates and poor model interpretation. The justification for using the score-based feature in fraud detection over the rule-based feature. The extensive literature review conducted on existing fraud detection methods, techniques and tools to seek out the best data mining practices; this resulted in the integration of methods and techniques across insurance, credit card, and telecommunications fraud detection. The in-depth analysis of the naive Bayesian, C4.5, and backpropagation algorithms at the practical and conceptual levels to overcome their specific limitations. The critical issues of overfitting and missing values were addressed.
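The prediction-merging idea behind these contributions can be sketched minimally as majority voting over the outputs of classifiers built by different algorithms. The function name and the toy predictions below are hypothetical; the thesis's actual stacking-bagging procedure is more involved than this:

```python
from collections import Counter

def bag(predictions_per_classifier):
    """Majority-vote ('bagged') prediction for each example,
    given one prediction list per classifier."""
    combined = []
    for votes in zip(*predictions_per_classifier):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three classifiers (e.g. naive Bayesian,
# C4.5, backpropagation) on four insurance claims:
nb  = ["fraud", "legal", "legal", "fraud"]
c45 = ["fraud", "fraud", "legal", "legal"]
bp  = ["legal", "fraud", "legal", "fraud"]

print(bag([nb, c45, bp]))  # ['fraud', 'fraud', 'legal', 'fraud']
```

With three voters and two classes there is always a clear majority, which is why an odd number of classifiers is convenient for this kind of combination.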

1.6 OUTLINE

Chapter 2 contains existing fraud detection methods and techniques, the new crime detection method, the recommended three classification techniques, and the relevant statistical concepts. Chapter 3 focuses on the strengths and limitations of the three classification algorithms, two ensemble mechanisms, the role of self organisation to describe fraud and algorithm performance, and the relevant visualisation techniques. A summary of the new crime detection method is given. Chapter 4 prepares the data using the crime detection system phases of problem understanding, data understanding, and data preparation. Essential issues in preparation are highlighted. Chapter 5 applies the crime detection system phases of modelling, evaluation, and deployment to the prepared data. Chapter 6 concludes with the summary of the research, recommendations for the research problems, and possible directions for future research.

CHAPTER 2 BACKGROUND

The Precrime System, the prophylactic pre-detection of criminals through the ingenious use of mutant precogs, capable of previewing future events and transferring orally that data to analytical machinery...
- Philip K. Dick, 1956, Minority Report

Studies have shown that detecting clusters of crime incidents (Levine, 1999) and finding possible cause/effect relations with association rule mining (Estivill-Castro and Lee, 2001) are important to criminal analysis. Yet classification techniques have also proven highly effective in fraud detection (He et al, 1998; Chan et al, 1999) and can be used to categorise future crime data and to provide a better understanding of present crime data. This chapter describes and evaluates the existing fraud detection methods and techniques in Section 2.1. The need for a new and innovative method for crime detection is introduced and discussed in Section 2.2. The choice of three classification techniques in crime detection is justified in Section 2.3. Finally, Section 2.4 concludes the chapter by specifying the statistical tests that must be used to evaluate the performance of classifiers.

2.1 EXISTING FRAUD DETECTION METHODS

This section concentrates on the analysis of present data mining methods applied specifically to the data-rich areas of insurance, credit card, and telecommunications fraud detection. A brief description of each method and its applications is given and some commercial fraud detection software is examined. The methods are critically evaluated to determine which is more widely used in fraud detection.

2.1.1 Insurance Fraud

Ormerod et al (2003) recommend the use of dynamic real-time Bayesian Belief Networks (BBNs), named the Mass Detection Tool (MDT), for the early detection of potentially fraudulent claims, whose output is then used by a rule generator named the Suspicion Building Tool (SBT). The weights of the BBN are

refined by the rule generator's outcomes, and claim handlers have to keep pace with evolving frauds. This approach evolved from ethnographic studies of large insurance companies and loss adjustors who argued against the manual detection of fraud by claim handlers. The hot spot methodology (Williams and Huang, 1997) applies a three step process: the k-means algorithm for cluster detection, the C4.5 algorithm for decision tree rule induction, and domain knowledge, statistical summaries and visualisation tools for rule evaluation. It has been applied to detect health care fraud by doctors and the public for the Australian Health Insurance Commission. Williams (1999) has expanded the hot spot architecture to use genetic algorithms to generate rules and to allow the domain user, such as a fraud specialist, to explore the rules and to allow them to evolve according to how interesting the discovery is. Brockett et al (1998) presented a similar methodology utilising the SOM for cluster detection before backpropagation neural networks in automobile injury claims fraud. The use of supervised learning with backpropagation neural networks, followed by unsupervised learning using the SOM to analyse the classification results, is recommended by He et al (1998). Results from clustering show that, out of the four output classification categories used to rate medical practice profiles, only two of the well defined categories are important. Like the hot spot methodology, this innovative approach was applied to instances of the Australian Health Insurance Commission's health practitioners' profiles. Von Altrock (1995) suggested a fuzzy logic system which incorporates the actual fraud evaluation policy using optimum threshold values. It outputs the degree of likelihood of fraud and gives reasons why an insurance claim is possibly fraudulent.
Through experimentation with one thousand two hundred insurance claims that belonged to an anonymous company, the results showed that the fuzzy logic system predicted marginally better than the experienced auditors.
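The hot spot methodology's first step, k-means cluster detection, can be sketched in a toy one-dimensional form. The claim amounts, the choice of k, and the function below are invented for illustration; real applications cluster multi-dimensional claim attributes:

```python
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """Toy one-dimensional k-means: alternate between assigning each value
    to its nearest centre and moving each centre to its cluster's mean."""
    random.seed(seed)
    centres = random.sample(values, k)  # pick k distinct starting centres
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to its cluster mean (keep it if the cluster emptied).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# Hypothetical claim amounts: mostly small claims plus a 'hot spot' of
# suspiciously large ones.
amounts = [120, 150, 130, 110, 140, 5000, 5200, 4900]
print(kmeans_1d(amounts, k=2))
```

On this well-separated toy data the two centres settle near 130 and about 5033, isolating the large-claim hot spot; a rule inducer such as C4.5 would then be run within each cluster.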

Cox (1995) proposed another fuzzy logic system which uses two different approaches to mimic the common-sense reasoning of fraud experts: the discovery model and the fuzzy anomaly-detection model. The first uses an unsupervised neural network to learn the natural relationships in the data and to derive significant clusters. A neuro-fuzzy classification system is then used to identify patterns within the clusters. The second uses the Wang-Mendel algorithm to generate a fuzzy model. It was primarily applied to search for and explain the reasons for health care providers committing fraud against insurance companies. The EFD system (Major and Riedinger, 1995) is an expert system in which expert knowledge is integrated with statistical information assessment to identify providers whose behaviour does not fit the norm. It also searches the power set of heuristics to find better classification rules. EFD has been used to detect insurance fraud in twelve US cities. FraudFocus Software (Magnify, 2002a) automatically scores all claims and continually rescores them as more are added. Claims are prioritised in descending order of fraud potential to get relevant attention, and descriptive rules are generated for fraudulent claims. SAS Enterprise Miner Software (SAS e-intelligence, 2000) depends on association rules, cluster detection and classification techniques to detect fraudulent claims. It compares the expected results with the actual results so that large deviations can be further investigated. One of its successful applications saved four million dollars for an anonymous US health insurer with over four million members.

2.1.2 Credit Card Fraud

The BBN and Artificial Neural Network (ANN) comparison study (Maes et al, 2002) uses the STAGE algorithm for BBNs and the backpropagation algorithm for ANNs in fraud detection. Comparative results show that BBNs were more accurate and much faster to train, but BBNs are slower when applied to

new instances. Real world credit card data was used but the number of instances is unknown. The distributed data mining model (Chan et al, 1999) is a scalable, supervised black box approach that uses a realistic cost model to evaluate C4.5, CART, Ripper and naive Bayesian classification models. The results demonstrated that partitioning a large data set into smaller subsets to generate classifiers using different algorithms, experimenting with fraud/legal distributions within training data, and using stacking to combine multiple models significantly improves cost savings. This method was applied to one million credit card transactions from two major US banks, Chase Bank and First Union Bank. The neural data mining approach (Brause et al, 1999) uses generalised rule-based association rules to mine symbolic data and Radial Basis Function neural networks to mine analog data. It has found that using supervised neural networks to check the results of association rules increases the predictive accuracy. The source of the credit card data was unknown and over fifty thousand transactions were used for training. The credit fraud model (Groth, 1998) recommends a classification approach when there is a fraud/legal attribute, or a clustering followed by a classification approach if there is no fraud/legal attribute. The HNC (now known as FairIsaac) Falcon Fraud Manager Software (Weatherford, 2002) recommends backpropagation neural networks for detecting fraudulent credit card use.

2.1.3 Telecommunications Fraud

The Advanced Security for Personal Communications Technologies (ASPECT) research group (Weatherford, 2002) focuses on neural networks, particularly unsupervised ones, to train legal current user profiles that store recent user information and user profile histories that store long term information to define normal patterns of use. Once trained, fraud is highly probable when there is a

difference between a mobile phone user's current profile and the profile history. Cahill et al (2002) build upon the adaptive fraud detection framework (Fawcett and Provost, 1997) by using an event-driven approach of assigning fraud scores to detect fraud as it happens, and weighting recent mobile phone calls more heavily than earlier ones. The Cahill et al (2002) framework can also detect types of fraud using rules, in addition to detecting fraud in each individual account, from large databases. This framework has been applied to both wireless and wireline fraud detection systems with over two million customers. The adaptive fraud detection framework presents rule-learning fraud detectors based on account-specific thresholds that are automatically generated for profiling the fraud in an individual account. The system based on the framework has been applied, by combining the most relevant rules, to uncover fraudulent usage that is added to the legitimate use of a mobile phone account (Fawcett and Provost, 1996; Fawcett and Provost, 1997).

Analysis of Methods

Table 2.1 illustrates the features of fraud detection methods used by these three fraud types in research and practice. Supervised learning is an approach in which the algorithm's answer to each input pattern is directly compared with the known desired answer, and feedback is given to the algorithm to correct possible errors. On the other hand, unsupervised learning is an approach that teaches the algorithm to discover, by itself, correlations and similarities among the input patterns of the training set and to group them into different clusters. There is no feedback from the environment with which to compare answers. The score-based approach uses numbers within a specified range, which indicate the relative risk that a particular instance may be fraudulent, to rank instances.
The rule-based approach uses rules that are expressions of the form Body → Head, where Body describes the conditions under which the rule applies and Head is typically a class label.
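The Body → Head rule form described above can be sketched with a small helper; the attribute names, rule list and default class below are hypothetical illustrations, not drawn from any particular fraud detection system.

```python
# Minimal sketch of the Body -> Head rule form: Body is a conjunction of
# attribute tests, Head is a class label. Attribute names are hypothetical.

def matches(body, instance):
    """Return True if every condition in the rule body holds for the instance."""
    return all(instance.get(attr) == value for attr, value in body.items())

def classify(rules, instance, default="legal"):
    """Apply rules in order; the first rule whose body matches supplies the head."""
    for body, head in rules:
        if matches(body, instance):
            return head
    return default

rules = [
    ({"fault": 1, "is_holidayweek_claim": 0}, "fraud"),  # Body -> Head
    ({"fault": 0}, "legal"),
]

print(classify(rules, {"fault": 1, "is_holidayweek_claim": 0}))  # fraud
```

A real rule learner would induce the rule list from data; the point here is only the representation of a rule as a conjunctive body paired with a class head.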

Table 2.1 depicts supervised learning as a more popular approach than unsupervised learning, and the score-based approach as more widely used in credit card fraud than the rule-based approach. All the research described before uses some form of supervised learning to detect fraud. For unsupervised learning, Silipo (2003) points out that the organisation of input space and the dimensionality reduction provided by unsupervised learning algorithms would accelerate the supervised learning process. Berry and Linoff (2000) also advocate the use of unsupervised learning to find new clusters and new insights which can improve the supervised learning results. Indeed, much of the research outlined above first utilised unsupervised learning to derive clusters, then used supervised learning approaches to obtain scores or rules from each cluster. The only exception was He et al.'s (1998) work, which used unsupervised learning subsequently to improve the initial supervised learning results. Only one of the seventeen fraud detection methods, the SAS Enterprise Miner Software, allows the use of all the supervised learning, unsupervised learning, score-based and rule-based approaches.

Table 2.1: Features in Fraud Detection Methods

Main Fraud Types            Supervised Learning   Unsupervised Learning   Score-Based   Rule-Based
Insurance Fraud             100%                  56%                     67%           67%
Credit Card Fraud           100%                  20%                     80%           50%
Telecommunications Fraud    100%                  33%                     67%           67%
All                         100%                  41%                     71%           59%

2.2 THE NEW INVESTIGATIVE DETECTION METHOD

This section proposes a different but non-trivial method of detecting crime based partially on Minority Report (Dick, 1956). The idea is to simulate the book's Precrime method of precogs, integration mechanisms, analytical machinery, and visual symbols with existing data mining methods and techniques. An overview of how the new investigative detection method can be used to predict crime, and its advantages and disadvantages, are discussed.

[Figure 2.1: Predictions using Precogs, Analytical Machinery, and Visual Symbols. Three precogs P1 = L1(D), P2 = L2(D) and P3 = L3(D) are trained on the examples and instances D and output predictions; the main predictions are combined, or fed back into one precog, to derive final predictions; a clustering learner CL = L4(D) forms the analytical machinery; graphs, scores and rules form the visual symbols.]

Precogs

Precogs, or precognitive elements, are entities that have the knowledge to predict that something will happen. Figure 2.1 uses three precogs to foresee and prevent crime by stopping potentially guilty criminals (Dick, 1956, p19). Unlike the human mutant precogs (Dick, 1956, p3), each precog contains multiple classification models, or classifiers, trained with one data mining technique (see Section 2.3) in order to extrapolate the future. This is a top-down approach (Berry and Linoff, 2000) because the organisation aims to predict certain types of crime. The three precogs proposed here differ from each other in that they are trained by different data mining algorithms. For example, the first, second, and third precogs are trained using the naive Bayesian, C4.5 and backpropagation algorithms respectively. They require numerical inputs of past examples to output corresponding class predictions for new instances.

Integration Mechanisms

Figure 2.1 shows that as each precog outputs its many predictions for each instance, all are counted and the class with the highest tally is chosen as the main prediction (Breiman, 1994). The main

predictions can be combined either by majority count, or the predictions can be fed back into one of the precogs (Wolpert, 1992) to derive a final prediction.

Analytical Machinery

Dick's (1956, p3) analytical machinery is made up of three computers that record, study, compare, and represent the precogs' predictions in easily understood terms. The first two computers are of similar models. However, if both produce different conclusions about the final predictions on each instance, the third, by means of statistical analysis, is then used to check the results of the other two (Dick, 1956, p20). Instead of three computers, the analytical machinery is simplified by emulating one computer with one type of unsupervised learning, the SOM, for grouping similar data into clusters. This is a bottom-up approach (Berry and Linoff, 2000) that allows the data to speak for itself in patterns. The data analyst assesses the performance of the classifiers within each cluster and decides which crime patterns are important.

Visual Symbols

Dick's (1956, p5) analytical machinery produces, in word form, details of a crime based on the final prediction. Instead, graphical visualisations, numerical scores, and descriptive rules about the final prediction are used. They are important for explaining and understanding the main and final predictions.

Analysis of New Method

Figure 2.1 demonstrates that, to effectively and efficiently predict future crime, supervised and unsupervised learning techniques have to be used concurrently. Unseen numerical crime data that has no class label is simultaneously fed into the trained precogs and analytical machinery. One of the precogs identifies which attributes have the most predictive value after being trained and chooses a subset of attributes for the analytical machinery from which to generate visual symbols. The analytical machinery groups the instances into clusters. At the same time, each precog produces its predictions

on an instance, which are then combined into a main prediction or, by one of the three precogs, into a final prediction. The main predictions and final prediction of each instance are then appended to the clustered instances. From this, visual symbols are generated to explain the precogs' predictions, and the most relevant visual symbols are kept and analysed. The black box approach of using precogs to generate predictions has been transformed into a semi-transparent approach by using analytical machinery to analyse and interpret the results. Visual symbols, scores, and rules help the data analysts to comprehend the predictions faster. The amount of effort required for data preparation is reduced, as the same numerical instances are used as inputs for all three different precogs and for the analytical machinery. Considerable time is saved because predictions and clusters are simultaneously computed. The key advantage lies in the fact that precogs can be shared between organisations to increase the accuracy of predictions without violating competitive and legal requirements. However, the problems of overfitting and missing values must be addressed, and the predictive accuracy of the precogs must be compared, based on statistical concepts, before this method can be really effective.

2.3 SUPPORTING CRIMINAL DETECTION TECHNIQUES

This section focuses on the directed or supervised data mining approach to crime detection using BBNs, decision trees and ANNs. A brief description is provided of each technique, its applications in combating crime, and the use of its most widely used algorithm to classify a new instance. A summary of each technique's main advantages and disadvantages is presented.
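The majority-count combination of precog predictions described in Section 2.2 can be sketched as follows; the three votes shown are illustrative only.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class predictions from several classifiers (precogs):
    the class with the highest tally becomes the main prediction."""
    tally = Counter(predictions)
    return tally.most_common(1)[0][0]

# Three precogs vote on one instance; ties break in favour of the
# class that appeared first among the votes.
print(majority_vote(["fraud", "legal", "fraud"]))  # fraud
```

With an odd number of voters and two classes, a strict majority always exists, which is one reason an odd number of partitions is later preferred in Section 3.2.1.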
The benefits of using the three different techniques together on the same crime data are also highlighted.

Bayesian Belief Networks

BBNs, based on Bayes' (1763) theorem, provide a graphic model of causal relationships on which they predict class membership probabilities (Han and Kamber, 2000), so that a given instance is legal or fraudulent (Prodromidis, 1999). One type of Bayesian category, known as the naive Bayesian

classification, assumes that the attributes of an instance are independent of each other, given the target attribute (Minsky and Papert, 1969; Feelders, 2003). The main objective here is to assign a new instance to the class that has the highest posterior probability. Although the naive Bayesian algorithm is simple, it is very effective on many real-world data sets because it can give better predictive accuracy than well-known methods like C4.5 decision trees and backpropagation (Domingos and Pazzani, 1996; Elkan, 2001) and is extremely efficient in that it learns in a linear fashion, using ensemble mechanisms such as bagging and boosting to combine classifier predictions (Elkan, 1997). However, when attributes are redundant and not normally distributed, the predictive accuracy is reduced (Witten and Frank, 1999).

Decision Trees

Decision trees are machine learning techniques that express a set of independent attributes and a dependent attribute in the form of a tree-shaped structure that represents a set of decisions (Witten and Frank, 1999). Extracted from decision trees, classification rules are IF-THEN expressions in which the preconditions are logically ANDed together and all the tests have to succeed for each rule to be generated. Related applications range from the analysis of instances of drug smuggling, government financial transactions (Mena, 2003b), and customs declaration fraud (Shao et al, 2002), to more serious crimes such as drug-related homicides, serial sex crimes, stranger rapes (SPSS, 2003), and homeland security (James, 2002; Mena, 2003a). The main idea of using decision trees is to use C4.5 (Quinlan, 1993) to divide data into statistically significant segments based on the desired output and to generate graphic decision trees or descriptive classification rules that can be used to classify a new instance. C4.5 can help not only to make accurate predictions from the data but also to explain the criminal patterns in it.
It deals with the problems of numeric attributes, missing values, pruning, estimating error rates, complexity of decision tree induction, and generating rules from trees (Witten and Frank,

1999). In terms of predictive accuracy, C4.5 performs slightly better than CART and ID3 (Prodromidis, 1999). C4.5's successor, C5.0, shows marginal improvements to decision tree induction, but not enough to justify its use. The learning and classification steps of C4.5 are generally fast (Han and Kamber, 2000). However, scalability and efficiency problems, such as a substantial decrease in performance and poor use of available system resources, can occur when C4.5 is applied to large data sets.

Artificial Neural Networks

ANNs represent complex mathematical equations, with many summations, exponential functions, and many parameters, to mimic neurons in the human brain (Berry and Linoff, 2000). They have been used to classify crime instances such as burglary, sexual offences, and known criminals' facial characteristics (Mena, 2003b). An artificial neural network (Rosenblatt, 1958) is a set of connected input/output units in which each connection has an associated weight. The main objective is to use the backpropagation learning algorithm (Rumelhart and McClelland, 1986) to make the network learn by adjusting and finalising the weights so that it can be used to classify a new instance. Backpropagation neural networks can process a very large number of instances, have a high tolerance to noisy data, and have the ability to classify patterns on which they have not been trained (Han and Kamber, 2000). They are an appropriate choice for crime detection areas where the results of the model are more important than understanding how it works (Berry and Linoff, 2000).
However, backpropagation neural networks require long training times and extensive testing and retraining of parameters, such as the number of hidden neurons, learning rate and momentum, to determine the best performance (Bigus, 1996).

Analysis of Techniques

Table 2.2 illustrates that each technique is intrinsically different from the others, according to the evaluation criteria, and has its own strengths and weaknesses. Interpretability refers to how much a

domain expert or non-technical person can understand each of the model predictions through visualisations or rules. Effectiveness highlights the overall predictive accuracy and performance of each technique. Robustness assesses the ability to make correct predictions given noisy data or data with missing values. Scalability refers to the capability to construct a model efficiently given large amounts of data. Speed describes how fast a technique searches for the patterns that make up the model. By using the three techniques together on the same data, within the context of classification data analysis, their strengths can be combined and their weaknesses reduced. BBNs could be used for scalability and speed, decision trees for interpretability, and ANNs for effectiveness and robustness.

Table 2.2: Strengths and Weaknesses of Data Mining Techniques

Data Mining Techniques      Interpretability   Effectiveness   Robustness   Scalability   Speed
Bayesian Belief Networks    Good               Good            Good         Excellent     Excellent
Decision Trees              Excellent          Good            Good         Poor          Good
Neural Networks             Poor               Excellent       Excellent    Excellent     Poor

2.4 EVALUATION AND CONFIDENCE

Experiments described in this thesis split the main data set into a training data set and a scoring data set. The class labels of the training data are known, and the training data is historical compared to the scoring data. The class labels of the score data set are not known, and the score data set is processed by the classifiers for actual predictions. An evaluation is made to assess how each classifier C_i performs and to systematically compare one classifier with another (Witten and Frank, 1999). The true but unknown success rate, T, measures the proportion of successes, or correct predictions, made by classifiers on the score data set. T lies within a specified interval of width 2z, where z is measured in standard deviations from the mean, with a certain specified confidence.
Given x instances from the score data set and S_i successes, the observed success rate o_i of a classifier is o_i = S_i / x.
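A minimal sketch of computing the observed success rate and an interval for the true success rate T, assuming the usual normal approximation to the binomial; the z value of 2.58 (roughly 99% confidence) and the counts are illustrative.

```python
import math

def success_rate_interval(successes, x, z=2.58):
    """Observed success rate o = S/x and a normal-approximation interval
    for the true success rate T; z = 2.58 corresponds to ~99% confidence."""
    o = successes / x
    half_width = z * math.sqrt(o * (1 - o) / x)
    return o, (o - half_width, o + half_width)

# A classifier with 850 correct predictions out of 1000 score instances:
o, (low, high) = success_rate_interval(850, 1000)
print(round(o, 2), round(low, 3), round(high, 3))  # 0.85 0.821 0.879
```

The interval narrows as x grows, which is why larger score sets give more trustworthy accuracy estimates.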

Witten and Frank (1999) recommend the use of the paired Student's t-test with k−1 degrees of freedom to compare, with confidence, two learning algorithms L_A and L_B using the k-fold cross-validation method (see Section 3.2.1). It is applied to the differences of the observed success rates o_{A,i} and o_{B,i} of each pair of their classifiers C_{A,i} and C_{B,i} derived in each of the k runs, 1 ≤ i ≤ k. Given that o_i = o_{A,i} − o_{B,i} and \bar{o} = \bar{o}_A − \bar{o}_B, the null hypothesis H_0 states that L_A and L_B have the same success rate, that is, \bar{o} = 0. H_0 is rejected in favour of the alternate hypothesis H_1, which states that L_A and L_B have different performance with c confidence, if

t = \frac{\bar{o} \sqrt{k}}{\sqrt{\sum_{i=1}^{k} (o_i - \bar{o})^2 / (k-1)}}

is greater than t_{(1+c)/2, k−1} for the 2-tailed t-test, 0 < c < 1. For example, with 99% confidence and k = 11, the critical value t_{0.995, 10} is 3.17. According to Prodromidis (1999), the o_i from a classifier can be affected by factors such as the selection of training data, the random variation of score data, and the randomness within a data mining algorithm. When two classifiers, C_A and C_B, exhibit different o_A and o_B respectively on x, it may not mean that they have different predictive performances. In order to evaluate the results of C_A and C_B with confidence, Salzberg (1997), Prodromidis (1999) and Elkan (2001) suggest the use of McNemar's hypothesis test on the classifiers' results from x. Given that the number of instances correctly classified only by C_A is represented by s_A, and the number of instances correctly classified only by C_B is represented by s_B, the null hypothesis H_0 states that C_A and C_B have the same success rate, that is, s_A = s_B. H_0 is rejected in favour of the alternate hypothesis H_1, which states that C_A and C_B have different performances with c confidence, if

s = \frac{(|s_A - s_B| - 1)^2}{s_A + s_B}

is greater than \chi^2_{1, c}, 0 < c < 1.
For example, with 99% confidence, \chi^2_{1, 0.99} is equal to 6.63. The \chi^2_{1, c} represents a chi-squared distribution with 1 degree of freedom. The −1 in the formula is a continuity

correction for the fact that s_A and s_B are discrete while the chi-squared distribution is continuous. Elkan (2001) lamented the dearth of awareness and understanding among data mining researchers and practitioners regarding statistical significance. He argued that the problem lies in training data sets with small numbers of the rare class, such as fraudulent examples. To avoid inaccurate predictive accuracy estimates and poor model interpretation, this thesis aims to determine whether predictive relationships are statistically reliable in Section
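McNemar's test statistic with the continuity correction, as described above, can be sketched directly; the counts s_A = 40 and s_B = 15 are hypothetical.

```python
def mcnemar_statistic(s_a, s_b):
    """McNemar's chi-squared statistic with continuity correction:
    s_a and s_b count instances correctly classified only by C_A
    or only by C_B respectively."""
    return (abs(s_a - s_b) - 1) ** 2 / (s_a + s_b)

# With 99% confidence, reject H0 (equal success rates) if the statistic
# exceeds the chi-squared critical value with 1 degree of freedom (6.63).
stat = mcnemar_statistic(40, 15)
print(round(stat, 2), stat > 6.63)  # 10.47 True
```

Only the discordant counts enter the statistic; instances that both classifiers get right, or both get wrong, carry no information about which classifier is better.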

CHAPTER 3

To solve really hard problems, we will have to use several different representations... Each have domains of competence and efficiency, so that one may work where another fails.
- Marvin Minsky, 1990, from Logical vs. Analogical

THE CRIME DETECTION METHOD

The primary goals of investigative data mining are to predict and describe criminal patterns from observed data. According to Fayyad et al (1996), prediction involves using available data attributes to extrapolate values of other unknown data attributes, and description concentrates on finding human-interpretable patterns explaining the data. These two goals are achieved in steps which provide the core capability of generalising large numbers of specific facts into new knowledge, shown in Figure 3.1. This chapter scrutinises three learning algorithms in Section 3.1, using their classifiers to predict occurrences of fraud. Section 3.2 illustrates two ways of improving their predictions by integrating multiple classifiers. An ANN approach to clustering is presented as a new descriptive approach in Section 3.3. Section 3.4 chooses some existing visualisation techniques to describe the patterns, and Section 3.5 ends the chapter with an in-depth analysis of the overall crime detection method.

[Figure 3.1: Steps in the Crime Detection Method. 3.1 Step One: Classifiers; 3.2 Step Two: Combining Output; 3.3 Step Three: Cluster Detection; 3.4 Step Four: Visualisation Techniques]

3.1 STEP ONE: CLASSIFIERS AS PRECOGS

This section applies each algorithm to small data sets to show its importance in fraud detection. In progressive steps, the likely problems in each algorithm are explained and effective solutions to overcome them are proposed. The advantages and disadvantages of using the three algorithms in fraud detection are also presented.

Naive Bayesian Classifiers

The naive Bayesian classifier learns according to the algorithm in Appendix G, using the training data in Appendix D. The classifier has to predict the class of instance X = (sex = 0, fault = 1, driver_rating = 0, number_of_suppliments = 0.33) to be either fraud or legal. According to Step 2 of the naive Bayesian algorithm, with 2 fraud examples and 18 legal examples among the s = 20 training examples:

P(fraud) = 2/20 = 0.1
P(legal) = 18/20 = 0.9

According to Step 3 of the naive Bayesian algorithm, the attribute values of X are counted within the fraud examples. In particular, P(sex = "0" | fraud) = 0/2 = 0 and P(number_of_suppliments = "0.33" | fraud) = 0/2 = 0, so that

P(X | fraud) = Π_k P(x_k | fraud) = 0

According to Step 4 of the naive Bayesian algorithm, the attribute values of X are counted within the legal examples; for example, P(driver_rating = "0" | legal) = 6/18 = 0.33 and P(number_of_suppliments = "0.33" | legal) = 2/18 = 0.11, giving a small positive value for P(X | legal) = Π_k P(x_k | legal). Using the above probabilities and Step 5 of the naive Bayesian algorithm:

P(X | fraud) P(fraud) = 0
P(X | legal) P(legal) > 0

Therefore, the naive Bayesian classifier predicts that instance X is legal. P(sex = "0" | fraud) = 0 and P(number_of_suppliments = "0.33" | fraud) = 0 highlight the problem of any attribute value that is not present in the fraud examples from the training set: P(X | fraud) is always 0, and poor score set results are inevitable. The Laplace estimator improves this situation by adding 1 to the numerators and the number of attribute values to the denominators of the conditional probabilities that make up P(X | fraud) and P(X | legal) (Witten and Frank, 1999). For this to hold, an assumption that each attribute value is equally probable a priori has to be made when the algorithm is used. According to Step 3 of the naive Bayesian algorithm, the Laplace-estimated conditional probabilities P(x_k | fraud) are recomputed, now giving a small positive value for P(X | fraud).

According to Step 4 of the naive Bayesian algorithm, the Laplace-estimated conditional probabilities P(x_k | legal) are likewise recomputed to give P(X | legal). Using the above probabilities and Step 5 of the naive Bayesian algorithm, and converting the results into percentage format, P(fraud | X) = 13.55% and P(legal | X) = 86.45%. Therefore, instance X is about 6 times more likely to be legal. Since attributes are treated as though they are completely independent, the addition of redundant ones dramatically reduces predictive power. The best way of relaxing this conditional independence assumption is to add derived attributes (Elkan, 2001). These attributes are created from combinations of existing attributes (see Section 4.3.4). To illustrate this, the naive Bayesian classifier will learn using the training data with two derived attributes in Appendix E. The classifier has to predict the class of instance Y = (is_holidayweek_claim = 0, fault = 1, driver_rating = 0, age_price_wsum = 0.33) instead.
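The Laplace estimator used above, which adds 1 to each numerator and the number of attribute values to each denominator, can be sketched as follows; the counts are illustrative rather than taken from the appendix data.

```python
def laplace_probability(value_count, class_count, n_values):
    """Laplace-smoothed conditional probability P(attribute = v | class):
    add 1 to the numerator and the number of attribute values to the
    denominator, so that unseen values never produce a zero probability."""
    return (value_count + 1) / (class_count + n_values)

# An attribute value never seen with the fraud class (0 of 2 fraud
# examples, 3 possible attribute values) now gets a non-zero estimate:
print(round(laplace_probability(0, 2, 3), 2))  # 0.2
```

Because every conditional probability is strictly positive after smoothing, the product P(X | class) can no longer collapse to zero on account of a single unseen attribute value.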

Working through Steps 3, 4 and 5 of the naive Bayesian algorithm on the training data with the two derived attributes gives P(fraud | Y) = 0.25% and P(legal | Y) = 99.75%. Therefore, given two derived attributes in the training examples, instance Y is about 400 times more likely to be legal. Given this small illustration, and that the correct class label is legal, using derived attributes can result in more accurate predictions. The naive Bayesian classifier is one of the few classifier types that can handle missing values in training examples well (Elkan, 1997; Witten and Frank, 1999). To demonstrate this, the naive Bayesian classifier learns based on the training data with nine missing values in Appendix F to predict the class of instance Y = (is_holidayweek_claim = 0, fault = 1, driver_rating = 0, age_price_wsum = 0.33) instead.

Working through Steps 3, 4 and 5 of the naive Bayesian algorithm on the training data with nine missing values gives P(fraud | Y) = 2.58% and P(legal | Y) = 97.42%. Therefore, given nine missing values in the training examples, instance Y is about 38 times more likely to be legal. As the nine missing values are simply not included in the frequency counts, the probability ratios depend on the number of values that are actually present.

C4.5 Classifiers

The C4.5 classifier learns according to the algorithm in Appendix H, based on the training data with two derived attributes in Appendix E, to predict the outcome of instance Z = (is_holidayweek_claim = 0, fault = 1, driver_rating = 0.66, age_price_wsum = 0.5). The attribute age_price_wsum is used for demonstration purposes.

According to Step 2 of the C4.5 algorithm, with 2 fraud and 18 legal examples:

I(2, 18) = −(2/20) log₂(2/20) − (18/20) log₂(18/20) = 0.469

According to Step 3 of the C4.5 algorithm, E(age_price_wsum) is computed as the weighted sum of the entropies of the branches formed by the values of age_price_wsum. According to Step 4 of the C4.5 algorithm, Gain(age_price_wsum) = I(2, 18) − E(age_price_wsum). Working through the other attributes in the same way shows that Gain(age_price_wsum) has the highest information gain of the four attributes, ahead of Gain(is_holidayweek_claim), Gain(fault) and Gain(driver_rating). According to Step 5 of the C4.5 algorithm, a decision tree is created, in Figure 3.2, by first having a node named age_price_wsum, and branches are grown for each of the attribute's values. Each rectangular node depicts a test of an attribute and each oval node (leaf) represents a class. The number in every node indicates the number of examples within it.
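The entropy value I(2, 18) above can be reproduced directly; the helpers below are a generic sketch of the C4.5 information measure and gain, not Quinlan's full implementation.

```python
import math

def info(counts):
    """Entropy I(c1, c2, ...) of a class distribution, in bits."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 2 fraud and 18 legal examples out of 20:
print(round(info([2, 18]), 3))  # 0.469

def gain(parent_counts, subsets):
    """Information gain of a split: parent entropy minus the weighted
    entropy of each branch's class distribution."""
    total = sum(parent_counts)
    expected = sum(sum(s) / total * info(s) for s in subsets)
    return info(parent_counts) - expected
```

A split that separates the classes perfectly, for instance into branches with class counts (2, 0) and (0, 18), has zero expected entropy and therefore a gain equal to the full parent entropy.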

[Figure 3.2: Fully Grown Decision Tree. The root node tests age_price_wsum; its branches lead either directly to leaves or to further tests of driver_rating, fault and is_holidayweek_claim.]

Therefore, using Figure 3.2, the C4.5 classifier predicts that instance Z is legal. There is a serious problem if missing values exist in the training data, because it is not clear which branch should be taken when a node tests a missing attribute value. Witten and Frank (1999) suggest that the simplest solution is to assume that the absence of the value is not significant and to use the branch with the most examples for the missing attribute value. Using the examples with missing values in Appendix F, the resultant decision tree is the same as in Figure 3.2. To avoid overfitting, subtree-raising postpruning can be used to remove insignificant branches and leaves to improve performance (Witten and Frank, 1999). This is generally restricted to raising the subtree of the most popular branch, for example, the subtree of driver_rating. As shown in Figure 3.3, the raising is done because the branch from the driver_rating node to the fault node has equal to, or more, examples than the other leaves on the same level. The entire subtree from the fault node downward has been raised to replace the subtree of the driver_rating node in Figure 3.2.

[Figure 3.3: Pruned Decision Tree. The root node tests age_price_wsum; the raised fault subtree, with its is_holidayweek_claim test, replaces the driver_rating subtree of Figure 3.2.]

Therefore, after pruning, C4.5 predicts that instance Z is fraud, instead of legal as before pruning. This is a case of fragmentation, where the number of examples at each given branch is too small to be statistically significant, and the result is inaccuracy and incomprehensibility. Given more examples, other problems surface, like repetition, which occurs when an attribute is tested more than once along a given branch of a tree, and replication, which occurs when subtrees are duplicated. According to Han and Kamber (2000), derived attributes can mitigate fragmentation, repetition and replication, and they have been used in the C4.5 discussions above. Knowledge in decision trees can be extracted in the form of IF-THEN rules. One rule is created for each path from the root to a leaf node. Table 3.1 displays the seven descriptive rules created from Figure 3.3 so that the tree can be better understood. However, this ability is often overstated, as a large complex decision tree may contain many leaves that are not useful (Berry and Linoff, 2000), and the rules often lack interpretability (Elkan, 2001).
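The extraction of one IF-THEN rule per root-to-leaf path can be sketched over a small nested-dictionary tree; the tree below is a simplified hypothetical stand-in, not the exact tree of Figure 3.3.

```python
def extract_rules(tree, conditions=()):
    """Create one IF-THEN rule per path from the root to a leaf node.
    Internal nodes are dicts of {attribute: {value: subtree}}; leaves
    are class labels."""
    if not isinstance(tree, dict):
        cond = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {cond} THEN class = {tree}"]
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules

tree = {"fault": {0: "legal", 1: {"is_holidayweek_claim": {0: "fraud", 1: "legal"}}}}
for rule in extract_rules(tree):
    print(rule)
```

Each rule's preconditions are the ANDed attribute tests along its path, matching the form of the rules in Table 3.1.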

Table 3.1: Extracting Rules from a Decision Tree

Rule Number   IF-THEN Rule
1             IF age_price_wsum = 0.33 THEN class = legal
2             IF age_price_wsum = 0.67 THEN class = fraud
3             IF age_price_wsum = 0.84 THEN class = legal
4             IF age_price_wsum = 1 THEN class = legal
5             IF age_price_wsum = 0.5 AND fault = 0 THEN class = legal
6             IF age_price_wsum = 0.5 AND fault = 1 AND is_holidayweek_claim = 0 THEN class = fraud
7             IF age_price_wsum = 0.5 AND fault = 1 AND is_holidayweek_claim = 1 THEN class = legal

In the discussion so far, C4.5 has been applied on a small training set of only twenty examples. There are likely to be problems of scalability and efficiency when it is used to mine very large real data sets. Section proposes an approach to overcome this limitation.

Backpropagation Classifiers

The backpropagation classifier learns according to the algorithm in Appendix I, using the first training example in Appendix E: Example 1 = (is_holidayweek_claim = 0, fault = 1, driver_rating = 0, age_price_wsum = 0.5, class = d₁ = 0), where fraud is represented by 1 and legal by 0 in the class attribute. To simplify the discussion, the learning rate c and the steepness of the activation function λ are fixed at 1, and all initial weights w and v start at 0.1. Example 1 is fed into the network in Figure 3.4, and the net input and output of the hidden neurons and output neuron are calculated in Table 3.2 using Steps 2 and 3 of the backpropagation algorithm. The error of the output neuron is computed and propagated backwards in Table 3.3 to update the weights in Table 3.4 using Steps 4 and 5 of the backpropagation algorithm. Once all examples in the

training data set have been presented, one epoch is reached, and the total error of the network E is measured.

[Figure 3.4: Multilayer Feedforward Neural Network. Input neurons 1 to 4 receive x₁ = 0, x₂ = 1, x₃ = 0 and x₄ = 0.5, and are fully connected by weights w to hidden neurons 5 to 7, which are connected by weights v to output neuron 8 producing o; both layers also receive a bias input of −1. Source: adapted from Han and Kamber (2000, p309).]

[Table 3.2: Net Input and Output Calculations. For hidden neurons 5 to 7 and output neuron 8, the net input is the weighted sum of the neuron's inputs, including the −1 bias input, and the output y is obtained by applying the sigmoid activation function y = 1/(1 + e^(−net)) to the net input.]

[Table 3.3: Calculation of the Error at each Neuron. The error of output neuron 8 is computed from the difference between its output o and the desired output d₁ = 0, and is propagated backwards to hidden neurons 5 to 7.]

[Table 3.4: Calculations for Weight Updating. Each weight v and w_ij is updated by adding the product of the learning rate, the error of the receiving neuron, and the corresponding input.]

E is continually reduced for a certain number of epochs, and the weights are finalised. After training all 20 examples using software, the network ranks the proportion of attribute strength in descending order: age_price_wsum = 0.53, fault = 0.23, driver_rating = 0.13 and is_holidayweek_claim = 0.11. The determination of attribute strength is important in deciding which attributes are redundant and can be discarded. Therefore, the backpropagation classifier generates a numerical output from instance Z = (is_holidayweek_claim = 0, fault = 1, driver_rating = 0.66, age_price_wsum = 0.5). The interpretation of this value is subject to the decision threshold value. If the decision threshold is 0.5, instance Z is considered to be legal. If the decision threshold is 0.05, instance Z is regarded as fraud. To prevent overfitting, a regularisation method such as early stopping is used to reduce its risk (Elkan, 2001). From the backpropagation algorithm discussions above, three other main problems are obvious. Firstly, there cannot be any missing values in the training examples. Secondly, large data sets definitely result in very long training sessions. This is due to the numerous calculations and iterations involved in the algorithm, and the extensive testing and retraining of parameters. Section proposes an approach to improve this problem. Thirdly, the results of the network are difficult to interpret. This

limitation is addressed in later sections.

Analysis of Algorithms

The input data must be the same for all three algorithms so that predictions can be compared and combined. The naive Bayesian and C4.5 algorithms can train with both numeric and non-numeric data, but the backpropagation algorithm must always train with numeric data. Due to this incompatibility, all training examples are scaled to numbers between 0 and 1, and transformed into one-out-of-n and binary encodings during data preparation. There are two significant consequences of this new data requirement: all attributes must be treated as non-numeric for the naive Bayesian and C4.5 algorithms, and as numeric for the backpropagation algorithm; also, the C4.5 graphic trees and descriptive rules become harder to interpret.

These three algorithms promise the best predictive capability in fraud detection. Other classification algorithms, such as k-nearest neighbour, case-based reasoning, genetic algorithms, and rough sets, either have scalability problems or are still in their prototype phase (Han and Kamber, 2000). Each of the three algorithms can be applied to many situations and possesses unique strengths, and most of their weaknesses can be reduced with some algorithm modifications. More importantly, the diverse nature of their output computation allows their classifiers to recognise a wide range of fraud scenarios. Pie charts from the naive Bayesian algorithm, and graphic decision trees and descriptive rules from the C4.5 algorithm, can improve the analysis and interpretation of the data.

3.2 STEP TWO: COMBINING OUTPUT AS INTEGRATION MECHANISMS

This section advocates an improved statistical approach to data preparation and training for combining classifier predictions. The use of bagging and stacking techniques for combining predictions from all classifiers, and main predictions from all algorithms, is demonstrated and evaluated.
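The 0-to-1 scaling and one-of-n encoding mentioned in the analysis above can be sketched as follows; the attribute names and value ranges are illustrative assumptions, not the data set's actual ones.

```python
def scale01(value, lo, hi):
    # Min-max scaling of a numeric attribute into the range 0..1
    return (value - lo) / (hi - lo)

def one_of_n(value, categories):
    # One-out-of-n encoding: one position per discrete value
    return [1 if value == c else 0 for c in categories]

# driver_rating assumed to range over 1..4 for illustration
print(scale01(3, 1, 4))                                  # 0.666...
# a hypothetical categorical attribute with three values
print(one_of_n("sedan", ["sedan", "sport", "utility"]))  # [1, 0, 0]
```

The same treatment lets one table of numbers serve all three algorithms: the encoded columns read as categories for naive Bayesian and C4.5, and as numeric inputs for backpropagation.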

3.2.1 Cross Validation

Cross validation is an important statistical technique that separates the training data set into a fixed number of folds, or data partitions. Stratification reduces variation in the data by ensuring that sampling is done randomly so that each class is properly represented in the data partitions. Stratified tenfold cross validation is the standard way of assessing the success rate of an algorithm on a fixed sample of data: each data partition is left out in rotation, the algorithm trains on the remaining nine data partitions, and the success rate is calculated on the partition that was left out. There are therefore ten different training sessions, and the ten success rate estimates are averaged. This often yields a slightly better success rate (Witten and Frank, 1999).

This study uses a slight variation of cross validation. Instead of ten data partitions, an odd number of partitions (eleven) is used so that there will always be a majority class when the partitions contribute their class votes (see Section 3.2.2). In rotation, each data partition is used once each for training, testing and evaluation. A training data partition is used to build a classifier, a test data partition to optimise the classifier's parameters, and an evaluation data partition to compare the classifier with others. All the data partitions used by a single classifier are independent of each other, and all classifiers must be trained before they are scored.

Figure 3.5 demonstrates that the algorithm is first trained on partition 1 to generate classifier 1, tested on partition 2 to refine the classifier, and evaluated on partition 3 to assess the expected accuracy of the classifier. The algorithm is next trained on partition 2 to generate classifier 2, tested on partition 3, and evaluated on partition 4. This continues until there are eleven training sessions with eleven classifiers. The classifiers are then applied to the score data set.
Their corresponding success rate estimates and class predictions are recorded for further analysis (see Section 5.1.1).
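The rotation described above can be sketched as follows: classifier i trains on partition i, is tuned on partition i+1, and is evaluated on partition i+2. The wrap-around at the end of the schedule is an assumption; the thesis does not spell out the final rotations.

```python
def rotation(n_partitions):
    # For each classifier, assign (train, test, evaluate) partition
    # indices in rotation, wrapping around at the end
    roles = []
    for i in range(n_partitions):
        train = i
        test = (i + 1) % n_partitions
        evaluate = (i + 2) % n_partitions
        roles.append((train, test, evaluate))
    return roles

schedule = rotation(11)
print(schedule[0])   # (0, 1, 2): train on partition 1, test on 2, evaluate on 3
print(schedule[10])  # (10, 0, 1): the last rotation wraps around
```

Eleven classifiers therefore see eleven distinct (train, test, evaluate) triples, and each partition plays each role exactly once.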

[Figure 3.5: Building and Applying One Classifier using Data Partitions. Source: adapted from Berry and Linoff (2000, p. 185). Partition 1 produces classifier 1 (rough), partition 2 refines it, and partition 3 yields the best version, which is then applied to the 1996 score data of 4083 examples.]

This new approach to cross validation is important for three reasons. Firstly, there is no need to train on the remaining eight partitions, as there are more than one thousand examples in each partition (see Section 4.3.5). Secondly, tenfold cross validation is only a guideline; ninefold or elevenfold cross validation is likely to be almost as good (Witten and Frank, 1999). Thirdly, the much smaller data sets overcome the scalability problems of the C4.5 algorithm, and reduce training times for the backpropagation algorithm by allowing training to be done on multiple computers.

3.2.2 Bagging

Bagging (Breiman, 1994) combines the classifiers trained by the same algorithm using unweighted majority voting on each example or instance. Voting denotes the contribution of a single vote, its own prediction, from each classifier. The main prediction is then decided by the majority of the votes. Table 3.5 shows eleven columns of naive Bayesian predictions; the last column contains two main predictions from bagging. The first main prediction indicates fraud, as there are seven fraud predictions and four legal predictions from the eleven classifiers. The second main prediction suggests legal, as there are five fraud predictions and six legal predictions from the eleven classifiers. Generally, bagging performs significantly better than the single model for the C4.5 and backpropagation algorithms (Feelders, 2003). It is never substantially worse, because it neutralises the instability of the

classifiers by increasing the success rate (Witten and Frank, 1999).

Table 3.5: Bagging Predictions from the Same Algorithm

  Eleven classifier predictions                                        Main Prediction
  fraud fraud legal fraud legal fraud legal fraud fraud legal fraud    fraud
  fraud fraud fraud legal legal fraud legal legal legal fraud legal    legal

Bagging can also combine the classifiers trained by different algorithms. The class with the highest tally of main predictions is the final prediction. Table 3.5 (different algorithms) shows that the final prediction on the first row is fraud, because it is opposed only by the backpropagation algorithm. The final prediction on the second row is legal, as agreed by both the naive Bayesian and C4.5 algorithms. However, if two of the algorithms perform consistently worse than the more accurate algorithm, bagging will not improve the final predictions.

Table 3.5: Bagging Predictions from Different Algorithms

  Naive Bayesian   C4.5    Backpropagation   Final Prediction
  fraud            fraud   legal             fraud
  legal            legal   fraud             legal

3.2.3 Stacking

Stacking (Wolpert, 1992) combines classifiers trained by different algorithms by using a meta-classifier, as shown in Figure 3.6. To classify an instance, the base classifiers present their predictions to the meta-classifier, which then makes the final predictions, as shown in Figure 3.7. Figure 3.6 shows three instead of eleven training data partitions to enable better understanding of this strategy.
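The unweighted majority vote that bagging applies in the tables above can be sketched as:

```python
from collections import Counter

def bag(predictions):
    # Unweighted majority vote over the class predictions of several classifiers
    return Counter(predictions).most_common(1)[0][0]

# The two rows of Table 3.5: eleven naive Bayesian predictions each
row1 = ["fraud", "fraud", "legal", "fraud", "legal", "fraud",
        "legal", "fraud", "fraud", "legal", "fraud"]
row2 = ["fraud", "fraud", "fraud", "legal", "legal", "fraud",
        "legal", "legal", "legal", "fraud", "legal"]
print(bag(row1))  # fraud (7 votes to 4)
print(bag(row2))  # legal (6 votes to 5)
```

With an odd number of voters and two classes, a tie is impossible, which is exactly why eleven partitions were chosen over ten.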

[Figure 3.6: Stacking Classifiers from Examples. Three data partitions feed the naive Bayesian, C4.5 and backpropagation algorithms, each producing three classifiers and three sets of predictions; the nine prediction sets form the combined training data for a naive Bayesian meta-classifier.]

[Figure 3.7: Stacking Classifiers from Instances. The nine base classifiers generate predictions from the score data set, which are combined and passed to the meta-classifier to produce the final prediction.]

1. In Figure 3.6, nine base classifiers are computed by the learning algorithms over three data partitions that are rotated as training, testing and evaluation data partitions. In Figure 3.7, the same nine base classifiers are used to generate predictions from the score data set.
2. Next, nine sets of predictions are generated.
3. In Figure 3.6, the combined training set is composed from these nine sets of predictions, which become the attributes, and the actual classification, which becomes the class label. In Figure

3.7, the combined training set does not have the class label.
4. In Figure 3.6, the meta-classifier is trained over this combined training set. In Figure 3.7, the meta-classifier is used to generate the final predictions.

Stacking learns which classifiers are the reliable ones by learning the relationship between their predictions and the correct class. Hence, redundant classifiers can be discarded to increase predictive accuracy. Two problems emerge when combining the naive Bayesian, C4.5 and backpropagation algorithms with bagging and stacking. First, the naive Bayesian and C4.5 predictions, due to software constraints, are non-numeric, taking the form of either fraud or legal, whereas backpropagation predictions are probabilities between 0 and 1. The remedy is to convert the backpropagation output into categorical predictions using the most appropriate decision threshold value. Second, there is no specialised algorithm to produce a meta-classifier. Since most of the classification work has been done by the base classifiers, the naive Bayesian algorithm, which is simple and fast, is used.

3.3 STEP THREE: CLUSTER DETECTION AS ANALYTICAL MACHINERY

This section demonstrates and recommends the SOM for cluster detection, and highlights the advantages and disadvantages of using the SOM in fraud detection.

Self Organising Maps

Cluster detection, or clustering, is the natural grouping of similar data. Each group, or cluster, consists of data instances that are similar to one another and dissimilar to data instances in other clusters (Han and Kamber, 2000). Unlike trained classifiers, clustering does not rely on class-labelled training examples. Although many other clustering algorithms are available, such as partitioning, hierarchical, density-based and grid-based methods, the focus here is on the model-based clustering method using neural networks. The SOM (Kohonen, 1982) is the neural network approach to clustering. It consists

of two layers of processing units: an input layer fully connected to a competitive output layer. Figure 3.8 shows 4 input neurons and 16 output neurons with a neighbourhood size of 1.

[Figure 3.8: The Self Organising Map. Source: adapted from Bigus (1996, p. 72). Four input neurons (x1 = 0, x2 = 1, x3 = 0, x4) are fully connected to a grid of 16 output neurons; not all weights are shown.]

1. Example 1 is presented to a SOM with randomised weight values, and the output neurons compete with each other to become the winner. To become the winning output neuron, or winner, its connection weights must be the closest, based on Euclidean distance, to Example 1.
2. The winner has the right to have its weights adjusted. The connection weights are moved in the direction of Example 1 by the learning rate parameter.
3. The adjacent output neurons in the neighbourhood of the winner have their weights adjusted, so the connection weights of the whole neighbourhood are moved in the direction of Example 1.
4. As training progresses, the size of the neighbourhood and the learning rate are decreased. Training stops after a pre-specified number of epochs.

Analysis of Self-Organisation

High-dimensional elements, which contain the knowledge about the likely profiles, times and locations of illegitimate activity, are automatically clustered together into a simpler and more transparent two

dimensional map. Connection weights of each cluster allow the evaluation of the relationships between the clusters. For these two main reasons, the SOM is used to combat internet fraud, border smuggling, arson, and organised burglary (Mena, 2003b).

The trained SOM results can be separated into two distinct parts, fraud instances versus legal instances, for further analysis of the classifiers' results. The primary purpose of appending the main predictions and the final prediction to the clustered data is to find out which clusters have the most fraud instances that are either undetected by all three algorithms or detected by only one algorithm, and to interpret why this is so. Table 3.6 shows where the fraud examples fall among the SOM's clusters. Clusters 1 and 5 each have a fraudulent instance that cannot be detected. There is a need to find out which clusters have a significant number of these undetectable instances, and then interpret the likely profiles, times and characteristics of fraud from these clusters.

Table 3.6: Evaluation of Fraud Detection Capability of Algorithms within Clusters

  Class  SOM (5 Clusters)  Naive Bayesian  C4.5   Backpropagation  Bagged Result
  1      2                 fraud           fraud  fraud            fraud
  1      3                 legal           fraud  fraud            fraud
  1      4                 fraud           fraud  legal            fraud
  1      5                 legal           fraud  legal            legal
  1      1                 legal           legal  legal            legal

The first critical issue in SOMs is the choice of the optimum training architecture and parameters, which can only be determined empirically. If the number of clusters is too large, the results are hard to present; if it is too small, insights are difficult to find. Parameters such as the initial weights and the number of learning epochs affect the training time of the SOM. The learning rate and neighbourhood size control how much each neuron learns and how many neurons adjust their connection weights. Although the SOM assigns each instance to a cluster, it does not specify an easy-to-understand model.
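The winner-selection and weight-update steps outlined above can be sketched as follows; the example and weight values are illustrative, and the neighbourhood is reduced to the winner alone for brevity.

```python
import math

def winner(example, weights):
    # The winning output neuron is the one whose weight vector is
    # closest to the example by Euclidean distance
    dists = [math.dist(example, w) for w in weights]
    return dists.index(min(dists))

def update(example, weights, win, lr):
    # Move the winner's weights toward the example by the learning rate
    # (a full SOM would also update the winner's neighbourhood)
    weights[win] = [w + lr * (x - w) for w, x in zip(weights[win], example)]

# Illustrative values: a 4-input SOM with three output neurons
example = [0.0, 1.0, 0.0, 1.0]
weights = [[0.5, 0.5, 0.5, 0.5],
           [0.1, 0.9, 0.2, 0.8],
           [0.9, 0.1, 0.8, 0.2]]
win = winner(example, weights)
update(example, weights, win, lr=0.5)
print(win, [round(w, 2) for w in weights[win]])
```

Decreasing `lr` and the neighbourhood size over the epochs, as in step 4, lets the map settle into stable clusters.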
The output layer must be examined to determine what collection of strong attributes form the

clusters, and then calculate the typical values for instances in the same cluster. Column and radar graphs enable easier understanding of the clusters (see Section 3.4.2).

3.4 STEP FOUR: VISUALISATION TECHNIQUES AS VISUAL SYMBOLS

This section focuses on the integration of human perceptual abilities into the data analysis process by presenting the data in some visual and interactive form (Keim and Ward, 2003). It shows how visualisation can facilitate the analysis of classifier predictions and performance, and outlines how visualisation can aid the interpretation of clustering results.

Classification Visualisation

The confusion matrix in Figure 3.9 is required to examine any classifier's performance in detail. True positives (hits) and true negatives (normals) are correct classifications. False positives (false alarms) happen when the actual outcome is legal but is incorrectly predicted as fraud. False negatives (misses) occur when the actual outcome is fraud but is incorrectly predicted as legal. Precision is the number of correct positive predictions divided by the number of positive predictions made. Recall is the number of correct positive predictions divided by the number of actual positives.

[Figure 3.9: The Confusion Matrix.

            fraud            legal
  Alert     True Positive    False Positive
  No alert  False Negative   True Negative

  The figure also annotates the counts of correct and incorrect predictions, the total number of records, the accuracy (success rate), the error rate, and the precision and recall measures.]

Naive Bayesian visualisation provides an interactive view of the prediction results. The main advantage is that attributes can be sorted by the best predictor, and evidence items can be sorted by the number of items in their bins. Attribute contribution column graphs help determine which are the

significant attributes in neural networks. Decision tree visualisation builds graphic trees from the splitting attributes of C4.5 classifiers. Cumulative lift charts plot the proportion of examples in which the classifier correctly predicts a class, divided by the proportion of total examples in that class. For example, if five percent of all examples are actually fraudulent and a naive Bayesian classifier correctly predicts 20 fraud examples per 100 examples, that corresponds to a lift of 4. This can effectively be used to compare the performance of the three algorithms. These classification visualisation techniques allow insights to be gained and conclusions to be drawn; they are a convenient way of making discoveries apparent without going into precise mathematical detail.

Clustering Visualisation

Column graphs indicate which typical examples belong to a cluster and what differentiates each cluster. Summaries of the independent columns can be graphed by cluster. Radar graphs determine how similar or dissimilar the clusters are by mapping the distances between the cluster centres. These clustering visualisation techniques provide a much higher degree of confidence in the findings of the exploration and can lead to more interesting results.

3.5 SUMMARY OF NEW METHOD

There are four consecutive steps in this crime detection method to classify new data instances, as shown in Figure 3.1. Step One produces predictions for each new instance by treating all classifiers as black boxes so that three learning algorithms can be used. The recommendation is to train and score the simplest learning algorithm first, followed by the more complex ones. In this way, naive Bayesian classifiers are computed first, followed by the C4.5 and then the backpropagation classifiers. The naive Bayesian predictions can be quickly obtained and analysed while the other predictions, which take longer

training and scoring times, are being processed.

Step Two outputs a final prediction for each new instance in two ways. First, all predictions from each learning algorithm are bagged to produce a main prediction; the three main predictions from the three learning algorithms are then bagged to produce a final prediction. Second, all predictions from the three learning algorithms are combined into a training set for the meta-classifier to produce a final prediction. Predictions can be evaluated at the classifier, learning algorithm and combined levels. The choice of classifiers determines the main predictions of the learning algorithm, and the performance of the learning algorithms affects the final prediction. Therefore, the selection of the best classifiers outputs the most accurate final prediction.

Step Three generates an analysis of crime from the SOM clusters. The main predictions and the final prediction are also added to each corresponding clustered instance to understand the relationship between the classifier predictions and the clusters. The most relevant and important attributes for the analysis are selected from the backpropagation classifiers because they produce the attributes' strengths as probabilities.

Step Four describes the criminal patterns in a simple and understandable form by visualising the data and predictions from Steps One, Two and Three using graphs, scores, and rules to gain knowledge or insights.
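The confusion-matrix measures and the lift calculation from the classification visualisation step can be sketched as follows; the counts are hypothetical.

```python
# Hypothetical confusion-matrix counts:
# tp = hits, fp = false alarms, fn = misses, tn = normals
tp, fp, fn, tn = 20, 30, 5, 945

total = tp + fp + fn + tn
accuracy  = (tp + tn) / total
precision = tp / (tp + fp)       # correct fraud alerts per alert raised
recall    = tp / (tp + fn)       # fraud cases caught per actual fraud case

# Lift: fraud proportion among the alerts relative to the base fraud rate
base_rate  = (tp + fn) / total   # 25/1000 = 2.5% actually fraudulent
alert_rate = tp / (tp + fp)      # fraud proportion within the alerts
lift = alert_rate / base_rate
print(round(precision, 2), round(recall, 2), round(lift, 1))
```

With these counts the classifier concentrates fraud sixteen-fold among its alerts, which is the kind of comparison the cumulative lift charts make visible across the three algorithms.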

CHAPTER 4
IMPLEMENTING THE CRIME DETECTION SYSTEM: PREPARATION

Science is a way of thinking much more than it is a body of knowledge.
- Carl Sagan, 1979, Broca's Brain: Reflections on the Romance of Science

Before anything else, preparation is the key to success.
- Alexander Graham Bell

The landmark research work by Fayyad et al (1996) outlined the basic steps of the data mining process (see Appendix B). Brachman and Anand (1996) complemented this process by focusing on the human-process interaction. This was superseded by the practical and comprehensive CRoss-Industry Standard Process for Data Mining (CRISP-DM) methodology (Chapman et al, 2000). This chapter and the next make up the preparation and action components of the CRISP-DM-based crime detection system shown in Figure 4.1. The crime detection method is built into the action component of the crime detection system. Section 4.1 applies Phase One to comprehend the investigation problem. Section 4.2 examines Phase Two to understand the data, and Section 4.3 applies Phase Three to prepare the data for mining. Section 4.4 supplements the system with a more realistic perspective.

[Figure 4.1: Phases of the Crime Detection System. Source: adapted from Chapman et al (2000, p. 13). The preparation component (Chapter 4) comprises 4.1 Problem Understanding, 4.2 Data Understanding and 4.3 Data Preparation; the action component (Chapter 5) comprises 5.1 Modelling, 5.2 Evaluation and 5.3 Deployment, built around the Chapter 3 crime detection method.]

4.1 PHASE ONE: PROBLEM UNDERSTANDING

This section implements the Phase One tasks by understanding the investigation objectives and situation, converting them into data mining objectives, and then developing a preliminary plan for achieving all the objectives.

Determine the Investigation Objectives

The primary organisational objective is to pick out the potential fraudsters from new insurance claims and confirm most of them as illegal with some further investigation. The investigative data mining must also explain the likely fraudster demographic profile, when they tend to strike, and what the fraud characteristics are. To determine the value of the extracted knowledge from a business perspective, success is measured by expenditure reduction and is achieved when the project attains at least thirty percent of the maximum cost savings possible.

Assess the Situation

One data analyst has been provided with Angoss KnowledgeSeeker IV, Data to Knowledge (D2K), Microsoft Excel, NeuroShell2, and Viscovery SOMine to accomplish the organisational objective. Other data mining tools for fraud detection were evaluated by Abbot et al (1998). The only available fraud detection data set in automobile insurance can be obtained from the Angoss KnowledgeSeeker software (Pyle, 1999). This investigation removes the class label from the third year of data in order to use it as the score data set. There is a possibility that demographic attributes do not hold as much predictive value as behavioural attributes (Elkan, 2001). There is a risk of not completing the project by the deadline, so the focus is on using the crime detection method within the overall process. A glossary of the main data mining and automobile insurance terms is provided (see Appendix A).

One aim of data mining is to estimate the benefits and costs (Zadrozny and Elkan, 2001) of detecting fraud. The required cost model has two assumptions.
First, all alerts must be investigated; second, the average cost per claim must be at least ten times the average cost per investigation. The

average cost per claim for the score data set is approximated at USD$2,640 (Insurance Information Institute, 2003) and the average cost per investigation is estimated at USD$203 for ten manpower hours (Payscale, 2003).

Table 4.1: Costs of Predictions

  Alert, fraud:     Hit Cost = Number of Hits * Average Cost Per Investigation
  Alert, legal:     False Alarm Cost = Number of False Alarms * (Average Cost Per Investigation + Average Cost Per Claim)
  No alert, fraud:  Miss Cost = Number of Misses * Average Cost Per Claim
  No alert, legal:  Normal Cost = Number of Normal Claims * Average Cost Per Claim

Table 4.1 shows that hits and false alarms incur investigation costs. False alarms are the most expensive because they incur both investigation and claim costs. Misses and normals pay out the usual claim cost. There are two extremes of this model: at one end, data mining is not used at all (no action), so all claims are regarded as normals; at the other end, data mining achieves perfect predictions (best case scenario), so all claims are predicted as hits and normals. Therefore, the evaluation metrics for the predictive models on the score data set, used to find the optimum cost savings, are:

  Model Cost Savings = No Action Cost - [(Miss Cost + False Alarm Cost + Normal Cost) + Hit Cost]
  Percentage Saved = (Model Cost Savings / Best Case Scenario Cost Savings) * 100

Determine the Data Mining Objectives

The primary data mining objective is to detect the maximum number of hits with the minimum number of false alarms for new claims under the cost model, given two years of claim history.

Produce the Project Plan

One hundred working days are given to complete the project. Ten days are allocated to each of the problem understanding, evaluation and deployment phases. Fifteen days are set aside for the data understanding phase. Thirty-five days are allotted for the modelling phase and twenty days for the data preparation phase. According to accepted industry standards, the modelling phase generally needs less

time, but trying out the crime detection method requires a longer experimentation period. Also, the data understanding and data preparation phases usually last much longer; however, in this context, the data set in question does not need to be collected and integrated. The tools required to demonstrate the techniques in the project are D2K for the naive Bayesian and C4.5 algorithms, NeuroShell2 for the backpropagation and SOM algorithms, Microsoft Excel for visualisations, and Angoss KnowledgeSeeker IV for rule induction.

4.2 PHASE TWO: DATA UNDERSTANDING

This section does not describe the recommended data collection task in Phase Two because the data set has already been provided. Familiarisation with the data set is done by describing and exploring the data and verifying its quality.

Describe the Data

The automobile insurance data set contains examples from January 1994 to December 1995, and 3870 instances from January 1996 to December 1996. It has a 6% fraudulent and 94% legitimate distribution, with an average of 430 claims per month. The original data set (see Appendix J) has 6 numerical attributes and 25 categorical attributes, including the two-category class label (fraud/legal). From a critical viewpoint, the data analyst can lament the lack of very powerful fraud predictors in the form of behavioural attributes such as a claimant's occupation, salary, and education level. Without the cost of each individual claim, a more realistic cost model cannot be made. On the other hand, the data analyst must make the most of the knowledge in the data even if it has problems such as incompleteness and inconsistency (Chen, 2001).

Explore the Data

Through extensive querying and visualisation of each data attribute, three interesting facts stand out. Figure 4.2 shows that the number of fraudulent claims underwent a few rise-and-fall cycles. January

1994 to August 1994 accounts for 91% of fraudulent claims for the year 1994. After three quiet months, there was another increase in fraudulent claims from December 1994 to March 1995. April 1995 to August 1995 accounts for only 4% of fraudulent claims for the year 1995, after which fraud was rife up to December 1995. The consecutive months with dominant illegitimate activity could be due to a wave of hard fraud committed by professional offenders. The data analyst can then hypothesise that fraudulent claims will probably drop in the first few months of 1996.

[Figure 4.2: Claim Trends by Month. The figure plots the monthly frequency of legitimate versus fraudulent claims over the consecutive months from January 1994 to December 1995.]

Figure 4.3 depicts that, for both years 1994 and 1995, if the age of the vehicle is between 3 and 6 years, then the chance of a claim being fraudulent is higher than the overall average of 6%. Given that both years have a vehicle age category that exceeds the rest by at least 5%, the data analyst can speculate that there is likely to be one or more outlier categories in that year.

[Figure 4.3: Proportion of Fraud within the Vehicle Age Category. A bar chart of the proportion of fraudulent versus total claims for each vehicle age category, from new to more than 8 years old, for the years 1994 and 1995.]

Figure 4.4 shows the proportion of fraud committed across the age groups for the years 1994 and 1995. In both years, middle-aged claimants were responsible for almost 80% of the fraud. But the interesting group is the younger fraudsters, between 16 and 25 years old. Although they account for only 6.34% of the total fraud, the proportion of fraud within their age groups is 13%, which is twice the proportion in the whole data set. Because these younger age groups account for only 3% of all the claims, they are relatively easy to monitor.

[Figure 4.4: Proportion of Fraud within the Policy Holder Age Category. A chart of the share of total fraud by policy holder age group (16<=X<=17, 18<=X<=20, 21<=X<=25, 26<=X<=30, 31<=X<=35, 36<=X<=40, 41<=X<=50, 51<=X<=65, X>65), with the middle-aged groups contributing the largest shares.]
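The proportion-of-fraud-within-category exploration behind Figures 4.3 and 4.4 can be sketched as follows; the claims list is a tiny hypothetical sample, not the thesis data.

```python
from collections import defaultdict

def fraud_proportion_by(claims, attribute):
    # Proportion of fraudulent claims within each category of an attribute
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for claim in claims:
        key = claim[attribute]
        totals[key] += 1
        if claim["class"] == "fraud":
            frauds[key] += 1
    return {k: frauds[k] / totals[k] for k in totals}

# Hypothetical sample claims
claims = [
    {"age_of_vehicle": "3 years", "class": "fraud"},
    {"age_of_vehicle": "3 years", "class": "legal"},
    {"age_of_vehicle": "3 years", "class": "legal"},
    {"age_of_vehicle": "new",     "class": "legal"},
]
print(fraud_proportion_by(claims, "age_of_vehicle"))
```

Categories whose proportion stands well above the overall 6% base rate, such as the 3-to-6-year vehicles or the 16-to-25 age groups, are the ones the analyst flags for monitoring.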

4.2.3 Verify the Data Quality

The data quality is good, but there are some impediments. The original data set contains the attribute PolicyType, which is an amalgamation of the existing attributes VehicleCategory and BasePolicy. There are invalid values of 0 in each of the attributes MonthClaimed and DayofWeekClaimed for one example. Some attributes with two categories, like WitnessPresent, AgentType, and PoliceReportFiled, have highly skewed values where the minority examples account for less than 3% of the total examples. The attribute Make has a total of 19 possible attribute values, of which claims from Pontiac, Toyota, Honda, Mazda, and Chevrolet account for almost 90% of the total examples. There are four spelling mistakes in Make: Accura (Acura), Mecedes (Mercedes), Nisson (Nissan), and Porche (Porsche). The attributes address_change-claim and number_of_cars can have fewer discrete values. For claims made by 16 to 20 year olds, more than 95% of the vehicles are barely one year old and 95% of the insured vehicle prices are above $69,000. This information seems to conflict with common sense, because most young drivers cannot afford new and expensive cars.

4.3 PHASE THREE: DATA PREPARATION

This section encompasses the tasks in Phase Three to build the input data set for the learning algorithms. The suggested data integration task is not necessary because only one data set is available. Instead, the data partitioning task is added to prepare multiple data partitions for the modelling phase. Therefore, the data preparation phase involves the selecting, cleaning, formatting, constructing and partitioning of the data.

Select the Data

Of the original data attributes given (see Appendix J), all have been made discrete in advance. For example, instead of a real-valued attribute giving the precise value of the insured vehicle, the data set includes only a discrete-valued attribute that categorises this amount into one of six discrete levels.
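Discretising a real-valued amount into a small number of levels, as described above, can be sketched as follows; the price boundaries are illustrative assumptions, not the data set's actual levels.

```python
import bisect

# Hypothetical upper boundaries for six vehicle-price levels (USD);
# five boundaries delimit six bins overall
BOUNDARIES = [20000, 29000, 39000, 59000, 69000]

def price_level(price):
    # Returns a discrete level 1..6 for a real-valued vehicle price
    return bisect.bisect_right(BOUNDARIES, price) + 1

print(price_level(25000))  # 2
print(price_level(80000))  # 6
```

Pre-discretised attributes like this suit naive Bayesian and C4.5 directly, at the price of losing the exact amounts.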
Most of the data attributes are retained for the data analysis. The only attribute discarded up to this stage is PolicyType, which is redundant. Domain knowledge indicates that the

fraud is usually masterminded by middle-aged male policyholders (National White Collar Crime Center, 2002; Smith, 2003). Therefore, the attribute values Fault = PolicyHolder and Age = 31<=X<=50 can be key indicators of fraud.

Clean the Data

To improve the data quality (see Section 4.2.3), the invalid values can be replaced with the majority attribute value. In this case, MonthClaimed = January and DayofWeekClaimed = Monday replace the 0 values, but the next simplest alternative is to delete the example. The four spelling mistakes in the attribute Make are corrected for all the examples. Nothing can be done with the three highly skewed attributes. The number of discrete values in the attribute Make is not reduced, because certain car brands with few claims can be responsible for high fraud occurrences. The attributes address_change-claim and number_of_cars have five discrete values, which are normalised to four (Pyle, 1999).

Format the Data

All characters are converted to lowercase letters, and the underscore symbol is used for better readability. The attribute Year has been placed beside the other date attributes for easier comparison.

Construct the Data

Three derived attributes, weeks_past, is_holidayweek_claim, and age_price_wsum, are created to increase the predictive accuracy of the algorithms (see Appendix J). The new attribute weeks_past represents the time difference between the accident occurrence and its claim application. The hypothesis is that if this difference is larger than average, fraud is more likely. The position in the year of the week in which the claim was made is calculated from the attributes month_claimed, week_of_month_claimed, and year. The position in the year of the week in which the accident is reported to have happened is computed from the attributes month, week_of_month, and year. The latter is subtracted from the former to obtain the derived attribute weeks_past. This derived attribute is then

categorised into eight discrete values.

The derived attribute is_holidayweek_claim indicates whether the claim was made in a festive week (Berry and Linoff, 2000). The data analyst speculates that average offenders are more likely to strike during those weeks because of the need to increase their spending. A major assumption has to be made that this data set is from the US. Therefore, for the years 1994 and 1995, the attribute is_holidayweek_claim is set to 1 if the claim is made during a week containing at least one US public holiday. This computed attribute is binary-valued.

The attribute age_price_wsum is the weighted sum of two related attributes, age_of_vehicle and vehicle_price. The premise is that the older and more expensive the vehicle, the higher the possibility that the claim is fraudulent. This derived attribute has integer values categorised into seven discrete values.

This thesis requires all input for the naive Bayesian, C4.5 and backpropagation algorithms to be numerical (see Appendix K). Fourteen attributes are scaled in the range 0 to 1. Nineteen attributes with no logical ordering are represented by either one-of-n or binary coding. According to Smith (1999), one-of-n coding represents a unique set of inputs with a vector whose length equals the number of discrete values for the attribute. In Appendix J, the values 1994, 1995 and 1996 for the attribute Year are represented by 1 0 0, 0 1 0, and 0 0 1 respectively. This coding is simple to understand and easy to use, but it is not suitable for attributes with a large number of values. Binary coding overcomes this limitation of one-of-n coding, at the cost of increased complexity, by representing each discrete value with a string of binary digits (Bigus, 1996). In Appendix J, the twelve values of the attribute month are represented with a binary code vector of length 4 (16 possible values). In Appendix K, the attribute values January and December are converted to

their respective binary codes.

Partition the Data

According to Chan et al (1999), the desired distribution of the data partitions belonging to a particular data set must be determined empirically. In related studies, Chan and Stolfo (1995) and Prodromidis (1999) recommended that data partitions should be neither too large for the time complexity of the learning algorithms nor too small to produce poor classifiers. Given this, the approach in this thesis is to randomly divide all the legal examples from the years 1994 and 1995 into eleven sets. The data partitions are formed by merging all the available X fraud examples with each of the eleven sets containing the legal examples. Skewed data is thus transformed into partitions with more of the rarer examples and fewer of the common examples, using the procedure known as data multiplication (Pyle, 1999) or oversampling (Berry and Linoff, 2000). Figure 4.5 highlights repeating the same 923 fraud examples in the eleven data partitions, each with a different 923 legal examples, to generate a 50/50 fraud/legal distribution. Other possible distributions are 40/60 (615 fraud examples/923 legal examples) and 30/70 (396 fraud examples/923 legal examples). Therefore, there are eleven data partitions with 1846 examples, another set of eleven data partitions with 1538 examples, and a last set of eleven data partitions with 1319 examples. With 923 legal examples, each partition contains roughly two months of legal examples, which is significant for deployment: classifiers can be chosen that move easily into the future because they are not fixed to a particular time frame in the past (Berry and Linoff, 2000). These three distributions are experimented on to determine the best distribution for the modelling.
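The partitioning procedure described above can be sketched in Python. The numbers below are toy-sized for illustration rather than the thesis's 923-example sets, and the function names are assumptions.

```python
import random

def make_partitions(fraud, legal, n_partitions, legal_per_partition, seed=0):
    """Randomly divide the legal examples into n_partitions sets and merge
    the full set of fraud examples into each one, so every partition shares
    the same fraud examples but holds different legal examples."""
    rng = random.Random(seed)
    legal = legal[:]
    rng.shuffle(legal)
    partitions = []
    for i in range(n_partitions):
        chunk = legal[i * legal_per_partition:(i + 1) * legal_per_partition]
        partitions.append(fraud + chunk)
    return partitions

# Toy demonstration of the 50/50 case: 5 fraud examples repeated across
# 3 partitions, each paired with 5 distinct legal examples.
fraud = [("fraud", i) for i in range(5)]
legal = [("legal", i) for i in range(15)]
parts = make_partitions(fraud, legal, n_partitions=3, legal_per_partition=5)
assert all(len(p) == 10 for p in parts)
```

Varying the number of fraud examples merged into each set (all 923, 615, or 396) yields the 50/50, 40/60 and 30/70 distributions.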

Figure 4.5: Dividing Data into 50/50 Distribution for Each Partition (1996 score data: 4083 examples; 1994 and 1995 training data: the same 923 fraud examples merged with each of eleven sets of 923 legal examples)

4.4 KEY ISSUES IN PREPARATION

When preparing the crime detection system to generate predictions, two conditions about the data must be satisfied (Berry and Linoff, 2000). First, relevant data must exist and be accessible. Access to some of the investigative data warehouses introduced in Section 1.1 is restricted to authorised investigation units. However, some relevant data can be bought from commercial data warehouses. Second, the past data must be a good predictor of the future data. This may not be the case because some significant external events cannot be captured in the data. For example, when the unemployment rate is higher, fraud is usually more prevalent. Domain experts, like the claims investigators with their experience and intuition, are crucial in harnessing better predictive power from the past data (Fayyad et al, 1996; Berry and Linoff, 2000). Their experience confirms the results of the data analysis and provides a better understanding of what really needs to be done. Their intuition guides data analysts to concentrate on particular data examples or attributes.

The use of data visualisations such as column graphs, bar graphs and pie charts (see Section 4.2.2) is essential to tease out the patterns in each attribute (Soukup and Davidson, 2002). They bring the sophistication and pattern recognition capabilities of the human mind to bear on analysing and choosing the attributes most relevant to the objectives. Although domain experts provide confirmation and good advice, the data analyst should not be too constrained by this background information. Insights, enhanced by visualisation techniques and tools, often come from letting the data speak for itself.

CHAPTER 5
IMPLEMENTING THE CRIME DETECTION SYSTEM: ACTION

Plain question and plain answer make the shortest road out of most perplexities. - Mark Twain, 1875, Life on the Mississippi

The great end of life is not knowledge but action. - Thomas H. Huxley

The action component of the crime detection system uses the crime detection method to mine the prepared data partitions. This chapter discusses three distinct questions: Which experiments generate the best predictions? Which are the best insights? How can these new models and insights be used in an organisation? Section 5.1 takes Phase Four as a guide to produce classification and clustering models. Section 5.2 makes use of Phase Five to assess the models' ability to achieve the business objectives, and Section 5.3 explains how Phase Six can deploy the best models within the organisation.

5.1 PHASE FOUR: MODELLING

This section implements the Phase Four tasks by using the crime detection method. This phase includes generating the experiment design, and building and assessing the models.

5.1.1 Generate the Experiment Design

Table 5.1 lists the ten experiments conducted. Experiments I, II and III were designed to determine the best training distribution under the cost model (see Section 4.1.2). Because the naive Bayesian algorithm is extremely time efficient, it was trained with 50/50, 40/60 and 30/70 distributions in the data partitions. Experiments IV and V then used the best training distribution for the backpropagation and C4.5 algorithms. Experiment VI used bagging to combine the main predictions and Experiment VII used stacking to combine all predictions. Experiment VIII proposed to bag the best classifiers determined by stacking. Experiments VI to VIII determine which ensemble mechanism produces the best cost savings. Experiment IX implemented the backpropagation algorithm on

unpartitioned data, which is the most commonly used technique in commercial fraud detection software. This experiment was then compared with the others described above. Experiment X produced the SOM clusters to understand the data, the main predictions, and the final predictions.

Table 5.1: The Experiments Plan

Experiment Number | Technique or Algorithm | Data Distribution
I | Naive Bayes | 50/50
II | Naive Bayes | 40/60
III | Naive Bayes | 30/70
IV | Backpropagation | Determined by Experiments I, II, III
V | C4.5 | Determined by Experiments I, II, III
VI | Bagging | -
VII | Stacking | -
VIII | Stacking and Bagging | -
IX | Backpropagation | 5/95
X | Self Organising Map | 5/95

Table 5.2 lists the eleven tests, labelled A to K, which were repeated in each of Experiments I to V. In other words, there are fifty-five tests in total for Experiments I to V. Each test consisted of training, testing, evaluation, and scoring. The score set was the same for all classifiers but the data partitions, labelled 1 to 11, were rotated. The overall success rate (see Section 2.4) denotes the ability of a classifier to provide correct predictions. The bagged overall success rates X and Z were compared to the averaged overall success rates W and Y. The main predictions for the first five experiments were obtained by bagging the eleven predictions on the score set, represented by Bagged Z in Table 5.2.

Table 5.2: The Tests Plan

Test (columns A to K, with an overall success rate summary column)
Training Set Partition: one of partitions 1 to 11, rotated per test
Testing Set Partition: one of partitions 1 to 11, rotated per test
Evaluation Partition: one of partitions 1 to 11, rotated per test
Evaluating Success Rate: one rate per test A to K; summary: Average W
Bagging Predictions: predictions from tests A to K; summary: Bagged X
Producing Classifier: one classifier per test A to K
Scoring Set Success Rate: one rate per test A to K; summary: Average Y
Bagging Main Score Predictions: predictions from tests A to K; summary: Bagged Z
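The rotation of partitions across the tests A to K can be sketched as follows. The exact rotation used in the thesis is not recoverable from the flattened table, so consecutive partition assignments are assumed here.

```python
def rotate_tests(n=11):
    """One train/test/evaluation partition assignment per test, rotating
    the 1-based partition indices so every partition plays each role.
    The consecutive rotation is an assumption, not the thesis's exact plan."""
    plans = []
    for t in range(n):
        train = t % n + 1
        test = (t + 1) % n + 1
        evaluate = (t + 2) % n + 1
        plans.append((train, test, evaluate))
    return plans

for label, plan in zip("ABCDEFGHIJK", rotate_tests()):
    print(label, plan)  # e.g. A (1, 2, 3), B (2, 3, 4), ...
```

Each of the eleven tests therefore trains on a different partition while the score set stays fixed, producing the eleven classifiers whose predictions are later averaged or bagged.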

5.1.2 Build the Models

Appendices L, M, and N illustrate the D2K and NeuroShell2 software interfaces and parameters with which the naive Bayesian, C4.5, backpropagation and SOM models are trained and scored. After performing Experiments I, II, and III, bagging was found to improve most of the success rates. Table 5.3 shows that the bagged success rates X outperformed all the averaged success rates W by at least 10% on the evaluation sets. When applied to the score set, the bagged success rates Z performed marginally better than the averaged success rates Y. Therefore, the bagged predictions were used for comparisons between all the experiments under the cost model (see Section 4.1.2).

Table 5.3: Bagged Success Rates versus Averaged Success Rates

Experiment Number | Average W | Bagged X | Average Y | Bagged Z
I | 71% | 85% | 12% | 11%
II | 65% | 80% | 67% | 70%
III | 68% | 87% | 74% | 76%

Experiments I, II, and III were also used to determine the best training fraud/legal distribution for the data partitions. Table 5.4 shows that Experiment II achieved higher cost savings than Experiments I and III. Therefore, Experiments IV and V were trained with the 40/60 data partitions.

Table 5.4: Determining the Best Training Set Distribution for the Data Partitions

Experiment Number | Cost Model
I | -$220,449
II | $94,734
III | $75,213
IV | -$6,488
V | $165,242

Table 5.4 displays the cost savings of Experiments II, IV and V, which were trained, tested, and evaluated with the same eleven 40/60 data partitions. The C4.5 algorithm achieved the highest cost savings, followed by the naive Bayesian algorithm. In contrast, the backpropagation algorithm increased the cost. With the cost model and the 40/60 training data partitions, C4.5 is the best learning algorithm for this particular automobile insurance data set.
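Bagging here is a majority vote over the eleven classifiers' predictions. A toy sketch (with made-up predictions, not the thesis's outputs) of why a bagged vote can beat the averaged individual success rates:

```python
from collections import Counter

def success_rate(predictions, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

def bagged(prediction_sets):
    """Majority vote across the classifiers' predictions, per example."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*prediction_sets)]

# Toy illustration with three classifiers on six examples.
truth = [1, 0, 1, 1, 0, 0]
preds = [[1, 0, 1, 0, 0, 1],
         [1, 1, 1, 1, 0, 0],
         [0, 0, 1, 1, 1, 0]]

average_rate = sum(success_rate(p, truth) for p in preds) / len(preds)
bagged_rate = success_rate(bagged(preds), truth)
print(average_rate, bagged_rate)  # the vote corrects each lone mistake
```

Because each classifier errs on different examples, the vote can be right even when a third of the voters are wrong, which mirrors the bagged X rates outperforming the averaged W rates.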

Table 5.5 shows the cost savings after combining the naive Bayesian, C4.5 and backpropagation predictions from Experiments II, IV, and V. The bagging of the three main predictions in Experiment VI and the stacking of all predictions in Experiment VII did not produce better cost savings than the C4.5 predictions in Experiment V. However, an interesting discovery was made: using stacking to choose the best classifiers' predictions for bagging produced predictions slightly better than those of the C4.5 algorithm. Appendix O lists the top fifteen, out of thirty-three, classifiers produced from stacking. There were nine C4.5, four backpropagation, and two naive Bayesian classifiers, and their predictions on the score data set were bagged. This combination of the top fifteen classifiers achieved better predictions than the combinations of the top five, top ten, or top twenty classifiers. This supports the notion of using different algorithms with stacking-bagging for any data set.

Table 5.5: Bagging versus Stacking versus Conventional Backpropagation

Experiment Number | Cost Model
VI | $127,454
VII | $104,887
VIII | $167,069
IX | $89,232

The backpropagation algorithm in Experiment IX was trained on the unpartitioned 1994 and 1995 data set, and scored on the 1996 data set. However, one fundamental limitation exists in the top five experiments, mainly due to the software used. According to Elkan (2001), any learning algorithm applied to skewed data sets is useful only when the algorithm gives numerical scores, or probabilities, to rank examples. Appendix P demonstrates that, unlike the backpropagation algorithm in the NeuroShell2 software, the naive Bayesian and C4.5 algorithms in the D2K software are unable to output scores. The importance of scores and thresholds for prioritisation in organisations is illustrated by Experiment IX.
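A hedged sketch of the stacking-bagging idea: rank the base classifiers by how highly the stacking stage rates them, keep the top k, and majority-vote their score-set predictions. The weight dictionary below stands in for the stacker's learned ranking and all names and values are illustrative, not the thesis's meta-learner.

```python
from collections import Counter

def stack_then_bag(classifier_preds, stack_weights, k):
    """Keep the k classifiers ranked highest by the stacker, then combine
    their score-set predictions by majority vote."""
    ranked = sorted(classifier_preds,
                    key=lambda name: stack_weights[name], reverse=True)
    vote_sets = [classifier_preds[name] for name in ranked[:k]]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*vote_sets)]

# Hypothetical predictions from four base classifiers on three examples,
# with hypothetical stacker weights.
preds = {"c45_1": [1, 0, 1], "c45_2": [1, 0, 0],
         "nb_1": [0, 0, 1], "bp_1": [0, 1, 1]}
weights = {"c45_1": 0.9, "c45_2": 0.7, "nb_1": 0.4, "bp_1": 0.2}
print(stack_then_bag(preds, weights, k=3))  # [1, 0, 1]
```

In the thesis's terms, k = 15 of 33 classifiers gave the best combination; the sketch only shows the selection-then-vote mechanics.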
In Appendix Q, the backpropagation algorithm assigns a score between 0.5 and 0 to each of the 4083 score data set instances. These are then ranked in descending order of fraud potential and

prioritised to be handled by five different types of claims handling action. Thresholds allow the best action to be taken to maximise investigation efficiency and cost savings (Magnify, 2002). Table 5.6 lists the claim handling actions and their corresponding thresholds. A claim is associated with an action if its rank, normalised by the total number of claims, is lower than that action's threshold and not lower than the preceding threshold. For example, if a claim is ranked 5th out of the 4083 examples, it is assigned 5/4083 = 0.001, which is below the lowest threshold, and therefore this claim is immediately sent for investigation. If another claim is ranked 4000th, it is assigned 4000/4083 = 0.98, which is less than 1 but above 0.9, and therefore this claim receives express payment.

Table 5.6: Claim Handling Actions and Thresholds

Claim Handling Action | Threshold
Automatic referral to investigation unit immediately |
Route to adjustor and notify investigation unit to actively monitor | 0.05
Notify adjustor of high fraud potential | 0.25
Routine claim handling | 0.9
Express claim payment | 1

Figure 5.1 shows that, by using sorted scores and predefined thresholds, investigations are concentrated on the instances with the highest probability of cost savings. Based on the cost model (see Section 4.1.2), to achieve the maximum cost savings of about $90,000, the investigations can be stopped after reviewing the top 27%, or 1087 examples (see Appendix Q). This satisfies Pareto's law: the minority of input (reviewing the high-risk claims) produces the majority of results (the highest cost savings).
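The rank-and-threshold routing described above can be sketched directly. The lowest threshold is not recoverable from the text, so 0.005 is assumed here; the other thresholds follow Table 5.6.

```python
# Claim handling actions in ascending threshold order; the first
# threshold (0.005) is an assumption, the rest are from Table 5.6.
ACTIONS = [
    (0.005, "automatic referral to investigation unit"),
    (0.05, "route to adjustor, notify investigation unit"),
    (0.25, "notify adjustor of high fraud potential"),
    (0.9, "routine claim handling"),
    (1.0, "express claim payment"),
]

def route(rank, total=4083):
    """Map a claim's rank (1 = highest fraud potential) to the first
    action whose threshold its normalised rank falls below."""
    fraction = rank / total
    for threshold, action in ACTIONS:
        if fraction <= threshold:
            return action
    return ACTIONS[-1][1]

print(route(5))     # ranked 5th -> immediate investigation
print(route(4000))  # ranked 4000th -> express claim payment
```

The same loop, applied to all 4083 ranked claims, reproduces the five groups whose fraud proportions are plotted in Figure 5.1.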

Figure 5.1: Prioritisation of Fraud Data Instances in Organisations (percentage of fraud examples in each group: automatic referral to investigation, 14% (3 of 21 examples fraudulent); route to adjuster and notify investigation unit, 13% (24 of 183); notify adjuster of high fraud potential, 11% (65 of 613); routine claim handling, 4% (119 of 2858); express claim payment, 0.5% (2 of 408))

In comparison with the score-based approach, decision tree rule induction from the Angoss KnowledgeSeeker IV software produces descriptive sentences about the instances with the highest probability of fraud (see Appendix R). Table 5.7 displays the top five rules, out of a total of nineteen, induced from the original symbolic training data. According to Elkan (2001), rules often suffer from a lack of interpretability. Many rule induction techniques produce an ordered set of rules in which each rule applies to an instance only if all of its predecessors in the ordering do not apply. Therefore, each rule is not valid in isolation but must be understood in combination. The information provided by the top five rules did not yield any insights. Domain knowledge on automobile insurance fraud (National White Collar Crime Center, 2003) already reveals that fraudsters usually buy policies with comprehensive coverage and that staged accidents mostly do not involve third parties. The highest ranked rule details claims made by 16 to 20 year olds as the highest risk, yet this piece of information is already available (see Section 4.2.3). In fact, most of the information in the five rules had already been discovered before mining the data (see Section 4.2.2). Therefore, the score-based approach is favoured over the rule-based approach.
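The ordered-rule semantics Elkan describes can be made concrete with a small sketch: a rule's conclusion only applies when every earlier rule failed to fire. Only the first rule from Table 5.7 is encoded, and the claim dictionary layout is an assumption for illustration.

```python
# An ordered rule list: (condition, conclusion) pairs, ranked. Only
# Rule 1 of Table 5.7 is shown; rules 2 to 19 would follow in order.
RULES = [
    (lambda c: c["AgeOfPolicyHolder"] in ("16 to 17", "18 to 20")
               and c["Fault"] == "Policy Holder"
               and c["BasePolicy"] == "All Perils",
     "FraudFound = No 69.8%, Yes 30.2%"),
    # ... remaining rules elided ...
]

def apply_rules(claim, rules, default="no rule applies"):
    """Return the conclusion of the first rule whose condition holds;
    later rules are only reached when all predecessors fail."""
    for condition, conclusion in rules:
        if condition(claim):
            return conclusion
    return default

claim = {"AgeOfPolicyHolder": "16 to 17", "Fault": "Policy Holder",
         "BasePolicy": "All Perils"}
print(apply_rules(claim, RULES))
```

This first-match evaluation is exactly why a rule cannot be read in isolation: its implicit precondition is the negation of every rule ranked above it.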

Table 5.7: Rule Induction on Training Data Set

Rule 1: IF AgeOfPolicyHolder = 16 to 17 or 18 to 20 AND Fault = Policy Holder AND BasePolicy = All Perils THEN FraudFound = No 69.8%, FraudFound = Yes 30.2% (19/63 examples)

Rule 2: IF AccidentArea = Rural AND AgeOfPolicyHolder = 21 to 25, 26 to 30, 31 to 35 or 36 to 40 AND Fault = Policy Holder AND BasePolicy = All Perils THEN FraudFound = No 73.9%, FraudFound = Yes 26.1% (36/138 examples)

Rule 3: IF AccidentArea = Urban AND AgeOfPolicyHolder = 21 to 25, 26 to 30, 31 to 35 or 36 to 40 AND Fault = Policy Holder AND BasePolicy = All Perils THEN FraudFound = No 83.5%, FraudFound = Yes 16.5% (176/1065 examples)

Rule 4: IF AgeOfVehicle = 2 years, 3 years, 4 years or 5 years AND NumberOfCars = 1 vehicle or 2 vehicles AND AgeOfPolicyHolder = 41 to 50, 51 to 65 or over 65 AND Fault = Policy Holder AND BasePolicy = All Perils THEN FraudFound = No 50.0%, FraudFound = Yes 50.0% (6/12 examples)

Rule 5: IF AgeOfVehicle = 6 years, 7 years, more than 7 years or new AND NumberOfCars = 1 vehicle or 2 vehicles AND AgeOfPolicyHolder = 41 to 50, 51 to 65 or over 65 AND Fault = Policy Holder AND BasePolicy = All Perils THEN FraudFound = No 90.3%, FraudFound = Yes 9.7% (77/793 examples)

Experiment X trained the SOM with three, five and seven output neurons, designed to allow three, five and seven clusters to form respectively. The visualisations determined the optimum number of descriptive clusters based on the intrinsic nature of the data. To facilitate the analysis of fraud after SOM training, the fraud instances were separated from the legal ones, and the main and final predictions were appended to them. Figure 5.2 highlights the best fifteen attributes for visualisation, ranked by input strength from Experiment IX.
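A minimal one-dimensional SOM sketch may help make Experiment X concrete. The learning rate, neighbourhood decay and epoch count below are assumptions for illustration; the thesis's actual SOM settings are in Appendix N and are not reproduced here.

```python
import math
import random

def train_som(data, n_clusters, epochs=50, seed=0):
    """Minimal 1-D SOM sketch: each output neuron holds a weight vector;
    the best matching unit and its neighbours move toward each input,
    with the learning rate and neighbourhood shrinking over time."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]
    for epoch in range(epochs):
        rate = 0.5 * (1 - epoch / epochs)
        radius = max(1.0, n_clusters / 2 * (1 - epoch / epochs))
        for x in rng.sample(data, len(data)):
            # Best matching unit: neuron with the closest weight vector.
            bmu = min(range(n_clusters),
                      key=lambda j: sum((w - v) ** 2
                                        for w, v in zip(weights[j], x)))
            for j in range(n_clusters):
                influence = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
                weights[j] = [w + rate * influence * (v - w)
                              for w, v in zip(weights[j], x)]
    return weights

def assign(x, weights):
    """Cluster index of the neuron closest to instance x."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))
```

Training runs with n_clusters set to 3, 5 and 7 correspond to the three cluster counts compared in Experiment X; the fraud instances would then be routed through assign() to build the per-cluster profiles.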

Figure 5.2: Attribute Input Strength from the Backpropagation Algorithm (attributes in descending input strength: fault, base_policy(3), age_price_wsum, number_of_cars, base_policy(1), make(3), number_of_suppliments, vehicle_category(1), witness_present, day_of_week, accident_area, month_claimed(2), agent_type, age_of_policy-holder, month(4), deductible)

Column graphs are used to build a descriptive profile of each cluster using the attributes in Figure 5.2. Appendix S contains three-cluster, five-cluster and seven-cluster column graphs showing the distribution of age in fraudulent examples. The three-cluster solution provided only one insight and the seven-cluster solution was difficult to analyse and interpret. Appendices S, T, and U group some of the fifteen attributes into three categories: the likely fraudster demographic profile, when they tend to strike, and the fraud characteristics. The first category consists of the attribute age_of_policyholder. The second category is made up of weeks_past and is_holidayweek_claim. The third category contains make, accident_area, vehicle_category, age_price_wsum, number_of_cars, and base_policy. The remaining attributes in Figure 5.2, such as fault, number_of_suppliments, witness_present, and agent_type, are omitted because they do not present any interesting patterns in the clusters.

5.1.3 Assess the Models

The number of fraudulent examples and the size of the training data set are too small. From a statistical view of prediction, 710 fraudulent training examples are too few to learn from with confidence. In the real world, much deeper knowledge and insights are derived from millions of training examples

than from a single data set of this size. The score data set of 4083 instances is also too small to objectively determine which learning algorithms are more accurate (Elkan, 2001). Consider the null hypothesis that an algorithm achieves p = 13% hits in the top 5% of predictions. On a randomly chosen score data set of 4083 examples, the top 5% contains n = 0.05 x 4083 = 204 predictions, so the expected number of hits is µ = pn = 27. The anticipated standard deviation of the number of correct predictions is σ = √(np(1 − p)) = 4.6. To reject the null hypothesis with a confidence of over 95%, an algorithm has to score more than two σ above or below the expected number, that is, fewer than 18 or more than 36 correct. To be significantly better than Experiment IX, the other experiments would have to attain at least 36 hits in their top 204 predictions. As the small score data set of 4000 instances in the CoIL Challenge (2000) data mining competition illustrates, it is very hard to differentiate the accuracy of learning algorithms.

The success rates of Experiment IV and Experiment V are compared using the paired Student's t-test with k − 1 degrees of freedom. The null hypothesis states that the backpropagation and C4.5 algorithms have the same success rate. On the score set, there are k = 11 runs, so

t = d̄√k / √( Σᵢ₌₁ᵏ (dᵢ − d̄)² / (k − 1) ) = 2.69,

where dᵢ is the difference in success rates on run i and d̄ = 18 is the mean difference. This is below the 99% critical value, so, with 99% confidence, the difference in success rates between the two algorithms, backpropagation and C4.5, using the 11-fold cross-validation method, is not statistically significant.

The success rates of Experiment II and Experiment V are compared using the paired Student's t-test with k − 1 degrees of freedom. The null hypothesis states that the naive Bayesian and C4.5 algorithms have the same success rate. On the score set, there are k = 11 runs, so

t = d̄√k / √( Σᵢ₌₁ᵏ (dᵢ − d̄)² / (k − 1) ) = 0.38,

where d̄ = 2 is the mean difference in success rates over the k = 11 runs. This is below the 99% critical value, so, with 99% confidence, the difference in success rates between the two algorithms, naive Bayesian and C4.5, using the 11-fold cross-validation method, is not statistically significant.

The data analyst has to rank the experiments using cost savings, due to the size limitations of the training and score data sets and the inability to compare experiments using success rates. Table 5.8 ranks Experiment VIII first with the highest cost savings, almost twice that of Experiment IX. The performance of the top five experiments against Experiment IX strengthens the case for training on partitioned data with the right fraud/legal distribution. The optimum success rate for the highest cost savings in this skewed data set is 60%; as the success rate increases, cost savings decrease.

Table 5.8: Ranking of Experiments using Cost Model

Rank | Experiment Number | Technique or Algorithm | Cost Savings | Overall Success Rate | Percentage Saved
1 | VIII | Stacking and Bagging | $167,069 | 60% | 29.71%
2 | V | C4.5 40/60 | $165,242 | 60% | 29.38%
3 | VI | Bagging | $127,454 | 64% | 22.66%
4 | VII | Stacking | $104,887 | 70% | 18.65%
5 | II | Naive Bayes 40/60 | $94,734 | 70% | 16.85%
6 | IX | Backpropagation 5/95 | $89,232 | | 15.87%
7 | IV | Backpropagation 40/60 | -$6,488 | 92% | -1.15%

The best naive Bayesian classifier A and the best C4.5 classifier B, ranked by stacking in Appendix O, are compared using McNemar's hypothesis test. The null hypothesis states that the naive Bayesian and C4.5 algorithms have the same success rate. On the score set, the two classifiers gave a total of 3741 predictions that the other did not get right: 1591 correct predictions are unique to A and 2150 correct predictions are unique to B, so

s = (|1591 − 2150| − 1)² / (1591 + 2150) = 83.2,

which exceeds the 99% critical value. With 99% confidence, the difference in success rates between the classifiers A and B is statistically

significant. This shows that Experiment VIII is robust because it chooses a wide range of classifiers with very different success rates to produce the best cost savings.

Table 5.9 lists the likely fraudster profile, operating times and characteristics of the 923 fraudulent instances in each cluster from the years 1994, 1995, and 1996. There are 115, 72, 175, 94, and 159 fraudulent instances from 1994 and 1995 in Clusters 1, 2, 3, 4, and 5 respectively, which were already scored by the algorithms. The interest is in the 100, 94, 93, 9, and 12 fraudulent instances from 1996 in Clusters 1, 2, 3, 4, and 5 respectively, which were not yet scored by the algorithms. This shows that in 1996, Clusters 1, 2, and 3 have higher occurrences of fraud than Clusters 4 and 5. Table 5.9 also displays the descriptive cluster profiles for interpretation against the available domain knowledge. Clusters 1, 3, and 5 consist of several makes of inexpensive cars, which indicates signs of soft fraud. In these clusters, the potential for high-cost claims from utility vehicles, rural areas, and liability policies is relatively low. In contrast, both Clusters 2 and 4 contain claims which are often submitted many weeks after the alleged accidents. Here, the chances of high-cost fraudulent claims involving one specific make of car, sports cars, and multiple policies are extremely high.

Table 5.9: Descriptions of Existing Fraud

Cluster 1 (215 instances) contains a large number of 21 to 25 year olds. The insured vehicles are relatively new.
Cluster 2 (166 instances) also contains a large number of 21 to 25 year olds. The claims are usually reported 10 weeks after the accident. The insured vehicles are usually sports cars.
Cluster 3 (268 instances) has almost all 16 to 17 year old fraudsters. The insured vehicles are mainly Acuras, Chevrolets, and Hondas, and are usually utility cars.
Cluster 4 (103 instances) has claims usually reported 20 weeks after the accident.
Almost all the insured cars are Toyotas and the fraudster has a high probability of having 3 to 4 cars insured. Claims are unlikely to be submitted during holiday periods.
Cluster 5 (171 instances) consists of mainly Fords, Mazdas, and Pontiacs. There are higher chances of rural accidents and the base policy type is likely to be liability.

Statistical evaluation of the descriptive cluster profiles is required. With the information provided by Cluster 4, the investigation focuses on the 3121 claims for Toyota cars, of which 6%, or 187, were fraudulent. Within this group, there were 2148 claims for Toyota sedan cars. If Toyota sedan cars were not highly prone to fraud, 6%, or 129, of these claims could be expected to be fraudulent, with σ = √(129 × (1 − 0.06)) = 11. In fact, 171 claims for Toyota sedan cars are fraudulent. The z-score of this discrepancy is (171 − 129)/11 = 3.8σ. This is an insight because the information is statistically reliable, not known previously, and actionable: there must be closer monitoring of Toyota sedan cars by investigation units. Table 5.10 displays the other insights determined from the descriptive cluster profiles. The discovery of 21 to 25 year olds who use sports cars for fraud, with a standard deviation of 9.5σ, is the most statistically reliable insight.

Table 5.10: Statistically Strong Indicators of Fraud

Cluster | Group Claims | Sub-Group Claims | z-score
1 | All claims (6% fraudulent) | 21 to 25 year olds |
2 | Sport cars (1.6%) | 21 to 25 year olds | 9.5 σ
3 | 16 to 17 year olds + sport cars (9.7%) | Honda + 16 to 17 year olds (31 claims, 3 expected fraudulent) |

Appendix V shows the analysis of the 615 fraudulent instances from the years 1994 and 1995, with the main predictions from the three algorithms and the final predictions from the bagging technique appended (Experiment VI). The two column graphs reflect the capability of the algorithms to cope with evolving fraud. The first graph shows the lift of the 25 fraudulent instances, which cannot be detected by any of the algorithms, within the five clusters. 7 and 9 of these instances are in Clusters 1 and 2 respectively, signifying that these can be new modus operandi patterns of fraud allegedly committed by 21 to 25 year olds who possibly insure sports cars. There are no undetectable instances in Cluster 3.
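The binomial z-score test applied to the Toyota sedan group above can be written directly; the function name is an illustrative choice.

```python
import math

def fraud_z_score(group_claims, group_frauds, base_rate):
    """Compare the observed number of fraudulent claims in a sub-group
    against the number expected under the base fraud rate, measured in
    standard deviations of the binomial count."""
    expected = base_rate * group_claims
    sigma = math.sqrt(expected * (1 - base_rate))
    return (group_frauds - expected) / sigma

# The Toyota sedan figures from the text: 2148 claims, 171 fraudulent,
# against the 6% overall fraud rate.
z = fraud_z_score(2148, 171, 0.06)
print(round(z, 1))  # 3.8
```

The same function, run over any cluster-derived sub-group, reproduces the kind of z-scores summarised in Table 5.10.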
This indicates that the cluster's fraud patterns can be detected by the algorithms because the fraud pattern of claims made by 16 to 17 year olds for Honda cars has already been learned. Clusters 1 and 2

demonstrate that not all instances of fraud can be detected by the algorithms. More importantly, domain knowledge, cluster detection and statistics offer a possible explanation for the poor performance of the classifiers on certain examples or instances.

The second graph shows the lift of the 101 fraudulent instances, which cannot be detected by two of the algorithms, within the five clusters. This highlights the weakness of using bagging in Experiment VI. 37 and 26 of these instances are in Clusters 1 and 2, indicating that the choice of technique or algorithm is an important factor for fraud detection in these two clusters. To overcome this, either stacking-bagging in Experiment VIII or the C4.5 algorithm in Experiment V is recommended.

The purpose of the following is to assess the three initial hypotheses on the data set by exploring the individual attributes (see Section 4.2.2), and to demonstrate the importance of pre-mining and post-mining examination of attributes. Figure 5.3 shows that the number of fraudulent claims did drop for the first few months of 1996. After three quiet months, there was an increase in fraudulent claims from April 1996 to November 1996. The first hypothesis is supported.

Figure 5.3: Claim Trends by Month for 1996

Figure 5.4 shows that there is no outlier vehicle age category in the year 1996 as had been anticipated. However, 3 to 6 year old vehicles still have a higher proportion of fraud than the overall average of 6%.

Figure 5.4: Proportion of Fraud within a Vehicle Age Category for 1996

The third hypothesis surmises that new patterns of fraud are likely to come from younger fraudsters between 16 and 25 years of age. This is confirmed by the dominant Clusters 1, 2, and 3 in Table 5.9.

5.2 PHASE FIVE: EVALUATION

This section represents the tasks in Phase Five, which make sure that any problems with the models are resolved before their deployment. The tasks involve evaluating the results and reviewing the process. For this project, the next step after evaluation is to stop further iterations due to time constraints and move on to deployment.

5.2.1 Evaluate the Results

The maximum possible cost savings is $562,340, which occurs when the models achieve all hits and normals for the data set. To achieve the investigation objective, the models must achieve cost savings

of $168,702. Table 5.8 shows that the predictions from Experiment VIII almost achieved this target of thirty percent. Table 5.9 highlights the important descriptions of fraud within the clusters. Table 5.10 identifies the statistically reliable insights in the form of Toyota sedan cars, 21 to 25 year olds who drive sports cars, and 16 to 17 year olds who drive Honda vehicles. Therefore, the project has met its initial business objectives.

5.2.2 Review the Process

There are many important factors that could have improved the prediction results or the quality of the project. The most critical ones, listed in descending priority, suggest that:
- Unsupervised learning can be used to derive clusters, followed by supervised learning to generate predictions. The results of this approach can be compared with the crime detection method, which utilised only supervised learning for prediction generation.
- Instead of one training partition for each test within the experiments, eight partitions of data can be used. Generally, more training data increases the chances of producing better classifiers.
- More skewed distributions, such as 20/80, 10/90, and 1/99 fraud/legal distributions, can be experimented on to evaluate the effectiveness of the techniques under extreme conditions (Chan et al, 1999).
- The cost model is too simple. It does not take into account the litigation costs of taking fraudulent claims to court (National White Collar Crime Center, 2002) or the salary costs of the data analyst.
- The Probabilistic Neural Network (PNN) provides a good alternative to the Backpropagation Neural Network (BNN) on skewed data sets (Phua, 2003a).

5.3 PHASE SIX: DEPLOYMENT

This section shows how Phase Six guides the use of the models within the organisation. The two key tasks of this phase are planning the deployment and planning the monitoring and maintenance. The final report and project review, two additional tasks in the CRISP-DM reference model, are not discussed in this section: Chapters 4 and 5 present the final report, and Chapter 6 is the review of the project.

5.3.1 Plan the Deployment

Distributed data mining (DDM) is one of the top trends in data mining (Grossman, 2001; Hsu, 2002). DDM helps organisations gain competitive advantage by dealing with the exponentially growing volumes of geographically dispersed data. Data can be inherently distributed and impossible to combine into one database for a variety of practical reasons. For example, crafting a single monolithic data set from physically distributed databases for data mining is problematic (Provost, 2000), as it consumes a great deal of time and hard disk space. In addition, competitive and legal concerns prohibit organisations from sharing sensitive data (Stolfo et al, 1997). DDM techniques aim to scale up to massive databases (Kargupta and Chan, 2000) by ensuring shorter computation times and better accuracy. One existing method, known as loosely coupled DDM, is a black-box technique which partitions the data to create multiple models in different data sites, which are then combined into a single model (see Sections 3.2, 4.3.5, and 5.1.3). To improve comprehensibility, Sections 3.3 and introduced the use of cluster detection and statistics to understand the predictions. Other related research into understanding black-box predictions, such as Domingos (1997) and Guo and Sutiwaraphun (1998), used the predictions of the combining technique as training examples to produce a graphical decision tree.
According to Provost (2000), one disadvantage of loosely coupled DDM lies in its use of data partitions, which can reduce the predictive accuracy compared to that achieved by running one algorithm on all the data.
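The loosely coupled DDM scheme described above — one model per data site, combined into a single predictor — can be sketched as follows. The per-site "model" here is a toy nearest-centroid classifier standing in for whatever algorithm each site actually runs; the data and site layout are illustrative.

```python
import statistics

def train_site_model(examples):
    """Train a toy per-site model: the centroid of each class's feature
    vectors. Stands in for the algorithm each data site runs locally."""
    centroids = {}
    for label in {c for _, c in examples}:
        rows = [x for x, c in examples if c == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(model, x):
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], x))

def combined_predict(models, x):
    """Loosely coupled combination: each site's model votes, majority wins."""
    votes = [predict(m, x) for m in models]
    return statistics.mode(votes)

# Three 'sites', each holding its own partition of (features, class) examples.
sites = [
    [([0.1, 0.2], "legal"), ([0.9, 0.8], "fraud")],
    [([0.2, 0.1], "legal"), ([0.8, 0.9], "fraud")],
    [([0.0, 0.3], "legal"), ([1.0, 0.7], "fraud")],
]
models = [train_site_model(s) for s in sites]
```

The sites never exchange raw data, only their fitted models, which is what makes the coupling "loose".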

The time taken to collect data, transform data into inputs, train and score models, and deploy output must be taken into account (Berry and Linoff, 2000). For example, to predict fraudulent activity in January 1996, December 1995 data is unlikely to be ready at the time of model building. Instead, models can be trained on December 1994 to November 1995 data and scored with December 1995 data. The adoption of agent-based DDM allows the parallel execution of the data mining process at the numerous crime data source locations. Phua (2003b) recommends combining the strengths of existing agent-based DDM systems, such as Java Agents for Meta-learning (JAM) (Stolfo et al, 1997), Kensington (Guo et al, 1997), Papyrus (Grossman et al, 1999), and Besizing Knowledge through Distributed Heterogeneous Induction (BODHI) (Kargupta et al, 1999), for fraud detection.

5.3.2 Plan the Monitoring and Maintenance

The shelf-life of models and predictions is determined by the rate of change in the external environment and the organisation's requirements. As a rough guide, models have to be rebuilt when the cost savings fall below a certain percentage of the maximum possible.
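The train/score timing constraint described above (training on December 1994 to November 1995 data, scoring December 1995) amounts to a rolling window over months; a minimal sketch:

```python
def rolling_window(months, train_len=12):
    """Yield (training_months, score_month) pairs: each score month is
    predicted by a model trained on the train_len months preceding it."""
    for i in range(train_len, len(months)):
        yield months[i - train_len:i], months[i]

months = [f"{y}-{m:02d}" for y in (1994, 1995) for m in range(1, 13)]
pairs = list(rolling_window(months))
train, score = pairs[-1]   # train on 1994-12 .. 1995-11, score 1995-12
```

This keeps the scored month strictly outside the training window, mirroring the data-availability constraint.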

CHAPTER 6

The future belongs to those who believe in the beauty of their dreams. - Eleanor Roosevelt ( )

CONCLUSION

The main focus of this thesis is Investigative Data Mining in Fraud Detection; in other words, the use of the best data mining methods and techniques to combat white-collar crime. The term investigative refers to the official attempt to extract some truth, or insights, about criminal activity from the data. To do so, this dissertation first explored existing fraud detection methods and evaluated the supervised, unsupervised, score-based, and rule-based approaches to determine which ones are more important. Here, the original idea is to use all four approaches to complement one another. The recommended crime detection method uses these four approaches within a framework inspired by Minority Report (Dick, 1956). The main idea is to transform the black-box approach of classification into a semi-transparent state using cluster detection and visualisations. Also, the choice of three classification techniques, and their naive Bayesian, C4.5, and backpropagation algorithms, is justified. The predictive power of the classification models, or classifiers, can be evaluated with the paired Student's t-test with k-1 degrees of freedom and McNemar's hypothesis test. This research formulated four consecutive steps in the crime detection method to predict and describe an instance as either fraud or legal. First, the classifiers generate predictions. Next, the predictions are combined to improve the main and final predictions. Third, the data is clustered and the main and final predictions are appended. Last, graphs, scores, and rules are produced for analysis and interpretation by the data analyst.
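The two hypothesis tests mentioned above can be computed directly. This sketch produces the raw statistics only, to be compared against t and chi-square critical values from standard tables; the fold accuracies and disagreement counts are illustrative numbers, not the thesis's results.

```python
import math
import statistics

def paired_t_statistic(acc_a, acc_b):
    """Paired Student's t statistic over k folds; compare against a t table
    with k - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    k = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(k))

def mcnemar_statistic(b, c):
    """McNemar's chi-square statistic (with continuity correction) on the
    disagreements: b instances only classifier A got right, c only B."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Illustrative ten-fold accuracies for two classifiers.
acc_a = [0.71, 0.69, 0.73, 0.70, 0.72, 0.68, 0.74, 0.71, 0.70, 0.72]
acc_b = [0.66, 0.67, 0.69, 0.65, 0.68, 0.66, 0.70, 0.67, 0.66, 0.68]
t = paired_t_statistic(acc_a, acc_b)      # compare at k - 1 = 9 d.f.
chi2 = mcnemar_statistic(40, 22)          # compare at 1 d.f.
```

Both statistics compare the classifiers pairwise on the same data, which is what makes them appropriate for judging one classifier against another rather than against chance.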

The CRISP-DM methodology offers a systematic and comprehensive approach to using the crime detection method in automobile insurance fraud detection with six phases. This study separated the crime detection system into two main components, each consisting of three phases, to mine the automobile insurance data set: the preparation of data with respect to objectives, and the actual mining of data, which leads to assessment and deployment.

This paragraph describes the preparation component. In Phase One: Problem Understanding, the most important tasks are to determine the investigation and data mining goals, define the cost model, and specify the time, techniques, and tools required. In Phase Two: Data Understanding, the most crucial tasks are to know the available data, formulate hypotheses by exploring the data, and highlight the data limitations. In Phase Three: Data Preparation, the most essential tasks are to choose relevant attributes, clean and format dirty values, construct derived attributes, and partition the data into non-skewed distributions.

This paragraph details the action component. In Phase Four: Modelling, the empirical evaluation of the crime detection system aims to find the best predictions, descriptions, and ways to apply them in an organisation. To do so, a significant number of models are generated from four algorithms and two ensemble mechanisms on the training data set, the score data set, and thirty-three data partitions of three different distributions. The four algorithms are the naive Bayesian, C4.5, backpropagation, and SOM; the two ensemble mechanisms are bagging and stacking. All classifier predictions are evaluated by a cost model, and the important cluster descriptions are verified by z-scores. The score-based approach is preferred over the rule-based approach.
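The combination of stacking and bagging can be sketched as below. Stacking is simplified here to selecting the best-ranked base classifiers on held-out data, a stand-in for training a full meta-classifier; the prediction vectors are illustrative.

```python
from collections import Counter

def stack_then_bag(base_preds, validation_labels, test_preds, top_k=3):
    """Rank base classifiers by held-out accuracy (the stacking stage,
    simplified to selection), then combine the top-k classifiers' test
    predictions by majority vote (the bagging stage)."""
    def accuracy(preds):
        return sum(p == t for p, t in zip(preds, validation_labels)) / len(validation_labels)
    ranked = sorted(range(len(base_preds)),
                    key=lambda i: accuracy(base_preds[i]), reverse=True)
    chosen = [test_preds[i] for i in ranked[:top_k]]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*chosen)]

# Four base classifiers' predictions on a validation set and a test set.
validation_labels = ["fraud", "legal", "fraud", "legal"]
base_preds = [["fraud", "legal", "fraud", "legal"],   # 100% accurate
              ["fraud", "legal", "legal", "legal"],   # 75%
              ["legal", "legal", "fraud", "legal"],   # 75%
              ["legal", "fraud", "legal", "fraud"]]   # 0%, excluded from the vote
test_preds = [["fraud", "legal"], ["fraud", "fraud"],
              ["legal", "legal"], ["legal", "fraud"]]
final = stack_then_bag(base_preds, validation_labels, test_preds)
```

Dropping the worst base classifier before voting is what allows the combination to outperform a plain majority vote over all classifiers.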
In Phase Five: Evaluation, the extensive experimentation revealed that the highest cost savings is achieved by using stacking to determine the best classifiers and then bagging their predictions. The most statistically significant insight identifies 21 to 25 year olds who drive sport cars as the new fraudsters. In Phase Six: Deployment, the use of loosely coupled distributed data mining is recommended to generate predictions for geographically different data sources. The critical issues concerning

availability of recent data and the need to simultaneously compute multiple models for deployment are briefly discussed. The models and predictions have to be regenerated once a certain percentage of the possible cost savings is not achieved.

6.1 RECOMMENDATIONS

This thesis proposes the following solutions, already discussed in the previous chapters, to overcome the technical and practical problems of data mining in fraud detection.

6.1.1 Overcome the Technical Problems

Inadequate data: Statistical evaluation and confidence intervals, such as the paired Student's t-test with k-1 degrees of freedom and McNemar's hypothesis test, are needed to evaluate the algorithms and their classifiers for this fundamental problem. The preparation component of the crime detection system uses the three phases of problem understanding, data understanding, and data preparation to ensure every important step in preparing the data for mining is followed. Derived attributes are constructed from other existing attributes to increase the success rate of algorithms on the data. Cross validation randomly divides the data into partitions to improve the average success rate of predictions. Bagging utilises the majority vote from algorithms to increase the success rate of the final predictions. Stacking uses a meta-classifier to determine the best classifiers for producing the best final predictions from the data.

Highly skewed data: Partitioning the data with the most appropriate distribution transforms highly skewed distributions into more evenly balanced ones for training and evaluating classifiers. The cost model provides a better evaluation metric than success rates on skewed data.
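The cross-validation partitioning described above can be sketched as follows; the fold count and seed are illustrative choices.

```python
import random

def k_fold_partitions(instances, k=10, seed=7):
    """Shuffle the data, deal it into k partitions, and yield
    (training, evaluation) pairs in which each partition serves
    exactly once as the evaluation set."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]

claims = list(range(100))          # stand-ins for claim records
splits = list(k_fold_partitions(claims))
```

Averaging a classifier's success rate over the k evaluation folds gives the more reliable estimate the text refers to.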

The Laplace estimator reduces the chances of overfitting on skewed data sets for the naive Bayesian algorithm, just as subtree-raising postpruning does for the C4.5 algorithm and early stopping of training does for the backpropagation algorithm. Statistical evaluation and confidence intervals, such as z-scores, evaluate the significance of insights.

Black-box predictions: Classification visualisation, such as the confusion matrix, naive Bayesian visualisation, and attribute contribution column graphs, facilitates the analysis of classifier predictions. After applying the SOM to the data, clustering visualisation such as column graphs indicates which attribute value is more significant in a particular cluster. Sorted scores and predefined thresholds direct investigation to the instances which have the highest fraud risk. Rules provide logical explanations for better interpretation.

6.1.2 Alleviate the Practical Problems

Lack of domain knowledge: The action component of the crime detection system uses the three phases of modelling, evaluation, and deployment to ensure maximum effort is exerted to extract insights from the data. The domain experts, if available, provide experience and intuition to confirm findings and to guide data analysts. An extensive literature review of the types of fraud pertinent to data mining and the three main types of fraudsters was conducted. In addition, findings, results, and solutions provided by existing fraud detection research, from both academic and commercial communities, provided useful information.
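The Laplace estimator mentioned above replaces the raw frequency estimate with a smoothed one, so that an attribute value never seen with a class does not force the naive Bayesian product to zero; a minimal sketch:

```python
def laplace_estimate(count, class_total, n_values, k=1):
    """Laplace-corrected probability estimate for naive Bayes:
    (count + k) / (class_total + k * n_values), where n_values is the
    number of possible values the attribute can take."""
    return (count + k) / (class_total + k * n_values)

# An attribute value never observed among 40 fraud examples, 3 possible values:
smoothed = laplace_estimate(0, 40, 3)
```

On skewed data, where the minority class has few examples, zero counts are common, which is why the correction matters most there.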

Great variety of fraud scenarios over time: The SOM transforms multi-dimensional data into two-dimensional clusters which contain similar data, enabling the data analyst to easily differentiate the groups of fraud. The SOM also allows the data analyst to assess the algorithms' ability to cope with evolving fraud. The crime detection method provides a flexible step-by-step approach to generating predictions from any three different algorithms, and uses some form of ensemble mechanism to increase the likelihood of correct final predictions. The naive Bayesian, C4.5, and backpropagation algorithms complement one another, and they compute predictions in very different but effective ways to detect fraud.

Assessing the data mining potential: The quality and quantity of data affect whether data mining in fraud detection can be successful. Investigative data warehouses are restricted to authorised investigation units, but commercial data warehouses do allow fee-based access. The cost model adds up the cost of hits, false alarms, misses, and normals to evaluate classifier predictions, with the objective of finding the lowest cost. The z-score results use basic statistics to evaluate cluster descriptions, with the objective of determining the highest standard deviation.

6.2 FUTURE RESEARCH DIRECTIONS

This final section concludes with a discussion of two interesting and practical research projects which can directly or indirectly benefit from this thesis. Their purpose and proposed methods and techniques are described, and the possible contributions from this thesis are also offered.

6.2.1 Credit Application Fraud

This probable research project aims to prevent the crime of the future (Abagnale, 2001) - identity fraud - from manifesting in approved credit applications (FairIsaac, 2003a). The proposed approach

is to systematically examine multiple data sources, such as credit application, credit bureau, white pages, and electoral data, using hybrid intelligent techniques. The SOM, a form of unsupervised learning, is used for cluster detection. The ANN classifiers, a form of supervised learning, are used to generate predictions from each cluster. All the predictions are then combined by a cost-sensitive, weight-updating genetic algorithm. Both the credit application fraud project and this dissertation share the common purpose of fighting crime, and encounter the same problems when applying data mining methods and techniques. This thesis offers a supervised learning approach using three different classification algorithms, as an alternative to ANN classifiers, to generate predictions from each cluster. It also advocates approaches from machine learning research to combine predictions, such as bagging, stacking, and boosting. Cost-sensitive algorithms such as the MetaCost algorithm (Zadrozny and Elkan, 2001) and the AdaCost algorithm (Chan et al, 1999) are recommended.

6.2.2 Mass Casualty Disaster Management

This possible research project's main objective is to reduce rule-based decision-making errors (Reason, 1990) in managing unavoidable mass casualty disasters. The proposed EPC framework offers the capability to document as-is and simulate to-be organisational work processes to diminish the errors, and data mining is used to increase its decision-making capabilities. The proposed role of ANNs is to reduce the frequency and risk-creating potential of rule-based errors by discovering statistically reliable patterns in the system parameters of discrete-event simulation models. This thesis suggests the use of other data mining techniques such as BBNs and decision trees. It also highlights the possibility of using Bayesian reasoning to manage uncertainty in applying the rules (Negnevitsky, 2002).

APPENDIX A GLOSSARY

KEY DATA MINING TERMS

Algorithm - The step-by-step details of a particular way of implementing a technique: for example, Self-Organising Map, k-means
Approach - A way of training and getting results: for example, supervised learning, rules generation
Attribute (field, variable, feature, column) - A single characteristic of an example
Cluster - A set of instances grouped together because of similarity
Example - An instance labelled by a class label
Evaluation set - Data outside the model set used to assess the expected model accuracy
Instance (tuple, record, case, row) - A data record that bears no class label (target attribute)
Method - The use of more than two techniques in sequence or in parallel on the same data: for example, automatic cluster detection followed by classification on the different clusters
Score set - Data used to produce predictions, usually by the best model
Technique - A conceptual approach to extracting information from data: for example, automatic cluster detection
Testing set - Data used by algorithms to refine the model
Training set - Data used to build a model, although the resulting model may be overfitted

KEY CRISP-DM TERMS

CRISP-DM methodology - The general term for all concepts developed and defined in CRISP-DM
Model (classifier) - An executable that is applied to a data set to predict the class label
Phase - The high-level term for a part of the CRISP-DM reference model, consisting of related tasks
Reference model - The decomposition of data mining projects into phases, tasks, and outputs
Task - A series of activities to produce one or more outputs; part of a phase
Output - The tangible result of performing a task

KEY AUTOMOBILE INSURANCE TERMS

Claim cost - The average amount paid in advance in insurance claims for each vehicle
Collision - Pays for repairs to the policyholder's car, regardless of who was responsible
All-perils - Pays for repairs to the policyholder's car, including theft, fire, hail, vandalism, and others
Liability - Pays for legal defence costs and claims if the car injures or kills another person, and for damages to the other car(s) or property; it does not pay for repairs to the policyholder's car
Rating - The price of the policy, based on the claims history of all the drivers the insurer covers in a particular territory, and other factors such as age and driving record
Deductible - The monetary amount policyholders must pay before collecting claims from the insurance company

APPENDIX B GENERIC DATA MINING PROCESS*

[Process diagram, redrawn as a list:]
1. Problem and Data Understanding
2. Data Selection (Data, producing Target Data)
3. Data Preprocessing (producing Preprocessed Data)
4. Data Transformation (producing Transformed Data)
5-6. Selection and Application of Data Mining Task and Algorithm
7. Application of data mining algorithm(s) (producing Patterns)
8. Knowledge Interpretation and Evaluation (producing Knowledge)
9. Consolidation of discovered knowledge, with refinement through modifications

*Source: based on Fayyad et al (1996, p10)

APPENDIX C TAXONOMY OF FRAUD

[Taxonomy diagram, summarised:]
- Offender types: average offender (soft fraud); criminal offender and organised crime offender (hard fraud).
- Fraud targets: individuals, organisations, and governments. Fraud against individuals includes investment fraud, Internet fraud, and identity theft; fraud against governments includes defence fraud, health care fraud, and tax fraud.
- Fraud in Australian companies (including 9 insurance companies): AUD$273M (about USD$178M) loss; USD$0.5M average loss per company.
- Fraud in Singapore companies, 2002: about 135 companies; SGD$49M (about USD$28M) loss; USD$0.2M average loss per company.
- Perpetrator breakdown across internal management, non-management employees, and external parties, with associated percentages and losses: 51%, 47%, 16%, 35%, USD$59M (33%), USD$8M (28%).
- Fraud types against organisations: insurance fraud USD$21M or AUD$32M (35%), credit card fraud (28%), cheque forgery (23%), telecommunications fraud (10%), theft of inventory/plant, other (1%).
- Insurance fraud subtypes: automobile, arson, workers compensation, health care, travel, household contents.

APPENDIX D TRAINING SET

[Table of twenty 1996 training examples over the attributes sex, fault, driver_rating, number_of_suppliments, and class; the attribute values did not survive transcription. Eighteen examples are labelled legal and two (examples 8 and 12) are labelled fraud. A further 1996 instance, X (21), carries no class label; its class, shown as ? (legal), is to be predicted.]

APPENDIX E TRAINING SET DERIVED ATTRIBUTES

[The same twenty 1996 training examples expressed over the derived attributes is_holidayweek_claim, fault, driver_rating, and age_price_wsum, plus the class; the attribute values did not survive transcription. Two further 1996 instances, Y (21) and Z (22), carry no class label; their classes, shown as ? (legal), are to be predicted.]

APPENDIX F TRAINING SET MISSING VALUES

[The same twenty 1996 training examples over is_holidayweek_claim, fault, driver_rating, age_price_wsum, and class, with missing values denoted by '?' scattered across the attribute columns; the remaining attribute values did not survive transcription. The two unlabelled 1996 instances, Y (21) and Z (22), are repeated.]

APPENDIX G NAIVE BAYESIAN ALGORITHM*

Step 1. Assume that there are two output classes, fraud and legal, and an instance $X = (x_1, x_2, \ldots, x_n)$ with attributes $A_1, A_2, \ldots, A_n$. Maximise the probability of the two classes based on Bayes' theorem using
$$P(fraud \mid X) = \frac{P(X \mid fraud)\,P(fraud)}{P(X)} \quad\text{and}\quad P(legal \mid X) = \frac{P(X \mid legal)\,P(legal)}{P(X)}.$$

Step 2. The prior probability is estimated as
$$P(fraud) = \frac{s_i}{s},$$
where $s_i$ is the number of training examples of class fraud and $s$ is the total number of training examples.

Step 3. In order to reduce the computation in evaluating $P(X \mid fraud)$, the naive assumption of no dependence relationships among the attributes is made. Thus
$$P(X \mid fraud) = \prod_{k=1}^{n} P(x_k \mid fraud).$$
The probabilities $P(x_1 \mid fraud), P(x_2 \mid fraud), \ldots, P(x_n \mid fraud)$ can be estimated from the training examples using
$$P(x_k \mid fraud) = \frac{s_{ik}}{s_i},$$
where $s_{ik}$ is the number of training examples of class fraud having the value $x_k$ for $A_k$, and $s_i$ is the number of training examples belonging to class fraud.

Step 4. For $P(X \mid legal)$, the computation is the same as in Step 3.

Step 5. Only $P(X \mid fraud)P(fraud)$ and $P(X \mid legal)P(legal)$ need to be maximised, as $P(X)$ is constant for the two classes. Therefore, the classifier predicts that $X$ belongs to the class with the highest posterior probability conditioned on $X$; that is, fraud when
$$P(X \mid fraud)\,P(fraud) > P(X \mid legal)\,P(legal),$$
and vice versa.

*Source: adapted from Witten and Frank (1999, pp82-89) and Han and Kamber (2000, pp )
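Steps 1 to 5 above can be sketched in code; raw frequency estimates are used here (without the Laplace correction) for clarity, and the tiny training set is illustrative.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Estimate P(class) and P(x_k | class) from (attributes, class) pairs,
    following Steps 1-4."""
    class_counts = Counter(c for _, c in examples)
    value_counts = defaultdict(Counter)
    for attrs, c in examples:
        for k, v in enumerate(attrs):
            value_counts[c][(k, v)] += 1
    return class_counts, value_counts, len(examples)

def classify(model, attrs):
    """Step 5: predict the class maximising P(X | class) P(class)."""
    class_counts, value_counts, n = model
    def score(c):
        p = class_counts[c] / n                      # prior P(class)
        for k, v in enumerate(attrs):
            p *= value_counts[c][(k, v)] / class_counts[c]
        return p
    return max(class_counts, key=score)

examples = [((1, 0), "fraud"), ((1, 1), "fraud"),
            ((0, 0), "legal"), ((0, 1), "legal"), ((0, 0), "legal")]
model = train_naive_bayes(examples)
```

The per-attribute products make the independence assumption of Step 3 explicit in the code.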

APPENDIX H C4.5 ALGORITHM*

Step 1. Assume that there are two output classes, fraud and legal. The tree starts as a single node N representing the training examples. If the examples are all of class fraud, the node becomes a leaf labelled fraud; the same holds if the class is all legal.

Step 2. If the examples are not of the same class, the information gain measure points out the attribute with the highest gain, A, to separate the examples into individual classes. With $p$ fraud examples and $n$ legal examples, the expected information needed to classify a given example is
$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}.$$

Step 3. The entropy, or expected information based on the partitioning into subsets by a test attribute A whose values split the examples into subsets containing $p_j$ fraud and $n_j$ legal examples, is
$$E(A) = \sum_j \frac{p_j + n_j}{p + n}\, I(p_j, n_j).$$
The smaller the entropy value, the greater the purity of the subset partitions.

Step 4. The expected reduction in entropy caused by knowing the test attribute A is
$$Gain(A) = I(p, n) - E(A).$$
The attribute with the highest gain is the test attribute.

Step 5. A branch is created for each known value of the test attribute, and the examples are partitioned accordingly. The algorithm uses the same process iteratively to form a decision tree for the examples at each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants.

Step 6. The iterative partitioning stops only when one of the following conditions is true: (a) all examples for a given node belong to the same class; (b) there are no remaining attributes on which the examples can be further partitioned, in which case a leaf is created with the majority class of the examples; or (c) there are no examples for a branch's known value of the test attribute, in which case a leaf is created with the majority class of the examples.

*Source: adapted from Han and Kamber (2000, pp )
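The information gain computation of Steps 2 to 4 can be sketched as follows; the example split counts are illustrative.

```python
import math

def entropy(p, n):
    """I(p, n) from Step 2, with 0 log 0 taken as 0."""
    total = p + n
    return -sum(c / total * math.log2(c / total) for c in (p, n) if c)

def information_gain(split_counts):
    """Gain(A) = I(p, n) - E(A) for a candidate attribute whose values
    partition the examples into per-value (fraud, legal) counts."""
    p = sum(f for f, _ in split_counts)
    n = sum(l for _, l in split_counts)
    e = sum((f + l) / (p + n) * entropy(f, l) for f, l in split_counts)
    return entropy(p, n) - e

# An attribute splitting 6 fraud / 6 legal into two pure subsets has gain 1;
# a split that separates nothing has gain 0.
pure_gain = information_gain([(6, 0), (0, 6)])
useless_gain = information_gain([(3, 3), (3, 3)])
```

C4.5 evaluates this quantity for every candidate attribute at a node and branches on the one with the highest gain.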

APPENDIX I BACKPROPAGATION ALGORITHM*

Step 1. Assume that there are two output classes, fraud and legal. Initialise the weights in the network to small random numbers between 0 and 1. Present an N-dimensional example X to the input layer, including an extra input of -1 for the threshold.

Step 2. Each hidden neuron's net input is
$$net_j^h = \sum_{i=1}^{N+1} w_{ij} x_i,$$
where $w_{ij}$ is the weight of the connection from unit i in the input layer to unit j in the hidden layer, and $x_i$ is the output value of unit i from the input layer. Each hidden neuron's output is determined by applying the logistic function to its net input,
$$y_j = \frac{1}{1 + e^{-\lambda\, net_j^h}},$$
which maps a large input domain onto the smaller range of 0 to 1.

Step 3. The output neuron's net input (one output neuron for two classes) is
$$net^o = \sum_{j=1}^{J} v_j y_j,$$
where $v_j$ is the weight of the connection from unit j in the hidden layer to the output neuron, and $y_j$ is the output value of unit j from the hidden layer. The output neuron's output is determined by applying the logistic function to its net input,
$$o_1 = \frac{1}{1 + e^{-\lambda\, net^o}}.$$

Step 4. The error, or learning signal, for the output neuron is
$$r^o_1 = \lambda (d - o_1)\, o_1 (1 - o_1),$$
where $o_1$ is the actual output, $d$ is the desired output, and $o_1(1 - o_1)$ is the derivative of the logistic function. The error, or learning signal, for the hidden neurons is
$$r^h_j = \lambda\, (r^o_1 v_j)\, y_j (1 - y_j),$$
where $v_j$ is the weight of the connection from the output neuron to unit j in the hidden layer, and $r^o_1$ is the error of the output neuron.

Step 5. The weights in the output layer are updated with
$$v_j(t+1) = v_j(t) + c\lambda (d - o_1)\, o_1 (1 - o_1)\, y_j(t),$$
and the weights in the hidden layer are updated with
$$w_{ij}(t+1) = w_{ij}(t) + c\lambda\, (r^o_1 v_j)\, y_j (1 - y_j)\, x_i(t).$$

Step 6. The training terminates when either (a) a prespecified number of epochs has expired, or (b) the error $E$ of the previous epoch is smaller than a prespecified threshold, where $E$ is the error of each epoch.

*Source: adapted from Smith (1999, pp45-51) and Han and Kamber (2000, pp )
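Steps 2 to 5 above can be sketched as one training iteration for a toy network with one hidden layer and a single output neuron; the network sizes, initial weights, and constants are illustrative, not the thesis's settings.

```python
import math

def backprop_step(x, d, w, v, lam=1.0, c=0.1):
    """One iteration of the algorithm above: forward pass (Steps 2-3),
    error signals (Step 4), and weight updates (Step 5)."""
    # Forward pass through the hidden layer and output neuron.
    net_h = [sum(wi * xi for wi, xi in zip(col, x)) for col in w]
    y = [1 / (1 + math.exp(-lam * n)) for n in net_h]
    o = 1 / (1 + math.exp(-lam * sum(vj * yj for vj, yj in zip(v, y))))
    # Error signals for the output and hidden neurons.
    r_o = lam * (d - o) * o * (1 - o)
    r_h = [lam * r_o * vj * yj * (1 - yj) for vj, yj in zip(v, y)]
    # Weight updates for both layers.
    v = [vj + c * r_o * yj for vj, yj in zip(v, y)]
    w = [[wi + c * rj * xi for wi, xi in zip(col, x)]
         for rj, col in zip(r_h, w)]
    return w, v, o

x = [0.5, -1.0]                  # one example; the -1 acts as the threshold input
w = [[0.2, 0.1], [0.4, -0.3]]    # input -> hidden weights, two hidden neurons
v = [0.3, -0.2]                  # hidden -> output weights
for _ in range(200):
    w, v, o = backprop_step(x, 1.0, w, v)
```

Repeating the step drives the output toward the desired class label, which is the behaviour Step 6 monitors when deciding to stop.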

104 APPENDIX J DATA DICTIONARY - ORIGINAL AND SYMBOLIC ATTRIBUTES Original Attributes Categories Symbolic Attributes Brief Description Categories Month 12 month accident month 12 WeekOfMonth 5 week_of_month accident week of month 5 DayOfWeek 7 day_of_week accident day of week 7 Make 19 make car manufacturer 19 AccidentArea 2 accident_area city or country 2 DayOfWeekClaimed 2 month_claimed claim month 2 MonthClaimed 12 week_of_month_claimed claim week of month 12 WeekOfMonthClaimed 5 day_of_week_claimed claim day of week 5 Sex 2 year 1994, 1995, MaritalStatus 4 weeks_past accident and claim difference 8 Fault 2 is_holidayweek_claim claim was made on a holiday week 2 PolicyType 9 sex gender mostly male 2 VehicleCategory 3 marital_status mainly single or married 4 VehiclePrice 6 fault policyholder or third party 2 RepNumber 16 vehicle_category sedan, sport or utility 3 Deductible 4 vehicle_price mainly between $20,000 and $39,000 6 DriverRating 4 rep_number ID of person who processed the claim 16 Days:Policy-Accident 5 deductible policyholder payment before claim disbursement 4 Days:Policy-Claim 4 driver_rating higher indicates less premiums to pay 4 PastNumberOfClaims 4 days_policy-accident days left in policy when accident happened 5 AgeOfVehicle 8 days_policy-claim days left in policy when claim was filed 4 AgeOfPolicyHolder 9 past_number_of_claims usually 4 or less previous claims 4 PoliceReportFiled 2 age_of_vehicle usually 6 years or older 8 WitnessPresent 2 age_price_wsum age and price vehicle price combined 7 AgentType 2 age_of_policy-holder mainly between 31 to 50 years old 9 NumberOfSuppliments 4 police_report_filed only 2.85% filed a police report 2 AddressChange-Claim 5 witness_present only 0.57% had witnesses 2 NumberOfCars 5 agent_type only 1.59% are internal, rest are external 2 Year 3 number_of_suppliments about 50% do not have any suppliments 4 BasePolicy 3 address_change-claim about 90% do not have any address changes 4 FraudFound 2 number_of_cars 
about 90% have only one car insured by the same insurer 4 base_policy all-perils, collision or liability 3 class only about 6% of claims are fraudulent 2 A12

105 APPENDIX K DATA DICTIONARY NUMERIC ATTRIBUTES Numeric Attributes Representations Categories Numeric Attributes (continued) Representations Categories month(1) 2 marital_status(1) 2 month(2) 2 marital_status(2) 2 Binary coded one-of-n coded month(3) 2 marital_status(3) 2 month(4) 2 marital_status(4) 2 week_of_month(1) 2 fault Binary values 2 week_of_month(2) Binary coded 2 vehicle_category(1) 2 week_of_month(3) 2 vehicle_category(3) one-of-n coded 2 day_of_week Continuous values 7 vehicle_category(2) 2 make(1) 2 vehicle_price Continuous values 6 make(2) 2 rep_number(1) 2 make(3) Binary coded 2 rep_number(2) 2 Binary coded make(4) 2 rep_number(3) 2 make(5) 2 rep_number(4) 2 accident_area Binary values 2 deductible Continuous values 4 month_claimed(1) 2 driver_rating Continuous values 4 month_claimed(2) 2 days_policy-accident Continuous values 5 Binary coded month_claimed(3) 2 days_policy-claim Continuous values 4 month_claimed(4) 2 past_number_of_claims Continuous values 4 week_of_month_claimed(1) 2 age_of_vehicle Continuous values 8 week_of_month_claimed(2) Binary coded 2 age_price_wsum Continuous values 7 week_of_month_claimed(3) 2 age_of_policy-holder Continuous values 9 day_of_week_claimed Continuous values 7 police_report_filed Binary values 2 year(1) 2 witness_present Binary values 2 year(2) one-of-n coded 2 agent_type Binary values 2 year(3) 2 number_of_suppliments Continuous values 4 weeks_past(1) 2 address_change-claim Continuous values 4 weeks_past(2) Binary coded 2 number_of_cars Continuous values 4 weeks_past(3) 2 base_policy(1) 2 is_holidayweek_claim Binary values 2 base_policy(2) one-of-n coded 2 sex Binary values 2 base_policy(3) 2 class Binary values 2 A13

APPENDIX L TRAINING, SCORING NAIVE BAYES*

TRAINING NAIVE BAYES
Automatic binning was used to transform the numeric data for all naive Bayesian models in the icons Bin Columns and Create Bin Tree.

SCORING NAIVE BAYES

*Source: adapted from Board of Trustees of the University of Illinois (2003)

APPENDIX M TRAINING, SCORING C4.5*

TRAINING C4.5
The minimum leaf ratio for all C4.5 models was set to in the icon C4.5 Tree Builder to create a split node instead of a leaf if the information gain measure was greater.

SCORING C4.5

*Source: adapted from Board of Trustees of the University of Illinois (2003)

APPENDIX N TRAINING, SCORING BACKPROP & SOM*

TRAINING BACKPROPAGATION
To maximise the cost savings under the 40/60 distribution, a 0.1 learning rate, 0.1 momentum, 100 hidden neurons, and a 0.8 decision threshold were set. To maximise the cost savings under the 5/95 distribution, a 0.1 learning rate, 0.1 momentum, 100 hidden neurons, and a 0.13 decision threshold were set.

SCORING BACKPROPAGATION

TRAINING AND SCORING SOM
The best parameter values for the SOM were a 0.5 learning rate, 0.5 initial weights, and 1000 epochs. The neighbourhood size was one less than the number of output neurons.

*Source: adapted from Ward Systems Group (1996)
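A single SOM training update with the 0.5 learning rate above can be sketched as follows; the grid size, input, and Manhattan-distance neighbourhood are illustrative assumptions, not the tool's actual implementation.

```python
def som_step(grid, x, alpha=0.5, radius=1):
    """One update of a Self-Organising Map on a 2-D grid of weight vectors:
    find the best-matching unit (BMU), then pull it and its neighbours
    a fraction alpha of the way toward the input x."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # BMU: the grid cell whose weight vector is closest to the input.
    bmu = min(((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
              key=lambda rc: dist2(grid[rc[0]][rc[1]], x))
    for r, row in enumerate(grid):
        for c, wvec in enumerate(row):
            if abs(r - bmu[0]) + abs(c - bmu[1]) <= radius:
                row[c] = [wi + alpha * (xi - wi) for wi, xi in zip(wvec, x)]
    return bmu

# A 2x2 map over 2-D inputs; repeated updates move the units toward the data.
grid = [[[0.1, 0.1], [0.9, 0.1]],
        [[0.1, 0.9], [0.9, 0.9]]]
bmu = som_step(grid, [1.0, 1.0])
```

After training, each grid cell acts as one of the two-dimensional clusters the thesis uses for fraud profiling.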

APPENDIX O RANKED CLASSIFIERS (STACKING)

APPENDIX P OUTPUTS OF ALL ALGORITHMS

NAIVE BAYES AND C4.5 OUTPUT

BACKPROPAGATION OUTPUT

SOM OUTPUT

APPENDIX Q SCORES AND THRESHOLDS

APPENDIX R DECISION TREE RULE INDUCTION*

[Decision tree diagram over 10,727 legal and 683 fraud claims (94.0% and 6.0% respectively). The root splits on BasePolicy: All Perils (3,301 claims, 10.0% fraud), Collision (4,421 claims, 7.3% fraud), and Liability (3,688 claims, 0.8% fraud). Deeper splits use Fault, Days:Policy-Claim, Days:Policy-Accident, AgeOfPolicyHolder, Deductible, VehicleCategory, AccidentArea, AgeOfVehicle, NumberOfCars, and AddressChange-Claim.]

*Source: based on Angoss Software Corporation (1998)

APPENDIX S FRAUDSTER PROFILE

[Three charts profiling clusters of fraudulent claims by AGE_OF_POLICYHOLDER, for the three-, five- and seven-cluster solutions. Each chart plots % lift over proportion in the fraud population, per cluster, across the age bands 16-17, 18-20, 21-25, 26-30, 31-35, 36-40, 41-50, 51-65 and over 65.]
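The measure plotted in these charts, lift over proportion in the fraud population, presumably compares a cluster's share of an attribute value with that value's share among all fraudulent claims, so that 100% means the cluster looks no different from the fraud population as a whole. A minimal sketch of that calculation; the counts below are invented for illustration, not taken from the thesis data:

```python
# Lift of one cluster over the whole fraud population, per attribute value.
# 100% = the cluster matches the population; above 100% = over-represented.
# All counts here are hypothetical.
fraud_population = {"urban": 600, "rural": 83}   # every fraudulent claim
cluster = {"urban": 40, "rural": 20}             # one cluster of them

pop_total = sum(fraud_population.values())
clu_total = sum(cluster.values())
for value, pop_count in fraud_population.items():
    pop_share = pop_count / pop_total
    clu_share = cluster[value] / clu_total
    print(f"{value}: lift {clu_share / pop_share:.1%}")
```

With these hypothetical counts the cluster is strongly over-represented in rural accidents (lift well above 100%) and under-represented in urban ones, which is the kind of contrast the charts in this appendix are designed to surface.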

APPENDIX T FRAUDSTER OPERATING TIMES

[Two charts analysing the five clusters of fraudulent claims by % lift over proportion in the fraud population: one by WEEKS_PAST between accident and claim (0, 1, 2, 3, 4, 5-9, 10-19 and 20-52 weeks), and one by IS_HOLIDAYWEEK_CLAIM (whether the claim was made in a holiday week: yes/no).]

APPENDIX U FRAUDSTER CHARACTERISTICS

[Three charts analysing the five clusters of fraudulent claims by % lift over proportion in the fraud population: by MAKE (accura, chevrolet, ford, honda, mazda, pontiac, toyota), by ACCIDENT_AREA (urban, rural) and by VEHICLE_CATEGORY (sedan, sport, utility).]

[Three further charts for the same five clusters: by AGE_PRICE_WSUM (weighted sum of vehicle age and price), by NUMBER_OF_CARS insured by the policyholder (fewer than 3, 3 to 4) and by BASE_POLICY (all-perils, collision, liability).]

APPENDIX V UNDERSTANDING PREDICTIONS - ACTUAL EXCEL ANALYSIS

[Two charts analysing the five clusters by lift over proportion in the fraud population for fraudulent claims that went undetected: one for claims undetected by all three algorithms, and one for claims undetected by two algorithms.]

REFERENCES

Abagnale F (2001) The Art of the Steal: How to Protect Yourself and Your Business from Fraud, Transworld Publishers, NSW, Australia.

Abbott D, Matkovsky P and Elder J (1998) "An Evaluation of High-end Data Mining Tools for Fraud Detection", in Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, CA, USA.

Angoss Software Corporation (1998) KnowledgeSeeker IV Help, Ontario, Canada.

Baldock T (1997) "Insurance Fraud", Trends and Issues in Crime and Criminal Justice, Australian Institute of Criminology, 66, pp1-6.

Bayes T (1763) "An essay towards solving a problem in the doctrine of chances", Philosophical Transactions of the Royal Society of London, 53.

Berry M and Linoff G (2000) Mastering Data Mining: The Art and Science of Customer Relationship Management, John Wiley and Sons, New York, USA.

Bigus J (1996) Data mining with neural networks, McGraw Hill, New York, USA.

Board of Trustees of the University of Illinois (2003) D2K - Data to Knowledge Toolkit User Manual, Illinois, USA.

Brachman R J and Anand T (1996) "The Process of Knowledge Discovery in Databases", in Fayyad et al (eds), Advances in Knowledge Discovery and Data Mining, AAAI Press, CA, USA.

Brause R, Langsdorf and Hepp M (1999) "Neural Data Mining for Credit Card Fraud Detection", in Proceedings of 11th IEEE International Conference on Tools with Artificial Intelligence, Illinois, USA.

Breiman L (1994) Heuristics of instability in model selection, Technical Report, Department of Statistics, University of California at Berkeley, USA.

Brockett P, Xia X and Derrig R (1998) "Using Kohonen's Self Organising Feature Map to Uncover Automobile Bodily Injury Claims Fraud", Journal of Risk and Insurance, USA.

Cabena P, Hadjinian P, Stadler R, Verhees J and Zanasi A (1998) Discovering Data Mining: From Concept to Implementation, Prentice Hall Inc, CA, USA.

Cahill M, Lambert D, Pinheiro J and Sun D (2002) "Detecting Fraud In The Real World", in The Handbook of Massive Datasets, Kluwer Academic Publishers.

Chan P, Fan W, Prodromidis A and Stolfo S (1999) "Distributed data mining in credit card fraud detection", IEEE Intelligent Systems, 14.

Chan P and Stolfo S (1995) "A Comparative Evaluation of Voting and Meta-learning on Partitioned Data", in Proceedings of 12th International Conference on Machine Learning.

Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz T, Shearer C and Wirth R (2000) CRISP-DM 1.0: Step-by-step data mining guide, CRISP-DM Consortium, Denmark, Germany, Netherlands, USA.

Chen Z (2001) Data Mining and Uncertain Reasoning: An Integrated Approach, John Wiley and Sons, New York, USA.

CoIL Challenge (2000) The Insurance Company Case, Technical Report, Leiden Institute of Advanced Computer Science, Netherlands.

Converium (2002) Tackling Insurance Fraud - Law and Practice, White Paper, Zurich, Switzerland.

Cox E (1995) "A Fuzzy System for Detecting Anomalous Behaviours in Healthcare Provider Claims", in Goonatilake S and Treleaven P (eds), Intelligent Systems for Finance and Business, John Wiley and Sons, Chichester, England.

Dick P K (1956) Minority Report, Orion Publishing Group, London, Great Britain.

Domingos P and Pazzani M (1996) "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier", in Proceedings of the 13th Conference on Machine Learning.

Elkan C (1997) Naive Bayesian Learning, Technical Report CS97-557, Department of Computer Science and Engineering, University of California, San Diego, USA.

Elkan C (2001) Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000, Department of Computer Science and Engineering, University of California, San Diego, USA.
Estivill-Castro V and Lee I (2001) Data Mining Techniques for Autonomous Exploration of Large Volumes of Geo-Referenced Crime Data, Department of Computer Science and Software Engineering, University of Newcastle, NSW, Australia.

FairIsaac (2003a) Application Fraud Models, CA, USA.

FairIsaac (2003b) Optimising Business Practices to Reduce Fraud Loss, CA, USA.

Fawcett T and Provost F (1996) "Combining Data Mining and Machine Learning for Effective User Profiling", in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Oregon, USA.

Fawcett T and Provost F (1997) "Adaptive fraud detection", Data Mining and Knowledge Discovery Journal, 1.

Fayyad U, Piatetsky-Shapiro G, Smyth P and Uthurusamy R (eds) (1996) Advances in knowledge discovery and data mining, AAAI Press, CA, USA.

Feelders A J (2003) "Statistical Concepts", in Berthold M and Hand D (eds), Intelligent Data Analysis, Springer-Verlag, Berlin, Germany.

Grossman R, Bailey S, Ramu A, Malhi B, Sivakumar H and Turinsky A (1999) "Papyrus: A System for Data Mining over Local and Wide Area Clusters and Super-Clusters", in Proceedings of Supercomputing.

Grossman R (2001) "A Top-Ten List for Data Mining", SIAM News, Volume 34.

Groth R (1998) Data Mining: a Hands-on Approach for Business Professionals, Prentice Hall Inc, NJ, USA.

Guo Y, Ruger S, Sutiwaraphun J and Forbes-Millott J (1997) "Meta-learning for parallel data mining", in Proceedings of the 7th Parallel Computing Workshop.

Han J and Kamber M (2001) Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.

He H, Wang J, Graco W and Hawkins S (1998) "Application of Neural Networks to Detection of Medical Fraud", Expert Systems with Applications, 13.

Hsu J (2002) "Data Mining Trends and Developments: The Key Data Mining Technologies and Applications for the 21st Century", in Proceedings of ISECON 2002, 19, AITP Foundation for Information Technology Education.

Infoglide Software Corporation (2002) Bladeworks Solution for Insurance Fraud Detection, Texas, USA.

Insurance Information Institute (2003) Facts and Statistics on Auto Insurance, NY, USA.

James F (2002) "FBI has eye on business databases", Chicago Tribune, Knight Ridder/Tribune Information Services.

Kargupta H, Park B H, Hershberger D and Johnson E (1999) "Collective data mining: A new perspective toward distributed data mining", in Kargupta H and Chan P (eds), Advances in Distributed and Parallel Knowledge Discovery, MIT/AAAI Press, MA, USA.

Kargupta H and Chan P (2000) Advances in Distributed and Parallel Knowledge Discovery, MIT/AAAI Press, MA, USA.

Keim D and Ward M (2003) "Visualisation", in Berthold M and Hand D (eds), Intelligent Data Analysis, Springer-Verlag, Berlin, Germany.

Kohonen T (1982) "Self-organised Formation of Topologically Correct Feature Maps", Biological Cybernetics, 43.

KPMG (2002a) Australian Fraud Survey Report, Australia.

KPMG (2002b) Singapore Fraud Survey Report, Singapore.

Levine N (1999) "CrimeStat: A Spatial Statistics Program for the Analysis of Crime Incident Locations", in Proceedings of 4th International Conference on Geocomputation.

Maes S, Tuyls K, Vanschoenwinkel B and Manderick B (2002) "Credit Card Fraud Detection Using Bayesian and Neural Networks", in Proceedings of NF2002, Havana, Cuba.

Magnify (2002a) FraudFocus Advanced Fraud Detection, White Paper, Chicago, USA.

Magnify (2002b) The Evolution of Insurance Fraud Detection: Lessons learnt from other industries, White Paper, Chicago, USA.

Major J and Riedinger D (1995) "EFD: Heuristic Statistics for Insurance Fraud Detection", in Goonatilake S and Treleaven P (eds), Intelligent Systems for Finance and Business, John Wiley and Sons, Chichester, England.

Mena J (2003a) Data Mining for Homeland Security, Executive Briefing, VA, USA.

Mena J (2003b) Investigative Data Mining for Security and Criminal Detection, Butterworth Heinemann, MA, USA.

Mickelburough P and Kelly J (2003) "Millions named in files", Herald Sun, The Herald and Weekly Times.

Minsky M and Papert S (1969) Perceptrons: An Introduction to Computational Geometry, Expanded Edition 1988, MIT Press, Massachusetts, USA.

National Association of Insurance Commissioners (2003) Uniform Suspected Insurance Fraud Reporting Form, USA.

National White Collar Crime Center (2003) Insurance Fraud, White Paper, WV, USA.

Negnevitsky M (2002) Artificial Intelligence: A Guide to Intelligent Systems, Pearson Education Limited, London, UK.

O'Donnell (2002) "Gaining on the Fraudsters", InsuranceTech, CMP Media LLC.

Ormerod T, Morley N, Ball L, Langley C and Spenser C (2003) "Using Ethnography To Design a Mass Detection Tool (MDT) For The Early Discovery of Insurance Fraud", CHI 2003: New Horizons, ACM Press, Florida, USA.

Payscale (2003) HOURLYRATE/fid-6886, date last updated: 2003, date accessed: 7th September.

Phua C (2003a) Choosing and Explaining Likely Moped Customers, Department of Business Systems, Monash University, Victoria, Australia.

Phua C (2003b) Distributed Data Mining using Software Agents, Department of Business Systems, Monash University, Victoria, Australia.

Prodromidis A (1999) Management of Intelligent Learning Agents in Distributed Data Mining Systems, Unpublished PhD thesis, Columbia University, USA.

Provost F (2000) "Distributed Data Mining: Scaling Up and Beyond", in Kargupta H and Chan P (eds), Advances in Distributed and Parallel Knowledge Discovery, MIT/AAAI Press, MA, USA, pp3-27.

Pyle D (1999) Data Preparation for Data Mining, Morgan Kaufmann Publishers, San Francisco, USA.

Quinlan J R (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann, CA, USA.

Reason J (1990) Human Error, Cambridge University Press, Cambridge, Great Britain.

Rosenblatt F (1958) "The Perceptron: A Probabilistic Model for Information Storage and Organisation in the Brain", Psychological Review, 65.

Rumelhart D E and McClelland J L (eds) (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, MIT Press, MA, USA.

Salzberg S L (1997) "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach", Data Mining and Knowledge Discovery, Kluwer, 1.

SAS e-intelligence (2000) Data Mining in the Insurance Industry: Solving Business Problems using SAS Enterprise Miner Software, White Paper, USA.

Shao H, Zhao H and Chang G (2002) "Applying Data Mining to Detect Fraud Behaviour in Customs Declaration", in Proceedings of the First International Conference on Machine Learning and Cybernetics 2002, Beijing, China.

Silipo R (2003) "Neural Networks", in Berthold M and Hand D (eds), Intelligent Data Analysis, Springer-Verlag, Berlin, Germany.

Sparrow M K (2002) "Fraud Control in the Health Care Industry: Assessing the State of the Art", in Shichor et al (eds), Readings in White-Collar Crime, Waveland Press, Illinois, USA.

Smith K (1999) Introduction to Neural Networks and Data Mining for Business Applications, Eruditions Publishing, Melbourne, Australia.

Smith R (2003) "Serious Fraud in Australia and New Zealand", Research and Public Policy Series, Australian Institute of Criminology and PricewaterhouseCoopers, Victoria, Australia.

Soukup T and Davidson I (2002) Visual Data Mining: Techniques and Tools for Data Visualisation and Mining, John Wiley and Sons, New York, USA.

SPSS (2003) Data mining and crime analysis in the Richmond Police Department, White Paper, Virginia, USA.

Stolfo S, Prodromidis A, Tselepis S, Lee W, Fan D and Chan P (1997) "JAM: Java Agents for Meta-learning over Distributed Databases", in Proceedings of KDD-97 (runner-up best paper, applications) and AAAI'97 Workshop on AI Methods in Fraud and Risk Management.

Tan C (2003) "Crackdown on Motor Insurance Fraudsters", The Straits Times, Singapore Press Holdings.

Von Altrock C (1995) Fuzzy Logic and NeuroFuzzy Applications in Business and Finance, Prentice Hall, NJ, USA.

Ward Systems Group (1996) NeuroShell2 Help, MD, USA.

Weatherford M (2002) "Mining for Fraud", IEEE Intelligent Systems, July/August Issue, pp4-6.

Williams G (1999) "Evolutionary Hot Spots Data Mining: An Architecture for Exploring for Interesting Discoveries", Lecture Notes in Artificial Intelligence, Volume 1574, Springer-Verlag.

Williams G and Huang Z (1997) "Mining the Knowledge Mine: The Hot Spots Methodology for Mining Large Real World Databases", Lecture Notes in Artificial Intelligence, Springer-Verlag.

Witten I and Frank E (1999) Data Mining: Practical Machine Learning Tools and Techniques with Java, Morgan Kaufmann Publishers, CA, USA.

Wolpert D H (1992) "Stacked Generalization", Neural Networks, 5.

Zadrozny B and Elkan C (2001) "Learning and making decisions when costs and probabilities are both unknown", in Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining, AAAI Press.


More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]

More information

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS Mrs. Jyoti Nawade 1, Dr. Balaji D 2, Mr. Pravin Nawade 3 1 Lecturer, JSPM S Bhivrabai Sawant Polytechnic, Pune (India) 2 Assistant

More information

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support Rok Rupnik, Matjaž Kukar, Marko Bajec, Marjan Krisper University of Ljubljana, Faculty of Computer and Information

More information

Specific Usage of Visual Data Analysis Techniques

Specific Usage of Visual Data Analysis Techniques Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Three proven methods to achieve a higher ROI from data mining

Three proven methods to achieve a higher ROI from data mining IBM SPSS Modeler Three proven methods to achieve a higher ROI from data mining Take your business results to the next level Highlights: Incorporate additional types of data in your predictive models By

More information

IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES

IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES Bruno Carneiro da Rocha 1,2 and Rafael Timóteo de Sousa Júnior 2 1 Bank of Brazil, Brasília-DF, Brazil [email protected] 2 Network Engineering

More information

Neural Networks and Back Propagation Algorithm

Neural Networks and Back Propagation Algorithm Neural Networks and Back Propagation Algorithm Mirza Cilimkovic Institute of Technology Blanchardstown Blanchardstown Road North Dublin 15 Ireland [email protected] Abstract Neural Networks (NN) are important

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Dan French Founder & CEO, Consider Solutions

Dan French Founder & CEO, Consider Solutions Dan French Founder & CEO, Consider Solutions CONSIDER SOLUTIONS Mission Solutions for World Class Finance Footprint Financial Control & Compliance Risk Assurance Process Optimization CLIENTS CONTEXT The

More information

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19 PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

More information

Numerical Algorithms Group

Numerical Algorithms Group Title: Summary: Using the Component Approach to Craft Customized Data Mining Solutions One definition of data mining is the non-trivial extraction of implicit, previously unknown and potentially useful

More information

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,

More information

A Review of Data Mining Techniques

A Review of Data Mining Techniques Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Masters in Human Computer Interaction

Masters in Human Computer Interaction Masters in Human Computer Interaction Programme Requirements Taught Element, and PG Diploma in Human Computer Interaction: 120 credits: IS5101 CS5001 CS5040 CS5041 CS5042 or CS5044 up to 30 credits from

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

NEURAL NETWORKS IN DATA MINING

NEURAL NETWORKS IN DATA MINING NEURAL NETWORKS IN DATA MINING 1 DR. YASHPAL SINGH, 2 ALOK SINGH CHAUHAN 1 Reader, Bundelkhand Institute of Engineering & Technology, Jhansi, India 2 Lecturer, United Institute of Management, Allahabad,

More information

Masters in Computing and Information Technology

Masters in Computing and Information Technology Masters in Computing and Information Technology Programme Requirements Taught Element, and PG Diploma in Computing and Information Technology: 120 credits: IS5101 CS5001 or CS5002 CS5003 up to 30 credits

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling)

What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling) data analysis data mining quality control web-based analytics What is Data Mining, and How is it Useful for Power Plant Optimization? (and How is it Different from DOE, CFD, Statistical Modeling) StatSoft

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM M. Mayilvaganan 1, S. Aparna 2 1 Associate

More information

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition Brochure More information from http://www.researchandmarkets.com/reports/2170926/ Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd

More information

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

More information

How To Solve The Kd Cup 2010 Challenge

How To Solve The Kd Cup 2010 Challenge A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China [email protected] [email protected]

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Data Warehousing and Data Mining in Business Applications

Data Warehousing and Data Mining in Business Applications 133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH 205 A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH ABSTRACT MR. HEMANT KUMAR*; DR. SARMISTHA SARMA** *Assistant Professor, Department of Information Technology (IT), Institute of Innovation in Technology

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

KnowledgeSEEKER POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE

KnowledgeSEEKER POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE POWERFUL SEGMENTATION, STRATEGY DESIGN AND VISUALIZATION SOFTWARE Most Effective Modeling Application Designed to Address Business Challenges Applying a predictive strategy to reach a desired business

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information