Data Mining Algorithms
Data Mining End to End

Graham Williams
Principal Data Miner, ATO
Adjunct Associate Professor, ANU

http://togaware.com Copyright © 2006, Graham J Williams

Overview
  1 Process: CRISP-DM
  2 Hot Spots: NRMA, Medicare
  3 Evaluation and Communication: Communicating Performance
  4 Privacy: Protecting Privacy

The CRISP-DM Model
  Cross Industry Standard Process for Data Mining
  http://www.crisp-dm.org/
  Developed by NCR, Daimler-Benz, ISL, OHRA
  Define and validate a Data Mining Process Model
    applicable in diverse industry sectors
    industry and tool neutral
    large data mining projects executed faster, cheaper, more reliably and more manageably
  Life cycle of six (iterative) phases

Going Through Loops
  Cyclic nature of data mining:
    Every step of a data mining process can lead to revisiting any one of the previous steps
    A DM process continues after a solution has been deployed
    The lessons learnt can trigger new, often more focused business questions
    Subsequent data mining processes benefit from experiences of previous ones

Six Steps
  1 Business Understanding (25%)
  2 Data Understanding (20%)
  3 Data Preparation (25%)
  4 Modelling (10%)
  5 Evaluation (20%)
  6 Deployment
  [Figure: pie chart with segments 1 Find Objectives 20%, 2 Data Preparation 60%, 3 Data Mining 10%, 4 Analysis 10%]

Business Understanding
  We had better make sure we are addressing a real business problem
  Initial phase focuses on understanding project objectives and requirements from a business perspective
  This knowledge is converted into a data mining problem definition
  Develop a preliminary plan designed to achieve the objectives

Data Understanding
  Understand what data is available and its semantics
  Initial data collection
  Familiarisation with the data
    identify data quality problems
    discover first insights into the data
    detect interesting subsets to form hypotheses for hidden information

Data Preparation
  Bring together the data and get it into shape for mining
  Construct the mining dataset
  Derived from the initial raw dataset(s)
  Data preparation tasks:
    table, record, and attribute selection
    generation of derived features
    data transformation
    data cleaning

Preparing to Mine
  Issues to be dealt with include:
  Data Quality
    missing data
    noisy data
    lead to inconsistent or too general/specific discoveries
  Data Cleaning
    duplicates
    inconsistencies
    identify and merge the same entities

Modelling
  Now the data mining begins!!!
  Select various modelling techniques
  Apply and calibrate modelling techniques
  Typically there are several techniques for the same data mining problem
  Some techniques have specific requirements on the form of data and require stepping back to the data preparation phase

Evaluation
  How do we know we have a useful outcome?
  Evaluate the model and review the steps executed to construct the model
  Does the model properly achieve the business objectives?
  Is there some important business issue that has not been sufficiently considered?

Deployment
  Decide on the use of the data mining results
  No point to data mining unless we action the outcomes
  Deployment may be:
    Generate a report of the discoveries made
    Implement changes in the processes of the organisation
    Implement a repeatable data mining process
  For successful deployment the customer must understand the actions to be carried out in order to actually make use of the created models

Summary
  The KDD Process
    Iterative process requiring multiple loops
    Time consuming
    Mining is one small step
    Data issues are crucial to success

Overview
  1 Process: CRISP-DM
  2 Hot Spots: NRMA, Medicare
  3 Evaluation and Communication: Communicating Performance
  4 Privacy: Protecting Privacy

Motor Vehicle Insurance
  Insurance premium setting and risk rating
  Actuaries study data and domain for general understanding of risk
  Several million transactions annually
  Consider more than the traditional small number of factors
  Data mining can explore very large collections of data, both entities and features
  The Hot Spots methodology combines Cluster Analysis and Decision Trees to symbolically identify candidate regions of a dataset

Cluster then Describe then Measure
  [Figure: decision tree describing clusters C1 to C4, with splits on ClaimCost, SumRqst, Model, ClmType, Postcode and Cubic]

Find the Interesting Groups
  Rule 1   NCB < 6 and Age 24 and Address is Urban
  Rule 23  Age > 57 and Vehicle in {Utility, Station Wagon}

  Nugget  Claims  Total  Proportion  Average Cost  Total Cost
  1       15      14     11          37            545,
  2       14      23     6           38            535,
  3       5       25     2           44            13,
  4       1       12     8           79            79,1
  5       2       34     6           53            116,
  6       65      52     13          44            28,7
  7       5       5      1           68            2,3
  6       8       14     59          35            2,8,
  All     38      72     5           3             12,,

Finding the Interesting Groups
  Evaluate the large collection of groups (or Hot Spots) to find those that are important to the core business
  Nugget By Claims By Proportion By Average Cost 2 Y 3 Y 19 Y 24 Y 34 Y Y Y 35 Y Y 36 Y 4 Y Y
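
The evaluation step above, where each nugget is measured by claims, proportion and average cost, can be sketched in a few lines of Python. This is only an illustration and not the Hot Spots implementation: the Nugget class, the flag function, the min_claims threshold and every figure in the example are hypothetical, and the rule used here (flag a nugget when it exceeds the value for the whole dataset) is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Nugget:
    """One candidate group (hypothetical structure, for illustration only)."""
    name: str
    claims: int        # policies in the group that made a claim
    total: int         # all policies in the group
    total_cost: float  # summed cost of the group's claims

    @property
    def proportion(self) -> float:
        # Share of the group's policies that made a claim.
        return self.claims / self.total

    @property
    def average_cost(self) -> float:
        # Mean cost per claim; zero when the group has no claims.
        return self.total_cost / self.claims if self.claims else 0.0


def flag(nugget: Nugget, overall: Nugget, min_claims: int = 100) -> dict:
    """Mark a nugget as interesting on each measure (assumed criteria)."""
    return {
        "by_claims": nugget.claims >= min_claims,
        "by_proportion": nugget.proportion > overall.proportion,
        "by_average_cost": nugget.average_cost > overall.average_cost,
    }


# Made-up figures: the whole dataset, then three candidate groups.
overall = Nugget("All", claims=500, total=10_000, total_cost=1_000_000.0)
nuggets = [
    Nugget("A", claims=120, total=1_000, total_cost=480_000.0),
    Nugget("B", claims=10, total=1_000, total_cost=5_000.0),
    Nugget("C", claims=40, total=200, total_cost=60_000.0),
]

for n in nuggets:
    marks = flag(n, overall)
    shown = " ".join("Y" if hit else "-" for hit in marks.values())
    print(f"{n.name}: {shown}")
```

Keeping the three measures separate, rather than collapsing them into one score, mirrors the slide: a group can matter because it is large, because it claims often, or because its claims are expensive.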

[Figure: three panels labelled pina, pinb and pinc]

Operationalise
  Identify groups that are:
  High Risk
    Very high dollars per claim
    Large percentage of claims in the group
  Low Risk
    Very few claims from the group
    Claims are low in dollars

Health Insurance Commission
  Universal Health Coverage
  Terabytes of patient claims since the inception of Medicare
  Inappropriate Provider practices an ongoing focus
  Exploration of public fraud (including doctor shoppers)
  Exploration of the practice of pathology

Cluster/Describe/Measure
  [Figure: decision tree describing clusters C1 to C4, with splits on ClaimCost, SumRqst, Model, ClmType, Postcode and Cubic]

Cluster/Describe/Deliver
  Rule 1  Age is between 28 and 35 and Weeks 5
  Rule 2  Weeks < 1 and Benefits > $35

  Nugget  Size  Age  Gender  Services  Benefits  Weeks  Hoard  Regular
  1       9     3    F       1         3         2      1      1
  2       15    3    F       24        841       4      2      4
  3       12    65   M       7         22        2      1      1
  4       8     45   F       3         75        1      1      1
  5       9     1    M       12        1125      1      5      2
  6       8     55   M       8         55        7      1      9
  28      3     25   F       15        45        15     2      6
  All     4,    45           8         3         3      1      1

Claim Hoarders
  A distinct group of behaviour identified as Claim Hoarders
  But there may be many millions of these individuals

Medicare Regulars
  Group of patients with very regular activity:
  [Figure: two panels labelled pinad and pind]
  Remove non-cash payments!!!

Operationalise
  The fraud identified was investigated and appropriate action taken
    Perpetrators prosecuted
    Funds recovered
    Processes improved to cross validate data

Overview
  1 Process: CRISP-DM
  2 Hot Spots: NRMA, Medicare
  3 Evaluation and Communication: Communicating Performance
  4 Privacy: Protecting Privacy

The Importance of Communication
  Selling the story to management
  Do we present the model or the outcomes?
  Senior management in something like the ATO is necessarily cautious, protecting the integrity of the country's revenue system
  Need to demonstrate and prove the performance and robustness of models before deployment

Options in Rattle: Confusion Matrix
  A simple instrument to convey predictive performance
  But quite a blunt instrument

  Confusion matrix rpart model on audit.csv [test] (counts):
          Predicted
  Actual    0    1
       0  428   56
       1   44   72

  Confusion matrix rpart model on audit.csv [test] (%):
          Predicted
  Actual    0    1
       0   71    9
       1    7   12

Options in Rattle: Risk Charts
  Developed specifically for the ATO
  Capture both the score exhibited through probability, and the size of the Risk associated with each case!
  Often, it is the Risk that is of most interest
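
The idea behind a risk chart can be sketched as follows: rank the cases by model score, then track, as more of the caseload is examined, what fraction of the positive cases (recall) and what fraction of the total dollar risk have been captured. The Python below is a minimal sketch under assumptions and is not Rattle's implementation: the Case class, the risk_chart and area_under functions and all example values are hypothetical, and the trapezoidal rule is simply one reasonable way to compute the area under each curve.

```python
from dataclasses import dataclass


@dataclass
class Case:
    score: float   # model score, e.g. predicted probability of a positive outcome
    outcome: int   # 1 if the case turned out positive, otherwise 0
    risk: float    # dollar amount associated with the case (0 for negatives)


def risk_chart(cases):
    """Return caseload, recall and risk curves as lists of fractions in [0, 1].

    Assumes at least one positive case and a non-zero total risk.
    """
    ranked = sorted(cases, key=lambda c: c.score, reverse=True)
    total_pos = sum(c.outcome for c in ranked)
    total_risk = sum(c.risk for c in ranked)
    caseload, recall, risk = [0.0], [0.0], [0.0]
    found, dollars = 0, 0.0
    for i, c in enumerate(ranked, start=1):
        found += c.outcome
        dollars += c.risk
        caseload.append(i / len(ranked))
        recall.append(found / total_pos)
        risk.append(dollars / total_risk)
    return caseload, recall, risk


def area_under(xs, ys):
    """Trapezoidal area under the curve defined by the points (xs, ys)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


# Made-up cases: the highest-scored case also carries the most dollars.
cases = [
    Case(0.9, 1, 500.0), Case(0.8, 1, 100.0), Case(0.7, 0, 0.0),
    Case(0.4, 1, 50.0), Case(0.2, 0, 0.0), Case(0.1, 0, 0.0),
]
caseload, recall, risk = risk_chart(cases)
print(f"area under recall curve: {area_under(caseload, recall):.3f}")
print(f"area under risk curve:   {area_under(caseload, risk):.3f}")
```

A model whose high scores fall on the high-dollar cases pushes the risk curve above the recall curve, which is why the two areas are reported separately.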

Risk Chart: RPart
  [Figure: risk chart for the rpart model]

Risk Chart: RF
  [Figure: risk chart for the rf model]

Risk Chart: SVM
  [Figure: risk chart for the ksvm model]

Risk Chart: Textual Comparison
  The area under the Risk and Recall curves for rpart model
    Area under the Risk (red) curve: 79% (0.790)
    Area under the Recall (green) curve: 76% (0.762)
  The area under the Risk and Recall curves for rf model
    Area under the Risk (red) curve: 78% (0.780)
    Area under the Recall (green) curve: 78% (0.779)
  The area under the Risk and Recall curves for ksvm model
    Area under the Risk (red) curve: 78% (0.777)
    Area under the Recall (green) curve: 77% (0.774)
  Which is best?

Overview
  1 Process: CRISP-DM
  2 Hot Spots: NRMA, Medicare
  3 Evaluation and Communication: Communicating Performance
  4 Privacy: Protecting Privacy

Privacy and Data Mining
  Laws in many countries directly affect Data Mining and it is worth being aware of them; penalties are often severe
  The OECD Principles of Data Collection were drafted in 1980
  They embody guiding principles for governments
  Revised for APEC 2003 as part of the Asia-Pacific Privacy Charter Initiative
  Data mining by the Australian Taxation Office is governed by data matching and privacy protocols, and independently overseen by the Privacy Commissioner, ANAO, and others

What are we trying to protect?
  Protect, amongst others:
    Religious freedom
    Freedom from racial discrimination
    Personal medical records
    Employment history
    Political freedom

Centrelink and Privacy Breaches
  In August 2006 Centrelink (Australian Social Security Agency) announced it had identified over 5 privacy breaches committed by staff
  Identified by monitoring and mining database access logs
  Activities
    looking at own personal records
    looking at family, friends, neighbours
    obtaining information to be sold
    changing information for financial gain
  Consequences
    counselling
    reduced pay/position
    lose job

AOL Privacy Breach
  AOL researchers thought to release anonymised web query logs for researchers (August 2006)
  Covered 25, users and 20 million queries
  (Compare with US DoJ demand that Google, AOL, etc, supply such data for them to monitor their citizens)
  Usernames converted to numeric IDs
  But, aggregate queries for a single numeric ID enough to identify individuals
    multiple queries paint a picture
    private financial situation: property and bank loan enquiries
    indication of criminal activity or research for a book?
    health: pregnancy, home loan, dog vomit + uncooked pasta
  "This was a screw-up and we're angry and upset about it" (AOL)

Criminal Intent or Research
  Would we use the following to investigate someone?
    how to change brake pads on scion xb
    25 us open cup florida state champions
    how to get revenge on a ex
    how to get revenge on a ex girlfriend
    how to get revenge on a friend who you over
    replacement bumper for scion xb
    florida department of law enforcement
    crime stoppers florida
  Perhaps someone researching a novel!

A Distressed Victim?
  A quite distressing example from the AOL disclosure:
    casey middle school
    surgical help for depression
    can you adopt after a suicide attempt
    gynecology oncologists in new york city
    Fishman David Dr 16 E 34th St, New York, 116
    how to tell your family you're a victim of incest
    how long will the swelling last after my tummy tuck
    teaching positions in denver colorado
    divorce laws in ohio

Privacy is Important
  Privacy is important to ensure freedom from oppression
  Privacy can be breached either accidentally or purposefully
  How much should we allow our governments to breach our privacy? What is the right trade-off?
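
The mechanism behind the AOL disclosure is simple enough to sketch: replacing a username with a numeric ID leaves all of that person's queries linked together, so grouping by ID rebuilds a profile. The Python below is a hypothetical illustration only. The build_profiles function, the IDs and the queries are all invented for the sketch and are not taken from the released logs.

```python
from collections import defaultdict


def build_profiles(log):
    """Group (user_id, query) pairs by user_id, keeping query order."""
    profiles = defaultdict(list)
    for user_id, query in log:
        profiles[user_id].append(query)
    return dict(profiles)


# Invented log entries; the numeric IDs stand in for removed usernames.
log = [
    (1001, "home loan rates smalltown"),
    (2002, "weather tomorrow"),
    (1001, "dentist near main street smalltown"),
    (1001, "jane example family tree"),
    (2002, "football scores"),
]

profiles = build_profiles(log)
for user_id, queries in sorted(profiles.items()):
    print(user_id, queries)
```

No single query for ID 1001 names anyone, but together they point to a town, a street and a family name, which is the sense in which multiple queries paint a picture.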

Principles of Data Collection
  1 Collection Limitation: There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject
  2 Data Quality: Personal data should be relevant to the purposes for which they are to be used, and, to the extent necessary for those purposes, should be accurate, complete and kept up-to-date
  3 Purpose Specification: The purposes for which personal data are collected should be specified not later than at the time of collection and the subsequent use limited to the fulfilment of those purposes or such others as are not incompatible with those purposes and as are specified on each occasion of change of purpose
  4 Use Limitation: Personal data should not be disclosed, made available or otherwise used for purposes other than those specified in accordance with [Principle 3] except:
    with the consent of the data subject; or
    by the authority of law
  5 Security Safeguards: Personal data should be protected by reasonable security safeguards against such risks as loss or unauthorised access, destruction, use, modification or disclosure of data
  6 Openness: There should be a general policy of openness about developments, practices and policies with respect to personal data. Means should be readily available of establishing the existence and nature of personal data, and the main purposes of their use, as well as the identity and usual residence of the data controller
  7 Individual Participation: An individual should have the right to obtain confirmation of whether or not a data controller has data relating to them, and to have access to that data within a reasonable time and cost, and to be able to challenge any denial, and to be
    able to challenge data relating to themselves and, if the challenge is successful, to have the data erased, rectified, completed or amended
  8 Accountability: A data controller should be accountable for complying with measures giving effect to these principles

Thank You