Overview. Data Mining Algorithms. Going Through Loops. The CRISP-DM Model. Six Steps. Business Understanding. Data Mining End to End.




Transcription:

Data Mining Algorithms
Data Mining End to End
Graham Williams
Principal Data Miner, ATO
Adjunct Associate Professor, ANU
http://togaware.com Copyright © 2006, Graham J Williams

Overview
1 Process
    CRISP-DM
2 Hot Spots
    NRMA
    Medicare
3 Evaluation and Communication
    Communicating Performance
4 Privacy
    Protecting Privacy

The CRISP-DM Model
Cross Industry Standard Process for Data Mining: http://www.crisp-dm.org/
Developed by NCR, Daimler-Benz, ISL, OHRA
Define and validate a Data Mining Process Model
    applicable in diverse industry sectors
    industry and tool neutral
    large data mining projects executed faster, cheaper, more reliably and more manageably
Life cycle of six (iterative) phases

Going Through Loops
Cyclic nature of data mining:
    Every step of a data mining process can lead to revisiting any one of the previous steps
    A DM process continues after a solution has been deployed
    The lessons learnt can trigger new, often more focused business questions
    Subsequent data mining processes benefit from experiences of previous ones

Six Steps
1 Business Understanding (25%)
2 Data Understanding (20%)
3 Data Preparation (25%)
4 Modelling (10%)
5 Evaluation (20%)
6 Deployment
[Chart: 1 Find Objectives 20%; 2 Data Preparation 60%; 3 Data Mining 10%; 4 Analysis 10%]

Business Understanding
We had better make sure we are addressing a real business problem
Initial phase focuses on understanding project objectives and requirements from a business perspective
This knowledge is converted into a data mining problem definition
Develop a preliminary plan designed to achieve the objectives

Data Understanding
Understand what data is available and its semantics
Initial data collection
Familiarisation with the data
    identify data quality problems
    discover first insights into the data
    detect interesting subsets to form hypotheses for hidden information

Data Preparation
Bring together the data and get it into shape for mining
Construct the mining dataset
    Derived from the initial raw dataset(s)
Data preparation tasks:
    table, record, and attribute selection
    generation of derived features
    data transformation
    data cleaning

Preparing to Mine
Issues to be dealt with include:
Data Quality
    missing data
    noisy data
    lead to inconsistent or too general/specific discoveries
Data Cleaning
    duplicates
    inconsistencies
    identify and merge the same entities

Modelling
Now the data mining begins!!!
Select various modelling techniques
Apply and calibrate modelling techniques
Typically there are several techniques for the same data mining problem
Some techniques have specific requirements on the form of data and require stepping back to the data preparation phase

Evaluation
How do we know we have a useful outcome?
Evaluate the model and review the steps executed to construct the model
Does the model properly achieve the business objectives?
Is there some important business issue that has not been sufficiently considered?
Decide on the use of the data mining results

Deployment
No point to data mining unless we action the outcomes
Deployment may be:
    Generate a report of the discoveries made
    Implement changes in the processes of the organisation
    Implement a repeatable data mining process
For successful deployment the customer must understand the actions to be carried out in order to actually make use of the created models
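
The data quality and data cleaning issues listed under Preparing to Mine (missing data, duplicates, merging records for the same entity) can be illustrated with a minimal Python sketch. The record layout (name, postcode, age), the matching rule and the function names are assumptions made for this illustration only; they do not come from the slides, and real entity matching is usually far more involved.

```python
from collections import defaultdict


def normalise(name):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def clean(records):
    """Merge duplicate entities and count the missing values that remain.

    records: list of dicts with the hypothetical keys 'name', 'postcode', 'age'.
    Two records are treated as the same entity when the normalised name and
    the postcode agree (an assumed, deliberately simple matching rule).
    """
    merged = {}
    for rec in records:
        key = (normalise(rec["name"]), rec["postcode"])
        if key not in merged:
            merged[key] = dict(rec, name=normalise(rec["name"]))
        else:
            # Fill gaps in the record we kept using values from the duplicate.
            for field, value in rec.items():
                if merged[key].get(field) is None and value is not None:
                    merged[key][field] = value

    cleaned = list(merged.values())

    # Report how many values are still missing per field after merging.
    missing = defaultdict(int)
    for rec in cleaned:
        for field, value in rec.items():
            if value is None:
                missing[field] += 1
    return cleaned, dict(missing)


# Hypothetical raw data: the first two rows describe the same person.
records = [
    {"name": "Ann  Lee", "postcode": "2600", "age": None},
    {"name": "ann lee", "postcode": "2600", "age": 41},
    {"name": "Bob Ray", "postcode": "2612", "age": None},
]
cleaned, missing = clean(records)
```

Here the two "Ann Lee" rows collapse into one record that keeps the known age, and the report shows one age still missing, which is the kind of issue that would send a project back to the data understanding phase.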

Summary
The KDD Process
    Iterative process requiring multiple loops
    Time consuming
    Mining is one small step
    Data issues are crucial to success

Motor Vehicle Insurance
Insurance premium setting and risk rating
Actuaries study data and domain for general understanding of risk
Several million transactions annually
Consider more than the traditional small number of factors
Data mining can explore very large collections of data, both entities and features

Cluster then Describe then Measure
The Hot Spots methodology combines Cluster Analysis and Decision Trees to symbolically identify candidate regions of a dataset
[Figure: decision tree splitting on ClaimCost, SumRqst, Model [holden, ford], ClmType, Postcode and Cubic, with leaves labelled by cluster (C1 to C4) and cost]

Find the Interesting Groups
Rule 1: NCB < 6 and Age 24 and Address is Urban
Rule 23: Age > 57 and Vehicle in {Utility, Station Wagon}

Nugget  Claims  Total  Proportion  Average Cost  Total Cost
1       15      14     11          37            545,
2       14      23     6           38            535,
3       5       25     2           44            13,
4       1       12     8           79            79,1
5       2       34     6           53            116,
6       65      52     13          44            28,7
7       5       5      1           68            2,3
6       8       14     59          35            2,8,
All     38      72     5           3             12,,

Finding the Interesting Groups
Evaluate the large collection of groups (or Hot Spots) to find those that are important to the core business
Nuggets flagged (Y) By Claims, By Proportion or By Average Cost: 2, 3, 19, 24 and 36 on one criterion; 35 and 4 on two criteria; 34 on all three
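
The "measure" step above, where each nugget is scored and then flagged by claims, by proportion and by average cost, can be sketched in Python as follows. This is only an illustration under stated assumptions: the nugget figures are invented, "proportion" is assumed to mean claims divided by the number of policies in the group, and flagging the top k nuggets per criterion is an assumed selection rule rather than the one used in the Hot Spots work.

```python
def measure(nuggets):
    """Compute per-nugget measures from raw counts.

    Each nugget is a dict with hypothetical keys: 'id', 'claims',
    'total' (policies in the group) and 'total_cost'.
    """
    rows = []
    for n in nuggets:
        claims = n["claims"]
        rows.append({
            "id": n["id"],
            "claims": claims,
            # Assumed definition: share of the group's policies that claimed.
            "proportion": claims / n["total"] if n["total"] else 0.0,
            "average_cost": n["total_cost"] / claims if claims else 0.0,
        })
    return rows


def top_k(rows, key, k):
    """Return the ids of the k nuggets ranking highest on the given measure."""
    ranked = sorted(rows, key=lambda r: r[key], reverse=True)
    return {r["id"] for r in ranked[:k]}


def interesting(rows, k=2):
    """Map nugget id -> list of criteria on which it is in the top k."""
    criteria = ("claims", "proportion", "average_cost")
    flags = {r["id"]: [] for r in rows}
    for criterion in criteria:
        for nugget_id in top_k(rows, criterion, k):
            flags[nugget_id].append(criterion)
    return {nid: hits for nid, hits in flags.items() if hits}


# Invented example data, not taken from the slides.
nuggets = [
    {"id": 1, "claims": 150, "total": 1400, "total_cost": 555_000},
    {"id": 2, "claims": 40, "total": 200, "total_cost": 320_000},
    {"id": 3, "claims": 5, "total": 500, "total_cost": 34_000},
    {"id": 4, "claims": 800, "total": 1400, "total_cost": 2_800_000},
    {"id": 5, "claims": 20, "total": 1000, "total_cost": 60_000},
]
flagged = interesting(measure(nuggets), k=2)
```

With this data, nugget 4 is flagged for claims and proportion, nugget 2 for proportion and average cost, and nugget 5 is not flagged at all, mirroring the table of Y marks in which only some nuggets stand out on one or more criteria.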

Operationalise
Identify groups that are:
High Risk
    Very high dollars per claim
    Large percentage of claims in the group
Low Risk
    Very few claims from the group
    Claims are low in dollars

Health Insurance Commission
Universal Health Coverage
Terabytes of patient claims since the inception of Medicare
Inappropriate Provider practices an ongoing focus
Exploration of public fraud (including doctor shoppers)
Exploration of the practice of pathology

Cluster/Describe/Measure
[Figure: decision tree splitting on ClaimCost, SumRqst, Model [holden, ford], ClmType, Postcode and Cubic, with leaves labelled by cluster (C1 to C4) and cost]

Cluster/Describe/Deliver
Rule 1: Age is between 28 and 35 and Weeks 5
Rule 2: Weeks < 1 and Benefits > $35

Nugget  Size  Age  Gender  Services  Benefits  Weeks  Hoard  Regular
1       9     3    F       1         3         2      1      1
2       15    3    F       24        841       4      2      4
3       12    65   M       7         22        2      1      1
4       8     45   F       3         75        1      1      1
5       9     1    M       12        1125      1      5      2
6       8     55   M       8         55        7      1      9
28      3     25   F       15        45        15     2      6
All     4,    45           8         3         3      1      1

Claim Hoarders
A distinct group of behaviour identified as Claim Hoarders
But there may be many millions of these individuals
[Figure: three activity plots labelled pina, pinb and pinc]

Medicare Regulars
Group of patients with very regular activity:
[Figure: two activity plots labelled pinad and pind]
Remove non-cash payments!!!

Operationalise
The fraud identified was investigated and appropriate action taken
    Perpetrators prosecuted
    Funds recovered
    Processes improved to cross validate data

The Importance of Communication
Selling the story to management
Do we present the model or the outcomes?
Senior management in something like the ATO is necessarily cautious, protecting the integrity of the country's revenue system
Need to demonstrate and prove the performance and robustness of models before deployment

Options in Rattle: Confusion Matrix
A simple instrument to convey predictive performance
But quite a blunt instrument

Confusion matrix rpart model on audit.csv [test] (counts):
            Actual
Predicted     0     1
        0   428    56
        1    44    72

Confusion matrix rpart model on audit.csv [test] (%):
            Actual
Predicted     0     1
        0    71     9
        1     7    12

Options in Rattle: Risk Charts
Developed specifically for the ATO
Capture both the score exhibited through probability, and the size of the Risk associated with each case!
Often, it is the Risk that is of most interest
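
As a small worked check of the confusion matrix quoted above, the Python sketch below converts the four counts (428, 56, 44, 72) into whole-number percentages and an overall accuracy. The function names are chosen for this illustration and are not Rattle's API; the calculation does not depend on which axis is actual and which is predicted.

```python
# Counts from the rpart confusion matrix on the test set, row by row.
counts = [[428, 56],
          [44, 72]]


def to_percent(matrix):
    """Express every cell as a whole-number percentage of all cases."""
    total = sum(sum(row) for row in matrix)
    return [[round(100 * cell / total) for cell in row] for row in matrix]


def accuracy(matrix):
    """Fraction of cases on the diagonal, i.e. predicted class equals actual class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


percent = to_percent(counts)   # [[71, 9], [7, 12]]
acc = accuracy(counts)         # 500 / 600
```

The percentages reproduce the second matrix (71, 9, 7, 12), and they sum to 99 rather than 100 only because each cell is rounded. The single accuracy figure of about 83% says nothing about how much money is at stake in each misclassified case, which is one way to read the remark that the confusion matrix is a blunt instrument.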

Risk Chart: RPart
Risk Chart: RF
Risk Chart: SVM
[Figures: one risk chart per model, each showing a Risk (red) curve and a Recall (green) curve]

Risk Chart: Textual Comparison
The area under the Risk and Recall curves for rpart model
    Area under the Risk (red) curve: 79% (0.790)
    Area under the Recall (green) curve: 76% (0.762)
The area under the Risk and Recall curves for rf model
    Area under the Risk (red) curve: 78% (0.780)
    Area under the Recall (green) curve: 78% (0.779)
The area under the Risk and Recall curves for ksvm model
    Area under the Risk (red) curve: 78% (0.777)
    Area under the Recall (green) curve: 77% (0.774)
Which is best?

Privacy and Data Mining
Laws in many countries directly affect Data Mining and it is worth being aware of them; penalties are often severe
The OECD Principles of Data Collection were drafted in 1980
    They embody guiding principles for governments
    Revised for APEC 2003 as part of the Asia-Pacific Privacy Charter Initiative
Data mining by the Australian Taxation Office is governed by data matching and privacy protocols, and independently overseen by Privacy Commissioner, ANAO, and others

What are we trying to protect?
Protect, amongst others:
    Religious freedom
    Freedom from racial discrimination
    Personal medical records
    Employment history
    Political freedom

Centrelink and Privacy Breaches
In August 2006 Centrelink (Australian Social Security Agency) announced it had identified over 5 privacy breaches committed by staff
Identified by monitoring and mining database access logs
Activities
    looking at own personal records
    looking at family, friends, neighbours
    obtaining information to be sold
    changing information for financial gain
Consequences
    counselling
    reduced pay/position
    lose job

AOL Privacy Breach
AOL researchers thought to release anonymised web query logs for researchers (August 2006)
Covered 25, users and 2 million queries
(Compare with US DoJ demand that Google, AOL, etc, supply such data for them to monitor their citizens)
Usernames converted to numeric IDs
But, aggregate queries for a single numeric ID enough to identify individuals
    multiple queries paint a picture
    private financial situation: property and bank loan enquiries
    indication of criminal activity or research for a book?
    health: pregnancy, home loan, dog vomit + uncooked pasta
"This was a screw-up and we're angry and upset about it" (AOL)

Criminal Intent or Research
Would we use the following to investigate someone?
    how to change brake pads on scion xb
    25 us open cup florida state champions
    how to get revenge on a ex
    how to get revenge on a ex girlfriend
    how to get revenge on a friend who you over
    replacement bumper for scion xb
    florida department of law enforcement
    crime stoppers florida
Perhaps someone researching a novel!

A Distressed Victim?
A quite distressing example from the AOL disclosure:
    casey middle school
    surgical help for depression
    can you adopt after a suicide attempt
    gynecology oncologists in new york city
    Fishman David Dr 16 E 34th St, New York, 116
    how to tell your family you're a victim of incest
    how long will the swelling last after my tummy tuck
    teaching positions in denver colorado
    divorce laws in ohio

Privacy is Important
Privacy is important to ensure freedom from oppression
Privacy can be breached either accidentally or purposefully
How much should we allow our governments to breach our privacy: what is the right trade-off?
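
The point made in the AOL example, that replacing usernames with numeric IDs does not anonymise a query log, can be illustrated with a short Python sketch. The log entries, usernames and function names below are invented for the illustration and have no connection to the released data; the sketch simply assumes that each user is mapped to one stable numeric ID, as the slides describe.

```python
from collections import defaultdict
from itertools import count


def pseudonymise(log):
    """Replace each username with a numeric id; the same user always gets the same id."""
    ids = {}
    next_id = count(1)
    released = []
    for user, query in log:
        if user not in ids:
            ids[user] = next(next_id)
        released.append((ids[user], query))
    return released


def profile(released):
    """Group the released queries by numeric id, as any recipient of the data could."""
    by_id = defaultdict(list)
    for user_id, query in released:
        by_id[user_id].append(query)
    return dict(by_id)


# Invented log entries for illustration only.
log = [
    ("alice", "plumbers in smallville"),
    ("bob", "cheap flights"),
    ("alice", "smallville high school reunion 1998"),
    ("alice", "home loan calculator"),
]
released = pseudonymise(log)
profiles = profile(released)
```

No username appears in the released data, yet all three of the first user's queries sit together under ID 1, and taken together they suggest a town, an age range and a financial situation. Because the ID is stable, grouping by it rebuilds exactly the per-person picture the pseudonymisation was meant to hide.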

Principles of Data Collection
1 Collection Limitation: There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject
2 Data Quality: Personal data should be relevant to the purposes for which they are to be used, and, to the extent necessary for those purposes, should be accurate, complete and kept up-to-date
3 Purpose Specification: The purposes for which personal data are collected should be specified not later than at the time of collection and the subsequent use limited to the fulfilment of those purposes or such others as are not incompatible with those purposes and as are specified on each occasion of change of purpose
4 Use Limitation: Personal data should not be disclosed, made available or otherwise used for purposes other than those specified in accordance with [Principle 3] except:
    with the consent of the data subject; or
    by the authority of law
5 Security Safeguards: Personal data should be protected by reasonable security safeguards against such risks as loss or unauthorised access, destruction, use, modification or disclosure of data
6 Openness: There should be a general policy of openness about developments, practices and policies with respect to personal data. Means should be readily available of establishing the existence and nature of personal data, and the main purposes of their use, as well as the identity and usual residence of the data controller
7 Individual Participation: An individual should have the right to obtain confirmation of whether or not a data controller has data relating to them, and to have access to that data within a reasonable time and cost, and to be able to challenge any denial, and to be able to challenge data relating to themselves and, if the challenge is successful, to have the data erased, rectified, completed or amended
8 Accountability: A data controller should be accountable for complying with measures giving effect to these principles

Thank You