Data Analytics and Business Intelligence (8696/8697)




Data Analytics and Business Intelligence (8696/8697)
Introducing Data Mining

Graham.Williams@togaware.com
Chief Data Scientist, Australian Taxation Office
Adjunct Professor, Australian National University
Adjunct Professor, University of Canberra
Fellow, Institute of Analytics Professionals of Australia
http://datamining.togaware.com

http://togaware.com Copyright 2014, Graham.Williams@togaware.com

Overview
  1 Introductions
  2 Setting the Scene
  3 Data Mining Terminology
  4 Data Mining Applications
  5 Data Mining Technologies
  6 Data Mining Techniques
  7 Issues

1 Introductions

Beginning Data Mining: A Little History
  Database Research: where to now?
  Machine Learning: algorithms for symbolic learning.
  Statistics: long heritage and role of sampling.

  Data Mining is one of the most important core technologies being developed today, in the same league as medical research, energy research, and environmental research. (David Hand, 2006)

  Data mining is a core technology underlying new developments in medicine, biotechnology, finance, and environmental research.

Beginning Data Mining: Data Mining in Practice
  Data Mining research: CSIRO, 1995
  Data Mining in practice: Health Insurance Commission, 1995
  Data Mining projects with:
    Esanda Finance
    NRMA
    Mount Stromlo
    Health Insurance Commission
    Commonwealth Bank
    Department of Health
    Australian Taxation Office
    Australian Customs Service
    Department of Veterans' Affairs
    ...

Beginning Data Mining: Data Mining in Use Today
  Amazon: What else should I buy?
  eBay: How to support sellers
  Facebook and LinkedIn: Who should I be friends with?
  Google: Who, what, when, how (Google Now)
  Finance: Bank products and fraud
  Health care: Best treatment
  Insurance: Fraud
  Government: Better services and compliance

2 Setting the Scene

Data is Fundamental
  Sherlock Holmes: "It is a capital mistake to theorize before one has data. Insensibly, one begins to twist facts to suit theories, instead of theories to suit facts."
  A Scandal in Bohemia (1891), Arthur Conan Doyle

The Economist, 21 September 2006
  Combine dozens of clues to spot suspicious patterns in mountains of transactions.
  Car Insurance: the Monday morning fraudsters. Separating the genuine from the fraud:

  Clue                                               Fraud likelihood
  Claimant nearly injured (driver side impact)       low
  Low resale value                                   high
  Low resale + own a luxury car                      low
  Low resale + own a luxury car, expired insurance   no
  Angry call after claim demanding action            high
  Customer calls after the 20th of the month         low

  But so many combinations!
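The closing remark on this slide, that there are so many combinations, can be made concrete with a small count. The sketch below is an illustration only: it assumes each clue is simply present or absent, and the shortened clue names and the figure of 36 clues (three dozen) are assumptions chosen for the example, not numbers from the slide.

```python
from itertools import combinations

# Clue names paraphrased from the slide; treating each as a yes/no flag
# is a simplifying assumption made for this illustration.
clues = [
    "claimant nearly injured",
    "low resale value",
    "owns a luxury car",
    "expired insurance",
    "angry call after claim",
    "calls after the 20th of the month",
]


def count_combinations(n_clues: int) -> int:
    """Number of non-empty subsets of n_clues yes/no clues."""
    return 2 ** n_clues - 1


# Enumerating the subsets explicitly agrees with the closed form for small n.
explicit = sum(
    1
    for k in range(1, len(clues) + 1)
    for _ in combinations(clues, k)
)

print(explicit, count_combinations(len(clues)))
print(count_combinations(36))  # three dozen clues
```

Six clues already give 63 possible combinations, and three dozen give tens of billions, which is why such patterns are searched for by algorithm rather than by hand.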

Credit Card Fraud
  Association for Payment Clearing Services

  Clue                                                Fraud likelihood
  Buy petrol and pay by card using a machine          hmmm...
  Soon after buy a diamond ring                       very high
  Purchase sports shoes                               marginally high
  Purchase two                                        higher
  Teenage sizes                                       high
  City has vibrant black market                       highest

  Professional fraudsters are innovative and dynamic, so we need to monitor and score individual cards and transactions.
  Block cards with sudden spikes in purchases.
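One simple way to read "score individual cards" and "block cards with sudden spikes" is to compare each new purchase with that card's own spending history. The sketch below is a hypothetical illustration, not the method used by any card issuer: the function names, the z-score rule, the threshold of 3 standard deviations, and the sample amounts are all assumptions.

```python
from statistics import mean, stdev


def spike_score(history, amount):
    """Standard deviations by which `amount` exceeds the card's past spend."""
    if len(history) < 2:
        return 0.0  # too little history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0 if amount <= mu else float("inf")
    return (amount - mu) / sigma


def should_block(history, amount, threshold=3.0):
    """Flag the card when the new purchase is far above its usual level."""
    return spike_score(history, amount) > threshold


# Hypothetical history of petrol purchases on one card.
petrol_history = [40.0, 55.0, 48.0, 52.0, 45.0]

print(should_block(petrol_history, 50.0))    # an ordinary purchase
print(should_block(petrol_history, 4000.0))  # a diamond ring soon after
```

A per-card baseline like this is only a starting point; the slide's other clues (item type, sizes, location) would enter a real score as additional features.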

Digital Footprints
  We leave behind us, every day, a growing digital footprint:
    Store purchases: loyalty cards and credit cards
    Building access
    Computer login
    E-toll records
    Mobile phone
    Cameras with sophisticated image recognition
  We need due diligence in the collection and analysis of data: privacy protocols.

Taxation Records - Historic Perspective
  Data Collection: 5500 years ago the Sumerian (Iraq) and Elamite (Iran) peoples marked their tax records onto dried mud tablets.
  Data Analysis: Since then people have sought ways to add value to this otherwise statically recorded information, either for good or bad!
  Source: http://www.crystalinks.com/cuneiformtablets.html

3 Data Mining Terminology

Machine Learning as Model Building
  Data Mining is a data-driven approach to understanding the world.
  It deploys machine learning and statistical modelling approaches to build models from large data collections.
  We are aiming to build models of the real world, as an architect would build models to get a better understanding of how things will look.

Data Mining as Model Building
  Build models of the world (regression, decision trees, neural networks, association rules, fuzzy systems, random forests, support vector machines, boosted stumps, ...) from data that represent snippets of information about the world.
  Use these models to understand and discover patterns of interest that may provide knowledge deployable in improving business processes.

  Definitions of data mining:
    The non-trivial extraction of novel, implicit, and actionable knowledge from large databases, in a timely manner.
    Knowledge from data any way you can.
    Technology to enable data exploration, data analysis, and data visualisation of very large databases at a high level of abstraction, without a specific hypothesis in mind.

Data Mining Adds Value to Data
  Competitive business environment: knowledge is power
  Improved knowledge for better decision support
  Volumes of data: unexplored wealth of information
  Data sources accessible: massive data warehouses
  Available data mining tools and computational power

4 Data Mining Applications

Business Problems
  Customer Profiling
  Customer Segmentation
  Direct Marketing
  Customer Retention
  Basket Analysis
  Fraud Detection
  Compliance
  Adverse Drug Reactions
  ...

Customer Relationship Management
  All-of-business view of the customer
  E-commerce providing significant data for mining
  Amazon.com: a data mining company!
  ATO (and other Government Departments): CRM through data mining

Fraud
  Disguised amongst the mass of data
  A small percentage of all transactions
  A small percentage of a very large budget: 0.1% of $1 billion = $1 million
  NRMA: Fraud costs $70 per policy
  Dob in a fraud: Canberra Times, 16 Aug 2003
  ATO and serious non-compliance

Characteristics
  Extremely large databases
  Discovery of the non-obvious
  No specific hypothesis (c.f. statistical hypothesis testing)
  Useful knowledge to improve processes
  Too large to be done manually

5 Data Mining Technologies

Technologies
  [Diagram: Data Mining shown together with the following technologies]
  Machine Learning
  Database
  Web Services
  High Performance Computers
  Statistics
  Computational Mathematics
  Intelligent Systems

A Framework for Data Mining
A formal framework with three elements:
- Knowledge Representation: language and sentences
- Measure of Goodness
- Heuristic Search: near infinite search space
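
The three elements can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy data, the single-threshold rule used as the language, and accuracy as the measure of goodness. Because this toy language has only a handful of sentences, the search simply enumerates them all; real model languages are far too large for that, which is why heuristic search is needed.

```python
from dataclasses import dataclass

# Toy labelled data (hypothetical): (age, income in $K) -> purchased?
DATA = [
    ((22, 25), True),
    ((25, 28), True),
    ((31, 40), False),
    ((45, 60), False),
    ((28, 22), True),
    ((52, 24), False),
]
FEATURES = ("age", "income")


@dataclass(frozen=True)
class Stump:
    """Knowledge representation: one sentence in the language,
    'predict True if feature <= threshold'."""
    feature: int
    threshold: float

    def predict(self, x):
        return x[self.feature] <= self.threshold


def goodness(model, data):
    """Measure of goodness: fraction of examples classified correctly."""
    return sum(model.predict(x) == y for x, y in data) / len(data)


def search(data):
    """Search: enumerate every candidate sentence and keep the best one.
    Thresholds are taken from the observed feature values."""
    candidates = [
        Stump(f, x[f]) for f in range(len(FEATURES)) for x, _ in data
    ]
    return max(candidates, key=lambda m: goodness(m, data))


best = search(DATA)
print(FEATURES[best.feature], best.threshold, goodness(best, DATA))
```

Swapping in a richer language (trees, rules, networks) or a different measure changes the model family but not the overall shape of the framework.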

Machine Learning
Representing Knowledge
Languages:
- Decision Trees
- Classification Rules
- Neural Networks
- Support Vector Machine
- K Nearest Neighbours
- Cluster Centroids
- Association Rules
Heuristic search for the best sentences (models) in the language by some measure (criteria of best model).
[Figure: example model representations; surviving labels: Age, Gender, Distance, Income, GP Visits, Risk, Hospital?, Support Vectors]

Statistics
Supporting sampling and uncertainty and modelling:
- Data, Counting, Probabilities, Hypothesis testing
- Exploratory data analysis
- Predictive models: CART, MARS, PRIM
- Statistical thinking and avoiding data dredging
- Computational requirements: sampling
- Data mining attempts to avoid sampling
- Data mining: hypothesis generation c.f. hypothesis testing
- Be aware of the Bonferroni correction: when testing 2 hypotheses at an overall significance level of 0.05, use a p-value threshold of 0.05 / 2 = 0.025 for each test.
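
A short sketch of the Bonferroni arithmetic follows. The overall significance level of 0.05 is an assumption (it is the level implied by the 0.025 threshold for 2 hypotheses), and the function names and p-values are hypothetical. The last function shows why the correction matters, under the further assumption that the tests are independent.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the overall level at alpha."""
    return alpha / n_tests


def significant(p_values, alpha=0.05):
    """Flag each p-value against the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


def familywise_error(alpha, n_tests):
    """Chance of at least one false positive with NO correction,
    assuming the tests are independent."""
    return 1 - (1 - alpha) ** n_tests


print(bonferroni_threshold(0.05, 2))   # 0.025, as on the slide
print(significant([0.03, 0.01]))       # hypothetical p-values
print(familywise_error(0.05, 20))      # uncorrected risk over 20 tests
```

With 20 uncorrected tests at 0.05 each, the chance of at least one spurious "discovery" is roughly 64%, which is the data dredging risk the slide warns about.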

Visualisation
- Data visualisation is key to understanding our data, its distribution, and models
- Exploratory data analysis has a long history in Statistics, now significantly enhanced by computer science
- Visualisation in Data Mining:
  - Understanding the data
  - Visualising the process
  - Visualising the results of mining
- See Minard's visualisation of Napoleon's march on Moscow.

High Performance Computing
- Most algorithms are computationally expensive
- Very large datasets versus slow algorithms
- Parallel computers
- Multiprocessor computers
- Moving back to 64bit processors
  - 32bit is limited to 4GB of data
  - 64bit is limited to 16 EB (exabytes) (16,000 PB = 16,000,000 TB = 16,000,000,000 GB)
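
The 4GB and 16 EB limits follow from the number of distinct byte addresses an n-bit pointer can hold. The sketch below works the arithmetic in binary units, which is an assumption about how the figures were derived; in those units the 64bit limit is 16,384 PiB, which the slide rounds to 16,000 PB.

```python
def addressable_bytes(bits):
    """Number of distinct byte addresses with a pointer of the given width."""
    return 2 ** bits


GIB = 2 ** 30  # gibibyte
PIB = 2 ** 50  # pebibyte
EIB = 2 ** 60  # exbibyte

print(addressable_bytes(32) // GIB, "GiB")  # 32bit limit
print(addressable_bytes(64) // EIB, "EiB")  # 64bit limit
print(addressable_bytes(64) // PIB, "PiB")  # same limit in pebibytes
```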

Software Engineering
- SE practices apply to DM practices
- Model development process
- Repeatability and Evidence
- Developing APIs for data and models
- SOAP
- Delivery of data mining through web services
- Interoperability through PMML (XML): Predictive Model Markup Language

6 Data Mining Techniques

Techniques
Data mining can be descriptive or predictive.
Descriptive: Identify human-interpretable patterns.
- Visualisation
- Concept Description
- Cluster Analysis
- Association Analysis
- Sequential Patterns
- Anomaly Detection
Predictive: Predict unknown or future outcomes.
- Regression: predict a continuous quantity
- Classification: predict class of entities
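
As an illustration of regression, here is a minimal sketch that fits a straight line by ordinary least squares and uses it to predict a continuous quantity. The choice of a linear model, the function names, and the data are all assumptions made for the example; the slide does not prescribe a particular method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict the continuous outcome for a new, unseen x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: number of visits (x) and amount spent (y).
visits = [1, 2, 3, 4]
spend = [3, 5, 7, 9]

model = fit_line(visits, spend)
print(model, predict(model, 5))
```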

Concept Description
Characterise and discriminate between different concepts (classes or groups) of entities (customers).
Example: Amazon.com is interested in the characteristics of people who tend to purchase many books so that they can provide a better service for them.
- Characterisation identifies general characteristics or features of a target class of data.
- Discrimination compares features of different classes for those that differentiate.

Cluster Analysis
Identify groups within data that maximise intraclass similarity and minimise interclass similarity.
Example: Cluster all customers based on personal characteristics and products they've purchased.
Building models from unlabelled data: unsupervised learning.
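
One common way to form such groups is k-means, sketched below in plain Python. This is an illustrative choice, not a method named on the slide; the customer points, the starting centroids, and the fixed iteration count are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its members."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, clusters


# Hypothetical customers: (purchases per year, average basket in $10s).
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(customers, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No labels are supplied at any point; the groups emerge from the distances alone, which is what makes this unsupervised.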

Association Analysis
Identify attribute-value conditions that occur frequently together in a given set of data.
Example: Age ∈ [20, 30] ∧ Income ∈ [$20K, $30K] ⇒ MP3player (Support = 2%; Confidence = 60%)
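
Support and confidence can be computed directly from counts, as in this minimal sketch. The ten records are invented toy data, so the support here (30%) differs from the 2% in the example; only the 60% confidence is chosen to match. Support is the fraction of all records that satisfy both sides of the rule, and confidence is the fraction of records satisfying the left-hand side that also satisfy the right-hand side.

```python
def support(records, items):
    """Fraction of records containing every item in `items`."""
    items = set(items)
    return sum(items <= r for r in records) / len(records)


def confidence(records, antecedent, consequent):
    """Of the records matching the antecedent, the fraction that
    also match the consequent."""
    both = set(antecedent) | set(consequent)
    return support(records, both) / support(records, antecedent)


# Hypothetical customer records as sets of attribute-value conditions.
RECORDS = [
    {"age 20-30", "income 20-30K", "mp3"},
    {"age 20-30", "income 20-30K", "mp3"},
    {"age 20-30", "income 20-30K", "mp3"},
    {"age 20-30", "income 20-30K"},
    {"age 20-30", "income 20-30K"},
    {"age 20-30", "income 30-40K", "mp3"},
    {"age 30-40", "income 20-30K"},
    {"age 30-40", "income 30-40K"},
    {"age 40-50", "income 40-50K", "mp3"},
    {"age 40-50", "income 40-50K"},
]

lhs = {"age 20-30", "income 20-30K"}
rhs = {"mp3"}
print(support(RECORDS, lhs | rhs), confidence(RECORDS, lhs, rhs))
```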

Anomaly Detection
Also referred to as Outlier Analysis.
Identify entities that do not comply with the general behaviour or model of data. Exceptions to the rule!
Example: Odd patterns can be easily hidden amongst 10 million transactions, but may be indicative of fraud.
... but not likely to find those who live at the edge
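
A very simple way to flag such exceptions is a z-score test: mark values that lie many standard deviations from the mean. This is an illustrative stand-in, not a method named on the slide, and the transaction amounts and the threshold of 3 are assumptions.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` population
    standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 20, 23, 18, 21, 20, 22,
           19, 21, 20, 22, 18, 23, 20, 21, 19, 20, 500]

print(zscore_outliers(amounts))
```

A single global threshold like this catches gross exceptions but will miss cases that sit just inside the normal range, which is the limitation the last line of the slide points at.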

Classification
Find a model which describes and distinguishes data classes or concepts for the purpose of using the model to predict the class of previously unseen entities.
Example: Build a model to predict whether customers are likely to purchase a download of a particular MP3 music file.
Building models from labelled data: supervised learning.
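
A minimal sketch of classification using k nearest neighbours, one of the model languages listed earlier in the lecture. The labelled customers, the two features, and k = 3 are hypothetical, and the features are left unscaled to keep the example short.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the class of `query` by majority vote among the k
    labelled examples closest to it (Euclidean distance)."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: (age, downloads last month) -> class.
TRAIN = [
    ((22, 5), "buys"),
    ((25, 7), "buys"),
    ((27, 6), "buys"),
    ((48, 1), "does not buy"),
    ((52, 0), "does not buy"),
    ((60, 2), "does not buy"),
]

print(knn_predict(TRAIN, (24, 6)))
print(knn_predict(TRAIN, (55, 1)))
```

The labels in the training data are what make this supervised: the model is judged by how well it reproduces known classes before it is applied to unseen customers.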

7 Issues

Data Challenges
1. Scalability
2. Dimensionality
3. Complex and Heterogeneous Data
4. Data Quality
5. Data Matching and Linking
6. Privacy
7. Streaming: Anytime Data Mining (give me the answer now)

Lecture Summary
- Data Mining is about Analysing Data;
- Technology that is all pervasive today;
- Draws on many disciplines;
- Key disciplines of Statistics and Machine Learning.

This document, sourced from DataMiningL.Rnw revision 436, was processed by KnitR version 1.6 of 2014-05-24 and took 1.1 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-21 20:22:53.