IRDS: Data Mining Process. Charles Sutton University of Edinburgh

Similar documents
Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining

Introduction to Data Mining

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Perspectives on Data Mining

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Database Marketing, Business Intelligence and Knowledge Discovery

Data Mining + Business Intelligence. Integration, Design and Implementation

Maschinelles Lernen mit MATLAB

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Data Mining System, Functionalities and Applications: A Radical Review

PREDICTING STOCK PRICES USING DATA MINING TECHNIQUES

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

TIETS34 Seminar: Data Mining on Biometric identification

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

SURVEY REPORT DATA SCIENCE SOCIETY 2014

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining. Data Analysis and Knowledge Discovery

Machine Learning Capacity and Performance Analysis and R

Cost Drivers of a Parametric Cost Estimation Model for Data Mining Projects (DMCOMO)

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Data Mining: Overview. What is Data Mining?

Data Mining for Fun and Profit

Data Mining for Digital Forensics

Performing a data mining tool evaluation

A Methodology Based on Business Intelligence for the Development of Predictive Applications in Self-Adapting Environments

Assessing Data Mining: The State of the Practice

CAS CS 565, Data Mining

Data Mining and Application in Accounting and Auditing

Data Mining Applications in Higher Education

The University of Jordan

Start-up Companies Predictive Models Analysis. Boyan Yankov, Kaloyan Haralampiev, Petko Ruskov

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Data Mining: An Introduction

Introduction to Data Mining

TDWI Best Practice BI & DW Predictive Analytics & Data Mining

Why do statisticians "hate" us?

Azure Machine Learning, SQL Data Mining and R

OUTLIER ANALYSIS. Data Mining 1

An Introduction to Advanced Analytics and Data Mining

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Machine Learning CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Introduction. A. Bellaachia Page: 1

Data Mining Analytics for Business Intelligence and Decision Support

Machine Learning Introduction

Machine Learning: Overview

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

Machine Learning and Statistics: What s the Connection?

Steven C.H. Hoi School of Information Systems Singapore Management University

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

Tom Khabaza. Hard Hats for Data Miners: Myths and Pitfalls of Data Mining

Data, Measurements, Features

Investigating Data Mining in MATLAB

Framing Business Problems as Data Mining Problems

Knowledge Acquisition in Databases

Data Isn't Everything

Dynamic Data in terms of Data Mining Streams

CPSC 340: Machine Learning and Data Mining. Mark Schmidt University of British Columbia Fall 2015

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Gerard Mc Nulty Systems Optimisation Ltd BA.,B.A.I.,C.Eng.,F.I.E.I

Collaborative Filtering. Radek Pelánek

20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining

April 2016 JPoint Moscow, Russia. How to Apply Big Data Analytics and Machine Learning to Real Time Processing. Kai Wähner.

The Data Mining Process

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Data Science and Business Analytics Certificate Data Science and Business Intelligence Certificate

Learning is a very general term denoting the way in which agents:

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

Data Mining and Machine Learning in Bioinformatics

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

CUSTOMER Presentation of SAP Predictive Analytics

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Data mining life cycle in fraud auditing

MA2823: Foundations of Machine Learning

Class 10. Data Mining and Artificial Intelligence. Data Mining. We are in the 21 st century So where are the robots?

Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept

How To Understand The History Of Navigation In French Marine Science

How To Understand The Data Mining Process Of Andsl

Certified Information Professional 2016 Update Outline

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI

Data Mining Techniques and Opportunities for Taxation Agencies

Video and Image Analytics for Business, Consumer and Social Insights. Jialie Shen School of Information Systems, SMU

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Machine Learning What, how, why?

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Statistical Models in Data Mining

Audit Data Analytics. Bob Dohrer, IAASB Member and Working Group Chair. Miklos Vasarhelyi Phillip McCollough

SPATIAL DATA CLASSIFICATION AND DATA MINING

Qi Liu Rutgers Business School ISACA New York 2013

The Scientific Data Mining Process

Transcription:

IRDS: Data Mining Process Charles Sutton University of Edinburgh

Data Science Our working definition Data science is the study of the computational principles, methods, and systems for extracting knowledge from data. A relatively new term. A lot of current hype If you have to put science in the name Component areas have a long history machine learning databases statistics optimization natural language processing computer vision speech processing applications to science, business, health. Difficult to find another term for this intersection

The term data mining Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel ways that are both understandable and useful to the data owner. Hand, Mannila, Smyth, 2001

The term data mining Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel ways that are both understandable and useful to the data owner. Hand, Mannila, Smyth, 2001 not collected for the purpose of your analysis

The term data mining Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel ways that are both understandable and useful to the data owner. Hand, Mannila, Smyth, 2001 Many easy patterns already known e.g., pregnant example from association rule mining

The term data mining Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel ways that are both understandable and useful to the data owner. Hand, Mannila, Smyth, 2001 Tradeoff between predictive performance human interpretability Ex: neural networks vs decision trees

Before I get too far ahead of myself

What problem am I trying to solve?

Problem Types Visualization Prediction: Learn a map x > y Classification: Predict categorical value Regression: Predict a real value Others Collaborative filtering Learning to rank Structured prediction Description Clustering Dimensionality reduction Density estimation Finding patterns Association rule mining Detecting anomalies / outliers supervised learning unsupervised learning

Prediction Examples Classification Advertising Ex: Given the text of an online advertisement and a search engine query, predict whether a user will click on the ad Document classification Ex: Spam filtering Object detection Ex: Given an image patch, dose it contain a face? Regression Predict the final vote in an election (or referendum) from polls Predict the temperature tomorrow given the previous few days Sometimes augmented with other structure / information Structured prediction Spatial data, Time series data Ex: Predicting coding regions in DNA Collaborative filtering (Amazon, Netflix) Semi-supervised learning

Description Examples Clustering Assign data into groups with high intra-group similarity (like classification, except without examples of correct group assignments) Ex: Cluster users into groups, based on behaviour Social network analysis Autoclass system (Cheeseman et al. 1988) discovered a new type of star, Dimensionality reduction Eigenfaces Topic modelling Discovering graph structure Ex: Transcription networks Ex: JamBayes for Seattle traffic jams Association rule mining Market basket data Computer security

Data Science Process Preparation Understanding the problem What are your goals? What is the signal in the data? What would success be? Is it likely? Collect data Think about / mitigate biases: selection, non-response, etc. Think about causes of noise or missing data Data preparation Data wrangling Scraping Data integration Data cleaning Data verification, scrubbing, auditing Data exploration Exploratory visualisation

Data Science Process (part 2) Development and Debugging Feature engineering Model building Choose algorithm Train / fit to data (possibly parallel, distributed, GPU) Hyperparameter exploration Evaluation and debugging Evaluation metrics and procedures (e.g., cross-validation, temporal or stratified sampling of development sets) Developing diagnostics Revising previous steps in process Sensitivity to model parameters Evaluative visualization

Data Science Process (parts 3a and 3b) Automated Systems Deploying model Serving infrastructure Glue and pipelines Monitoring Detecting concept drift Feedback effects direct and indirect Upstream configuration and format changes Hidden dependencies Human Interpretation Interpretation results Communication with users / domain experts Writing reports Presentation visualizations Evangelisation Building trust Revising analysis Communication with decision makers

Roadmap In the next few weeks, we ll talk about Visualization Feature extraction Evaluation and debugging

Further Reading The process flowchart combines ideas from: D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems, 2015. K. Wagstaff. Machine learning that matters. In International Conference on Machine Learning (ICML), 2012. P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth. CRISP-DM 1.0 Step-by-step data mining guides. 2000. These readings are not examinable. (there is no exam for this course)