Introduction to Data Mining



Similar documents
Knowledge Discovery from Databases

Data Mining Solutions for the Business Environment

Database Marketing, Business Intelligence and Knowledge Discovery

DBTech Pro Workshop. Knowledge Discovery from Databases (KDD) Including Data Warehousing and Data Mining. Georgios Evangelidis

Syllabus. HMI 7437: Data Warehousing and Data/Text Mining for Healthcare

CS590D: Data Mining Chris Clifton

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Data Mining and Business Intelligence CIT-6-DMB. Faculty of Business 2011/2012. Level 6

Knowledge Discovery Process and Data Mining - Final remarks

Dynamic Data in terms of Data Mining Streams

City University of Hong Kong. Information on a Course offered by Department of Management Sciences with effect from Semester A in 2010 / 2011

Introduction to Data Mining

Knowledge Discovery from Data Bases Proposal for a MAP-I UC

Subject Description Form

2.1. Data Mining for Biomedical and DNA data analysis

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

KNOWLEDGE BASE DATA MINING FOR BUSINESS INTELLIGENCE

Introduction to Data Mining

Comparison of K-means and Backpropagation Data Mining Algorithms

not possible or was possible at a high cost for collecting the data.

Data Mining and Marketing Intelligence

CRISP-DM: The life cicle of a data mining project. KDD Process

DATA MINING USING PENTAHO / WEKA

Introduction. A. Bellaachia Page: 1

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining

DATA MINING ALPHA MINER

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

Perspectives on Data Mining

A Brief Tutorial on Database Queries, Data Mining, and OLAP

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Data Mining course Master in Information Technologies Enginyeria Informàtica Tomàs Aluja. LIAM EIO. UPC Lluis Belanche LSI. UPC

DATA MINING AND WAREHOUSING CONCEPTS

Rhodes University COMPUTER SCIENCE HONOURS PROJECT Literature Review. Data Mining with Oracle 10g using Clustering and Classification Algorithms

Three Perspectives of Data Mining

Data Mining with Microsoft SQL Server 2005

Significance of Data Warehousing and Data Mining in Business Applications

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Data Mining is the process of knowledge discovery involving finding

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Academic Performance: An Approach From Data Mining

A Business Intelligence Training Document Using the Walton College Enterprise Systems Platform and Teradata University Network Tools Abstract

Data warehousing and data mining an overview

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Data Mining for Fun and Profit

Introduction Predictive Analytics Tools: Weka

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

An Introduction to WEKA. As presented by PACE

Principles of Data Mining

Prerequisites. Course Outline

APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION ANALYSIS.

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

Keywords Data Mining, Knowledge Discovery, Direct Marketing, Classification Techniques, Customer Relationship Management

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

CRISP-DM: Towards a Standard Process Model for Data Mining

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven way to guide your data mining efforts.

Quality Control of National Genetic Evaluation Results Using Data-Mining Techniques; A Progress Report

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo Database And Data Mining Research Group

Data Mining Algorithms Part 1. Dejan Sarka

College information system research based on data mining

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

A Spatial Decision Support System for Property Valuation

Big Data. Introducción. Santiago González

Spatial Data Preparation for Knowledge Discovery

A Knowledge Management Framework Using Business Intelligence Solutions

Healthcare Measurement Analysis Using Data mining Techniques

Introduction to Data Mining Techniques

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

Data Mining and Soft Computing. Francisco Herrera

Evaluating an Integrated Time-Series Data Mining Environment - A Case Study on a Chronic Hepatitis Data Mining -

An Overview of Knowledge Discovery Database and Data mining Techniques

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

What is Customer Relationship Management? Customer Relationship Management Analytics. Customer Life Cycle. Objectives of CRM. Three Types of CRM

Data Mining Applications in Fund Raising

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS

The Prophecy-Prototype of Prediction modeling tool

City University of Hong Kong. Information on a Course offered by the Department of Management Sciences with effect from Semester A in 2012 / 2013

DATA MINING FOR SMALL STUDENT DATA SET KNOWLEDGE MANAGEMENT SYSTEM FOR HIGHER EDUCATION TEACHERS

Data Mining: Overview. What is Data Mining?

Data Mining Analytics for Business Intelligence and Decision Support

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Transcription:

Introduction to Data Mining José Hernández ndez-orallo Dpto.. de Systems Informáticos y Computación Universidad Politécnica de Valencia, Spain jorallo@dsic.upv.es Horsens, Denmark, 26th September 2005 1

Outline Motivation. BI: Old Needs, New Tools. Some DM Examples. Data Mining: Definition and Applications The KDD Process Data Mining Techniques Development and Implementation 2

CRISP-DM Methodology CRISP-DM (www.crisp-dm.org) (CRoss-Industry Standard Process for Data Mining) A company consortium (initially under the funding of the European Commission), which includes SPSS, NCR and DaimlerChrysler. 3

CRISP-DM Methodology Business Understanding: Understand the project goals and requirements from a business perspective. Substages: establishment of business objectives (initial context, objectives and success criteria), evaluation of the situation (resource inventory, requirements, assumptions and constraints, risks and contingences, terminology and costs and benefits), establishment of the data mining objectives (data mining objectives and success criteria) and, generation of the project plan (project plan and initial evaluation of tools and techniques). 4

CRISP-DM Methodology Data understanding: Collect and familiarise with data, identify the data quality problems and see the first potentialities or data subsets which might be interesting to analyse (according the business objectives from the previous stage). Substages: initial data gathering (gathering report), data description (description report), data exploration (exploration report) and data quality verification (quality information). 5

CRISP-DM Methodology Data preparation: The goal of this stage is to obtain the minable view. Here we find: integration, selection, cleansing and transformation. Substages: data selection (inclusion/exclusion reasons), data cleansing (data cleansing report), data construction (derived attributes, generated records), data integration (mixed data) and data formatting (reformatted data). 6

CRISP-DM Methodology Data modelling: It is the application of modelling techniques or data mining to the previous minable views. Substages: selection of the modelling technique (modelling technique, modelling assumptions), evaluation design (test design), model construction (chosen parameters, models, model description) and model evaluation (model measures, revision of the chosen parameters). 7

CRISP-DM Methodology Evaluation: It is necessary to evaluate (from the view point of the goal) the models of the previous stage. In other words, if the model is useful to answer some the business requirements. Substages: result evaluation (evaluation of the data mining results, approved models), revise the process (process revision) and, establishment of the following steps (list of possible actions, decisions). 8

CRISP-DM Methodology Deployment: The idea is to exploit the potential of the extracted models, integrate them in the decision-making processes of the organisation, spread reports about the extracted knowledge, etc. Substages: deployment planning (deployment plan), monitoring and maintenance planning (monitoring and maintenance plan), generation of the final report (final report, final presentation) and, project revision (documentation of the experience). 9

CRISP-DM Methodology Progressive implementation on an organisation: Planning and arrangement. Identify the data mining problems Identify the data mining problems Knowledge Extraction Iter. 1 Iter. 2 Model spread, deployment and exploitation Result Evaluation. Measurement of Costs and Benefits 10

Systems Elder Research, www.dataminglab.com 11

Systems Commercial DM Software: http://www.kdcentral.com/software/data_mining/ http://www.the-data-mine.com/bin/veiw/software/webindex Free: WEKA (http://www.cs.waikato.ac.nz/~ml/weka/) (Witten & Frank 1999) Rproject: free tool for statistical analysis (http://www.rproject.org/) Kepler: plug-in system from GMD (http://ais.gmd.de/kd/kepler.html). 12

Systems EXAMPLE: Clementine www.spss.com Tool that includes: Several data sources (ASCII, XLS and many DBMS through ODBC). Visual interface. Several data mining techniques: neural networks, decision trees, rules, a priori, regression, Data processing (pick & mix, combination and separation). Report and batch facilities. 13

EXAMPLE: Clementine Drug study http://www.pcc.qub.ac.uk/tec/courses/datamining/ohp/dm-ohp-final_3.html A number of hospital patients suffer a pathology which can be treated with a wide range of drugs. 5 different drugs are available. Patients respond differently to these drug. Problem: Systems Which drug is the most appropriate one for a new patient? 14

Systems EXAMPLE: Clementine. First step: DATA ACCESS: Read the data: e.g. a textfile with delimiters. The fields are named: age age sex gender BP blood pressure (High, Normal, Low) Cholesterol cholesterol (Normal, High) Na Sodium concentration in blood. K Potasium concentration in blood. drug drug to which the patient reacted satisfactorily. The attributes/variables can be combined: E.g. A new attribute (Na/K), can be added. 15

EXAMPLE: Clementine Systems Second Step: Familiarisation with the data. We visualise the records: 16

EXAMPLE: Clementine Systems Allows field selection and filtering. Can show graphically some data properties. E.g. : Which is the proportion of cases which reacted well to the drug? 17

EXAMPLE: Clementine Systems Can find relations. E.g: The relation between sodium and potasium is shown in a plot. We observe an apparently random distribution (except from drug Y) 18

Systems EXAMPLE: Clementine We can clearly observe that the patients with high Na/K quotient respond better to drug Y. But we want a classification model for every new patient, i.e.: Which is the best drug for each patient? Third step: Model construction Tasks performed in Clementine: Filter non-desired (irrelevant) attributes. Type the fields. Construct models (rules, decision trees, neural networks, ) 19

EXAMPLE: Clementine Systems This process is performed and graphically visualised in Clementine: From 2,000 examples the models are trained. 20

Systems EXAMPLE: Clementine Models can be browsed: The rules extend the same criterion which was discovered previously, i.e., drug Y for the patient with high Na/K ratio. But it also gives rules for the 21 rest.

Systems EXAMPLE: SAS ENTERPRISE MINER (EM) Suite that includes: Database connection (through ODBC and SAS datasets). Sampling and inclusion of derived variables. Data evaluation through dataset split into: training, validation (in case) and test. Different data mining techniques: decision trees, regression, neural network, clustering,... Model comparisons. Model conversion into SAS code. Graphical interface. Also includes tools for all the process flow: the stages can be repeated, modified and stored. 22

Systems EXAMPLE: SAS ENTERPRISE MINER (EM) (process flow, KDD) 23

Systems EXAMPLE: SAS ENTERPRISE MINER (EM) Selection and model assessment 24

Systems Weka, University of Waikato, New Zealand (cs.waikato.ac.nz) 25

Oracle: Business Intelligence and Data Mining tools Engine (Java DM) since Oracle 9i Suite (OracleBi Data Miner). OracleBI Spreadsheet Add-in Oracle Reports Services Spread Systems Define Hypotheses Oracle Enterprise Planning & Budgeting Oracle 10g (RDBMS with OLAP and DM) Model OracleActivit y Based Management Decide Oracle Balanced Scorecard OracleBI Data Miner OracleBI Discoverer Analyse Non-Oracle sources Source: IDC, 2004 OracleBI Warehouse Builder Trace Oracle E-Business Suite Act Oracle Daily Business Intelligence 26

Systems MS SQL SERVER: Analysis Services OLAP Services in SQL Server 97 was extended in SQL Server 2000 with DM features. This was called Analysis Services. Much more techniques included in the new SQL Server (2005). It is based on the OLE DB for Data Mining : an extension of the DB access protocol: OLE DB. Implements an SQL extension which works with DMM (Data Mining Model) and allows for: 1. Creating the model 2. Train the model 3. Make predictions 27

Trends 80s and early 90s: OLAP: statistics on predefined queries. The OLAP systems is useful to get complex reports quickly, to visualise the information, get plots and graphs and confirm hypotheses. Knowledge still extracted manually or has a statistical (summarisation) character. Data warehouses include only internal data. Late 90 Data-Mining: knowledge discovering. Machine learning and statistical modelling techniques are used to create novel patterns. Data warehouses mostly include internal data (and some external data). Early 00 Scoring techniques and simulation: discovering and use of global models.. Data warehouses include both internal and external data (economical, demographical, geographical indicators, etc.). 28

Mining Non-structured Data Web Mining refers to the global process of discovering information and knowledge which can be potentially useful and which is previously unknown from data on the web. (Etzioni 1996) Web Mining combines goals and techniques from different areas: Information Retrieval (IR) Natural Language Processing (NLP) Data Mining (DM) Databases (DB) WWW research Agent Technology There are several kinds of web mining: web content mining. web structure mining. web use mining. 29

General resources: KDcentral (www.kdcentral.com) The Data Mine (http://www.the-data-mine.com) Knowledge Discovery Mine (http://www.kdnuggets.com) Mailing list: KDD-nuggets: Journals: Data Mining and Knowledge Discovery. (http://www.research.microsoft.com/) Intelligent Data Analysis (http://www.elsevier.com/locate/ida) Associations: To know more... Some pointers ACM SIGDD (and the journal: explorations, http://www.acm.org/sigkdd/explorations/instructions.htm) 30

General Books Berry M.J.A.; Linoff, G.S. Mastering Data Mining Wiley 2000. Berthold, M.; Hand, D.J. (ed) Intelligent Data Analysis. An Introduction Second Edition, Springer 2002. Dunham, M.H. Data Mining. Introductory and Advanced Topics Prentice Hall, 2003. Fayyad, U.M.; Piatetskiy-Shapiro, G.; Smith, P.; Ramasasmy, U. Advances in Knowledge Discovery and Data Mining, MIT Press, 1996. Han, Jiawei; Micheline Kamber Data Mining: Concepts and Techniques Morgan Kaufmann, April 2000. Hand, David J.; Heikki Mannila and Padhraic Smyth Principles of Data Mining, The MIT Press, 2000. Hernández, J.; Ramírez, M.J.; Ferri, C. Introducción a la Minería de Datos, Prentice Hall / Addison Wesley, 2004. [*] Witten, I.H.; Frank, E. "Tools for Data Mining", Morgan Kaufmann, 1999. 31

More specific sources More specific or more technical Dzeroski, S.; Lavrac, N. Relational Data Mining Springer 2001. Etzioni, O. The World-Wide Web. Quagmire or Gold Mine Communications of the ACM, November 1996, Vol. 39, nº11, 1996. Fayyad, U.M.; Grinstein, G.G.; Wierse, A. (eds.) Information Visualization in Data Mining and Knowledge Discovery Morgan Kaufmann 2002. Mena, Jesus Data Mining Your Website, Digital Press, July 1999. Thuraisingham, B. Data Mining. Technologies, Techniques, Tools, and Trends, CRC Press, 1999. Wong, P.C. Visual Data Mining, Special Issue of IEEE Computer Graphics and Applications, Sep/Oct 1999, pp. 20-46. 32

QUESTIONS? 33