The KDD Process: Applying Data Mining




The KDD Process: Applying Data Mining
Nuno Cavalheiro Marques (nmm@di.fct.unl.pt)
Spring Semester 2010/2011, MSc in Computer Science

Outline
1 Knowledge Discovery in Data
    Data Mining beyond the Computer
2 Data Mining by Visualization
    Lift and ROC Charts
    Multidimensional Data Visualization
3 Decision Trees for Data Mining
    Representation
    Information Gain
    Hypothesis Space
    Issues
    SLIQ: A Fast Scalable Classifier for Data Mining
4 Data Mining with SOM
    SOM Training
    SOM Visualization and Clustering
    Parallel SOM [SM07]
5 References

Data Mining beyond the Computer: The Tabulating Machine

Data Mining beyond the Computer: KDD is Interactive...
Figure: the KDD process (diagram labels: input data from databases and texts; data warehousing; selection and sampling; target data; preprocessing and cleaning; clean data; aggregation; models; visualization; knowledge).

Multidimensional Data Visualization: Information Visualization and Related Topics
Please check the PDF file InformationVisualization.

Representation: Decision Tree for PlayTennis (in [M97])

Outlook
    Sunny: Humidity
        High: No
        Normal: Yes
    Overcast: Yes
    Rain: Wind
        Strong: No
        Weak: Yes

Information Gain: Definition I
Gain(S, A) = expected reduction in entropy due to sorting on A

$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
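The definition of Gain(S, A) can be checked numerically with a short sketch. The data is an assumption: the slides do not list the training set, so the 14 PlayTennis examples from Mitchell's textbook [M97] are used here (Temperature omitted), and the names `ROWS`, `entropy` and `gain` are illustrative.

```python
from collections import Counter
from math import log2

# Assumed data: PlayTennis examples from Mitchell's textbook [M97].
# Columns: Outlook, Humidity, Wind, PlayTennis (the target).
ROWS = [
    ("Sunny", "High", "Weak", "No"), ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"), ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"), ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"), ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"), ("Rain", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"), ("Overcast", "High", "Strong", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"), ("Rain", "High", "Strong", "No"),
]
COLUMNS = {"Outlook": 0, "Humidity": 1, "Wind": 2}


def entropy(labels):
    """Entropy(S) in bits, from the class frequencies in `labels`."""
    total = len(labels)
    return sum(n / total * log2(total / n) for n in Counter(labels).values())


def gain(rows, column):
    """Entropy(S) minus the size-weighted entropy of each subset S_v."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


for name, column in COLUMNS.items():
    print(f"Gain(S, {name}) = {gain(ROWS, column):.3f}")
```

On this data Outlook has the largest gain, which is consistent with Outlook being the root of the PlayTennis tree.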

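Hypothesis Space: Hypothesis Space Search by ID3
Figure: ID3 searching the space of decision trees, where each partial tree is extended by testing a further attribute (A1, A2, A3, A4, ...) on the positive (+) and negative examples.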

Hypothesis Space: Properties of ID3
    Hypothesis space is complete! Target function surely in there...
    Outputs a single hypothesis (which one?): can't play 20 questions...
    No backtracking: local minima...
    Statistically-based search choices: robust to noisy data...
    Inductive bias: approximately "prefer the shortest tree"

Hypothesis Space: Inductive Bias in ID3
Note that H is the power set of instances X. Unbiased? Not really...
    Preference for short trees, and for those with high information gain attributes near the root
    Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
    Occam's razor: prefer the shortest hypothesis that fits the data

Issues: Gini Index or Entropy?

$$\mathrm{gini}(t) = 1 - \sum_j p_j^2$$

Figure: plot comparing the Gini index with the entropy H(X).
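The two impurity measures can be compared numerically for a two-class node. This is a minimal sketch in which p_j is taken to be the relative frequency of class j at the node; the function names and the sample probabilities are illustrative assumptions.

```python
from math import log2


def gini(probabilities):
    """gini(t) = 1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in probabilities)


def entropy(probabilities):
    """H(X) in bits; classes with p = 0 contribute nothing."""
    return sum(p * log2(1.0 / p) for p in probabilities if p > 0)


for p in (0.0, 0.1, 0.25, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.2f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are zero for a pure node and largest for a uniform class distribution, so they often rank candidate splits similarly.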

Issues: Continuous Valued Attributes I
Create a discrete attribute to test a continuous one:
    Temperature = 82.5
    (Temperature > 72.3) = t, f

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
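One common way to pick the threshold, sketched here under stated assumptions, is to sort the examples by the continuous attribute, take the midpoints between adjacent examples whose class differs as candidates, and keep the candidate with the highest information gain. The slide gives only the six Temperature/PlayTennis values; the selection procedure and all function names below are assumptions.

```python
from collections import Counter
from math import log2

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]


def entropy(labels):
    total = len(labels)
    return sum(n / total * log2(total / n) for n in Counter(labels).values())


def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, label_a), (b, label_b) in zip(pairs, pairs[1:])
            if label_a != label_b]


def gain_for_threshold(values, labels, threshold):
    """Information gain of the boolean test (value > threshold)."""
    below = [l for v, l in zip(values, labels) if v <= threshold]
    above = [l for v, l in zip(values, labels) if v > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (below, above))
    return entropy(labels) - remainder


candidates = candidate_thresholds(temperature, play_tennis)
best = max(candidates,
           key=lambda t: gain_for_threshold(temperature, play_tennis, t))
print(candidates, best)
```

The sketch does not handle repeated attribute values with different labels; a fuller version would group equal values before computing midpoints.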

Issues: Attributes with Many Values I
Problem: if an attribute has many values, Gain will select it. Imagine using Date = Jun 3 1996 as an attribute.
One approach: use GainRatio instead.

$$\mathrm{GainRatio}(S, A) \equiv \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$

$$\mathrm{SplitInformation}(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

where S_i is the subset of S for which A has value v_i.
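The effect of SplitInformation can be seen on a tiny example. In this sketch the four examples, the date strings and the function names are made up for illustration; the only point is that a Date-like attribute with one value per example gets the same Gain as a genuinely useful attribute but a lower GainRatio.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Entropy in bits of the empirical distribution of `items`."""
    total = len(items)
    return sum(n / total * log2(total / n) for n in Counter(items).values())


def gain(values, labels):
    remainder = 0.0
    for value in set(values):
        subset = [l for v, l in zip(values, labels) if v == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    # SplitInformation(S, A) is the entropy of S with respect to the values of A.
    split_information = entropy(values)
    if split_information == 0:
        return 0.0  # attribute has a single value, so it cannot split S
    return gain(values, labels) / split_information


# Toy data (hypothetical): both attributes separate the classes perfectly.
labels = ["Yes", "Yes", "No", "No"]
date = ["Jun 1", "Jun 2", "Jun 3", "Jun 4"]   # one distinct value per example
wind = ["Weak", "Weak", "Strong", "Strong"]   # two values

print(gain(date, labels), gain_ratio(date, labels))
print(gain(wind, labels), gain_ratio(wind, labels))
```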

Issues: Unknown Attribute Values
What if some examples are missing values of A? Use the training example anyway and sort it through the tree. If node n tests A, either:
    assign the most common value of A among the other examples sorted to node n, or
    assign the most common value of A among the other examples with the same target value, or
    assign probability p_i to each possible value v_i of A, and assign fraction p_i of the example to each descendant in the tree.
Classify new examples in the same fashion.
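The three strategies for a missing value of A can be sketched on the examples sorted to a single node. The data, the use of `None` to mark a missing value and the function names are assumptions made for this illustration.

```python
from collections import Counter

# Toy examples at node n: (value of A, target value); None marks a missing value.
examples = [
    ("High", "No"), ("High", "No"), ("High", "No"),
    ("Normal", "Yes"), ("Normal", "Yes"),
    (None, "Yes"),
]


def most_common_value(examples):
    """Strategy 1: most common known value of A at this node."""
    known = [v for v, _ in examples if v is not None]
    return Counter(known).most_common(1)[0][0]


def most_common_value_for_target(examples, target):
    """Strategy 2: most common known value of A among examples with this target."""
    known = [v for v, t in examples if v is not None and t == target]
    return Counter(known).most_common(1)[0][0]


def value_probabilities(examples):
    """Strategy 3: p_i for each value v_i, used as fractional example weights."""
    known = [v for v, _ in examples if v is not None]
    return {v: n / len(known) for v, n in Counter(known).items()}


print(most_common_value(examples))
print(most_common_value_for_target(examples, "Yes"))
print(value_probabilities(examples))
```

With the third strategy the incomplete example is not assigned to one branch; a fraction p_i of it is passed down each branch, and those fractional counts are then used when computing Gain further down the tree.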

SLIQ: A Fast Scalable Classifier for Data Mining
Please check the PDF file for the SLIQ presentation...

Basic Model for Self-Organizing Maps (SOM): Basic equations

$$c = \arg\min_i \lVert x - m_i \rVert$$

$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x(t) - m_i(t)]$$

h_{ci}(t): function for creating the (usually 2D) map effect, relating nearby neurons.
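The two basic equations translate into a single training step: find the best-matching unit c, then move every unit towards x by an amount h_ci. The sketch below is one possible reading. The slides do not specify h_ci(t), so a Gaussian kernel over grid distance scaled by a learning rate is assumed, and the map size, parameter values and function names are all illustrative.

```python
from math import exp


def best_matching_unit(weights, x):
    """c = argmin_i ||x - m_i|| (squared distance gives the same argmin)."""
    def squared_distance(m):
        return sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return min(range(len(weights)), key=lambda i: squared_distance(weights[i]))


def som_step(weights, positions, x, learning_rate=0.5, radius=1.0):
    """Apply m_i(t+1) = m_i(t) + h_ci(t) [x(t) - m_i(t)] to every unit."""
    c = best_matching_unit(weights, x)
    updated = []
    for m, position in zip(weights, positions):
        grid_dist2 = sum((a - b) ** 2 for a, b in zip(position, positions[c]))
        # Assumed neighbourhood function: Gaussian in grid distance to unit c.
        h = learning_rate * exp(-grid_dist2 / (2 * radius ** 2))
        updated.append([mi + h * (xi - mi) for mi, xi in zip(m, x)])
    return c, updated


# A 1x3 map with 2-dimensional weight vectors.
positions = [(0, 0), (0, 1), (0, 2)]
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
c, new_weights = som_step(weights, positions, x=[0.9, 0.8])
print(c, new_weights)
```

In practice both the learning rate and the radius would normally shrink over time, which is what makes h_ci depend on t.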

SOM Training: Competitive Learning
Web effect ([?])

SOM Visualization and Clustering: Data Visualization and Clustering in SOM
UCI's Credit [?]; UCI's Adult [SM07]

Parallel SOM [SM07]: Training SOM with two phases
    Topological order: small number of epochs
    Convergence: large number of epochs
Basic idea: exploit the two-phase behaviour
Figure: SOM train example from [?]

Parallel SOM [SM07]: Hybrid Algorithm [SM07]
    Merge the advantages of Network-Partition and Data-Partition algorithms
    Take advantage of the two phases while training SOM
Algorithm:
    During topological order: simple data-partition method
    During convergence: hybrid mode, segmenting patterns and map every X epochs

Parallel SOM [SM07]: Segmenting the hybrid algorithm
    Segmenting the map and patterns with a histogram
    Need to measure segment sample migration
Figure: Asymmetrical segmentation example

Parallel SOM [SM07]
Goal: qualitative validation of topological information
Figure: DS Chainlink and U-Matrix
Figure: U-Matrices with hybrid algorithm

Main References

[M97] Mitchell, T.M. Machine Learning. McGraw-Hill, March 1997.

Manish Mehta, Rakesh Agrawal and Jorma Rissanen. SLIQ: A Fast Scalable Classifier for Data Mining. In Advances in Database Technology, LNCS, Vol. 1057, 1996.

[SM07] Bruno Silva and Nuno Marques. A hybrid parallel SOM algorithm for large maps in data-mining. In José Neves, Manuel Filipe Santos, and José Machado, editors, New Trends in Artificial Intelligence, Guimarães, Portugal, December 2007. Associação Portuguesa para a Inteligência Artificial (APPIA).