Data Mining Part 5. Prediction



Similar documents
Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Data Mining for Knowledge Management. Classification

Data Mining Part 5. Prediction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

Customer Classification And Prediction Based On Data Mining Technique

Classification and Prediction

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

DATA MINING TECHNIQUES AND APPLICATIONS

(b) How data mining is different from knowledge discovery in databases (KDD)? Explain.

Introduction to Data Mining

Data Mining - Evaluation of Classifiers

Data Mining Algorithms Part 1. Dejan Sarka

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Learning is a very general term denoting the way in which agents:

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Knowledge Discovery and Data Mining

Machine Learning: Overview

not possible or was possible at a high cost for collecting the data.

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Classification algorithm in Data mining: An Overview

Knowledge Discovery and Data Mining

Information Management course

Data Mining: Overview. What is Data Mining?

Social Media Mining. Data Mining Essentials

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

Data Mining. Nonlinear Classification

COURSE RECOMMENDER SYSTEM IN E-LEARNING

Data, Measurements, Features

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Azure Machine Learning, SQL Data Mining and R

Knowledge Discovery and Data Mining

Using Data Mining for Mobile Communication Clustering and Characterization

Data Mining Classification: Decision Trees

Data Mining. SPSS Clementine Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

Data Mining Analytics for Business Intelligence and Decision Support

Predictive Data modeling for health care: Comparative performance study of different prediction models

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

from Larson Text By Susan Miertschin

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Data Mining Applications in Higher Education

Data Mining for Business Analytics

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Introduction to Data Mining

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

Professor Anita Wasilewska. Classification Lecture Notes

How To Solve The Kd Cup 2010 Challenge

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Data Preprocessing. Week 2

Decision Trees from large Databases: SLIQ

An Introduction to Data Mining

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Chapter 12 Discovering New Knowledge Data Mining

A Survey of Classification Techniques in the Area of Big Data.

How To Use Neural Networks In Data Mining

Intrusion Detection. Jeffrey J.P. Tsai. Imperial College Press. A Machine Learning Approach. Zhenwei Yu. University of Illinois, Chicago, USA

Data Mining and Machine Learning in Bioinformatics

Introduction to Pattern Recognition

Classification and Prediction

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

LVQ Plug-In Algorithm for SQL Server

Data Mining as a tool to Predict the Churn Behaviour among Indian bank customers

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

SVM Ensemble Model for Investment Prediction

Data Mining 5. Cluster Analysis

Use of Data Mining in Banking

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)

Data Mining Solutions for the Business Environment

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Specific Usage of Visual Data Analysis Techniques

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

NEURAL NETWORKS IN DATA MINING

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

Data Mining: An Introduction

Data Mining Individual Assignment report

Machine Learning Capacity and Performance Analysis and R

Principles of Data Mining by Hand&Mannila&Smyth

AnalysisofData MiningClassificationwithDecisiontreeTechnique

Discretization and grouping: preprocessing steps for Data Mining

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Rule based Classification of BSE Stock Data with Data Mining

Machine Learning using MapReduce

Improving spam mail filtering using classification algorithms with discretization Filter

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Supervised Feature Selection & Unsupervised Dimensionality Reduction

CS Introduction to Data Mining Instructor: Abdullah Mueen

Framing Business Problems as Data Mining Problems

USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES

Transcription:

Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini

Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References

Classification vs. Numeric Prediction

Classification vs. Numeric Prediction Prediction problems predict future data trends Major types: Classification Numeric prediction

Classification vs. Numeric Prediction Classification a model or classifier to predict categorical labels (discrete or nominal) The ordering among categories has no meaning e.g. such as safe or risky for the loan application data guess whether a customer with a given profile will buy a new computer.

Classification vs. Numeric Prediction Numeric Prediction a model or predictor to predict a continuous-valued function or ordered value e.g. predicting how much a given customer will spend during a sale at AllElectronics Regression analysis is a statistical methodology that is most often used for numeric prediction

Typical applications Credit approval Target marketing Medical diagnosis Fraud detection Performance prediction Manufacturing Applications

Classification Techniques for data classification: Decision tree classifiers Bayesian classifiers Bayesian belief networks Rule-based classifiers Backpropagation (a neural network technique) Support vector machines K-nearest-neighbor classifiers Case-based reasoning Genetic algorithms

Prediction Process

Prediction Process Prediction is a two-step process: Learning step or model construction Model usage

Model Construction Learning step (model construction) A classification algorithm builds the classifier by analyzing or learning from a training set Each instance is assumed to belong to a predefined class, as determined by the class label attribute The set of instances used for model construction is training set The model is represented as classification rules, decision trees, or mathematical formula Data instances can be referred to as samples, examples, instances, data points, objects, or data tuples

Model Construction Example: identify loan applications as being either safe or risky

Model Usage Model usage: for classifying future or unknown objects

Estimate Accuracy Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy rate is the percentage of test set samples that are correctly classified by the model The test instances are randomly selected from the general data set Test set is independent of training set, otherwise over-fitting will occur If the accuracy is acceptable, use the model to classify new data instances whose class labels are not known

Numeric Prediction Numeric prediction is a two step process, similar to that of classification The attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). This attribute can be referred to simply as the predicted attribute. Example: We want to predict the amount (in dollars) that would be safe for the bank to loan an applicant. We use the continuous-valued loan_amount as the predicted attribute, and build a predictor for our task.

Supervised vs. Unsupervised Learning Supervised learning The class label of each training instance is known Learning step is also known as supervised learning i.e., the learning of the classifier is supervised in that it is told to which class each training instance belongs Unsupervised learning (clustering) The class label of each training instance is not known The number or set of classes to be learned may not be known in advance the aim of establishing the existence of classes or clusters in the data

Data Preparation

Data Preparation The preprocessing may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification process. The preprocessing steps: Data cleaning Relevance analysis Data transformation

Data cleaning Data Cleaning to remove or reduce noisy values the treatment of missing values Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.

Relevance Analysis (feature selection) Relevance analysis: Correlation analysis to detect redundant attributes Correlation analysis can be used to identify whether any two given attributes are statistically related. For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. Attribute subset selection to remove irrelevant attributes to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.

Data transformation Normalization Discretization Generalization Data Transformation

Normalization Data Transformation The normalization is used particularly when methods involving distance measurements are used in the learning step. The values a given attribute fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. Normalization would prevent attributes with initially large ranges (like income) from outweighing attributes with initially smaller ranges (such as binary attributes).

Discretization Data Transformation For example, numeric values for the attribute income can be generalized to discrete ranges, such as low, medium, and high. Similarly, categorical attributes Generalization The data can also be transformed to higher-level concepts. Example: like street, can be generalized to higherlevel concepts, like city. Because generalization compresses the original training data, fewer input/output operations may be involved during learning.

Comparing Prediction Methods

Comparing Prediction Methods Classification and prediction methods can be compared and evaluated according to the following criteria: Accuracy the ability of a given classifier to correctly predict the class label of new data Speed time to construct the model (training time) time to use the model (classification/prediction time) Robustness the ability of the classifier to make correct predictions given noisy data or data with missing values.

Comparing Prediction Methods Scalability The ability to construct the classifier or predictor efficiently given large amounts of data. Interpretability the level of understanding and insight that is provided by the classifier.

References

References J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc. (2006). (Chapter 6)

The end