Part I Data Mining Fundamentals
Data Mining: A First View Chapter 1
1.1 Data Mining: A Definition
Data Mining The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
Induction-based Learning The process of forming general concept definitions by observing specific examples of concepts to be learned.
Knowledge Discovery in Databases (KDD) The application of the scientific method to data mining. Data mining is one step of the KDD process.
1.2 What Can Computers Learn?
Four Levels of Learning Facts Concepts Procedures Principles
Facts A fact is a simple statement of truth.
Concepts A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.
Procedures A procedure is a step-by-step course of action to achieve a goal.
Principles Principles are general truths or laws that are basic to other truths.
Computers & Learning Computers are good at learning concepts. Concepts are the output of a data mining session.
Three Concept Views Classical View Probabilistic View Exemplar View
Classical View All concepts have definite defining properties.
Probabilistic View People store and recall concepts as generalizations created by observations.
Exemplar View People store and recall likely concept exemplars that are used to classify unknown instances.
Supervised Learning Build a learner model using data instances of known origin. Use the model to determine the outcome for new instances of unknown origin.
Supervised Learning: A Decision Tree Example
Decision Tree A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Table 1.1 Hypothetical Training Data for Disease Diagnosis
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold
Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    No: Diagnosis = Allergy
    Yes: Diagnosis = Cold
Figure 1.1 A decision tree for the data in Table 1.1
Table 1.2 Data Instances with an Unknown Classification
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?
Production Rules
IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
Unsupervised Clustering A data mining method that builds models from data without predefined classes.
The Acme Investors Dataset
Table 1.3 Acme Investors Incorporated
Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30-39  Tennis               40-59K
1013         Custodial     No              Broker              0.5           F    50-59  Skiing               80-99K
1245         Joint         No              Online              3.6           M    20-29  Golf                 20-39K
2110         Individual    Yes             Broker              22.3          M    30-39  Fishing              40-59K
1001         Individual    Yes             Online              5.0           M    40-49  Golf                 60-79K
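The three production rules above can be sketched as a small Python function and applied to the unknown instances of Table 1.2. This is a minimal illustration, not part of the original slides; the function name `diagnose` and the boolean encoding of Yes/No are assumptions made for the example.

```python
def diagnose(swollen_glands: bool, fever: bool) -> str:
    """Apply the production rules read off the decision tree in Figure 1.1."""
    if swollen_glands:
        return "Strep throat"   # IF Swollen Glands = Yes
    if fever:
        return "Cold"           # IF Swollen Glands = No & Fever = Yes
    return "Allergy"            # IF Swollen Glands = No & Fever = No

# Table 1.2: (swollen glands, fever) for patients 11-13; True means Yes.
unknown = {11: (True, False), 12: (False, True), 13: (False, False)}

predictions = {pid: diagnose(sg, fv) for pid, (sg, fv) in unknown.items()}
for pid, label in predictions.items():
    print(pid, label)
```

Only two of the five attributes are needed, because the tree never tests sore throat, congestion, or headache.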
The Acme Investors Dataset & Supervised Learning
1. Can I develop a general profile of an online investor?
2. Can I determine if a new customer is likely to open a margin account?
3. Can I build a model to predict the average number of trades per month for a new investor?
4. What characteristics differentiate female and male investors?
The Acme Investors Dataset & Unsupervised Clustering
1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?
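One possible way to make "attribute similarities" concrete is to count how many categorical attribute values two customers share. The sketch below does this for the five customers of Table 1.3. It is not a clustering algorithm and is not taken from the slides; the simple match count, the function name `match_count`, and the decision to leave out the numeric Trades/Month column are all assumptions made for illustration.

```python
from itertools import combinations

# Table 1.3, categorical attributes only:
# (account type, margin account, transaction method, sex, age, recreation, income)
customers = {
    1005: ("Joint", "No", "Online", "F", "30-39", "Tennis", "40-59K"),
    1013: ("Custodial", "No", "Broker", "F", "50-59", "Skiing", "80-99K"),
    1245: ("Joint", "No", "Online", "M", "20-29", "Golf", "20-39K"),
    2110: ("Individual", "Yes", "Broker", "M", "30-39", "Fishing", "40-59K"),
    1001: ("Individual", "Yes", "Online", "M", "40-49", "Golf", "60-79K"),
}

def match_count(a, b):
    """Number of attributes on which two customers have the same value."""
    return sum(x == y for x, y in zip(a, b))

# Pairwise similarity for every pair of customers.
pairs = {
    (i, j): match_count(customers[i], customers[j])
    for i, j in combinations(customers, 2)
}

for (i, j), score in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(i, j, score)
```

A clustering method could use a measure like this to decide which customers belong together, without being told in advance which attribute defines the groups.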
1.3 Is Data Mining Appropriate for My Problem?
Data Mining or Data Query? Shallow Knowledge Multidimensional Knowledge Hidden Knowledge Deep Knowledge
Shallow Knowledge Shallow knowledge is factual. It can be easily stored and manipulated in a database.
Multidimensional Knowledge Multidimensional knowledge is also factual. Online analytical processing (OLAP) tools are used to manipulate multidimensional knowledge.
Hidden Knowledge Hidden knowledge represents patterns or regularities in data that cannot be easily found using database queries. However, data mining algorithms can find such patterns with ease.
Deep Knowledge Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.
Data Mining vs. Data Query: An Example Use data query if you already know, or almost know, what you are looking for. Use data mining to find regularities in data that are not obvious.
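To illustrate the data query side of this contrast, the sketch below answers a question whose form is known in advance ("which customers have a margin account?") by filtering a few columns of Table 1.3. The record layout and field names are assumptions made for the example; a real system would more likely issue the equivalent query against a database.

```python
# A few columns of Table 1.3 as plain records (field names are illustrative).
customers = [
    {"id": 1005, "margin_account": "No", "method": "Online"},
    {"id": 1013, "margin_account": "No", "method": "Broker"},
    {"id": 1245, "margin_account": "No", "method": "Online"},
    {"id": 2110, "margin_account": "Yes", "method": "Broker"},
    {"id": 1001, "margin_account": "Yes", "method": "Online"},
]

# Data query: the condition is stated up front, so a simple filter suffices.
margin_ids = [c["id"] for c in customers if c["margin_account"] == "Yes"]
print(margin_ids)
```

The query returns facts already stored in the data. It does not suggest which attributes tend to go along with having a margin account, which is the kind of question data mining addresses.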
1.4 Expert Systems or Data Mining?
Expert System A computer program that emulates the problem-solving skills of one or more human experts.
Knowledge Engineer A person trained to interact with an expert in order to capture their knowledge.
Data -> Data Mining Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat
Human Expert -> Knowledge Engineer -> Expert System Building Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat
Figure 1.2 Data mining vs. expert systems
1.5 A Simple Data Mining Process Model
Figure 1.3 A simple data mining process model (components: Operational Database, SQL Queries, Data Warehouse, Data Mining, Interpretation & Evaluation, Result Application)
Assembling the Data The Data Warehouse Relational Databases and Flat Files
The Data Warehouse The data warehouse is a historical database designed for decision support.
Mining the Data
Interpreting the Results
Result Application
1.6 Why Not Simple Search? Nearest Neighbor Classifier K-nearest Neighbor Classifier
Nearest Neighbor Classifier Classification is performed by searching the training data for the instance closest in distance to the unknown instance.
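The search described above can be sketched on the data of Table 1.1. The slides do not name a distance measure, so this example assumes Hamming distance (the number of attributes on which two instances differ) over a 1/0 encoding of Yes/No. The names `TRAINING`, `hamming`, and `nearest_neighbor` are illustrative, and ties are broken by table order.

```python
# Table 1.1 as (sore throat, fever, swollen glands, congestion, headache);
# 1 means Yes and 0 means No.
TRAINING = [
    ((1, 1, 1, 1, 1), "Strep throat"),  # patient 1
    ((0, 0, 0, 1, 1), "Allergy"),       # patient 2
    ((1, 1, 0, 1, 0), "Cold"),          # patient 3
    ((1, 0, 1, 0, 0), "Strep throat"),  # patient 4
    ((0, 1, 0, 1, 0), "Cold"),          # patient 5
    ((0, 0, 0, 1, 0), "Allergy"),       # patient 6
    ((0, 0, 1, 0, 0), "Strep throat"),  # patient 7
    ((1, 0, 0, 1, 1), "Allergy"),       # patient 8
    ((0, 1, 0, 1, 1), "Cold"),          # patient 9
    ((1, 1, 0, 1, 1), "Cold"),          # patient 10
]

def hamming(a, b):
    """Count the attributes on which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor(instance):
    """Return the diagnosis of the closest training instance."""
    _, label = min(TRAINING, key=lambda row: hamming(row[0], instance))
    return label

# Table 1.2: the instances with unknown classification.
UNKNOWN = {11: (0, 0, 1, 1, 1), 12: (1, 1, 0, 0, 1), 13: (0, 0, 0, 0, 1)}

predictions = {pid: nearest_neighbor(x) for pid, x in UNKNOWN.items()}
for pid, label in predictions.items():
    print(pid, label)
```

Under these assumptions patient 11 is labeled Allergy, because its closest training instance is patient 2, whereas the production rules label it Strep throat. The difference arises because this search weights all five attributes equally, including those the decision tree never tests.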
1.7 Data Mining Applications
Customer Intrinsic Value
Figure 1.4 Intrinsic vs. actual customer value (scatter plot; axes: Intrinsic (Predicted) Value and Actual Value)