Data Mining Primitives


CS 5331 by Rattikorn Hewett, Texas Tech University

Outline
- Motivation
- Data mining primitives
- Data mining query languages
- Designing GUI for data mining systems
- Architectures

Motivations: Why primitives?
- Data mining systems uncover a large set of patterns, not all of which are interesting
- Data mining should be an interactive process: the user directs what is to be mined
- Users need data mining primitives to communicate with the data mining system, by incorporating them in a data mining query language
- Benefits:
  - More flexible user interaction
  - Foundation for the design of graphical user interfaces
  - Standardization of the data mining industry and practice

Data mining primitives
Data mining tasks can be specified in the form of data mining queries using five data mining primitives:
- Task-relevant data (input)
- The kinds of knowledge to be mined (function & output)
- Background knowledge (interpretation)
- Interestingness measures (evaluation)
- Visualization of the discovered patterns (presentation)

Task-relevant data
Specify the data to be mined:
- Database, data warehouse, relation, cube
- Conditions for selection & grouping
- Relevant attributes

Knowledge to be mined
Specify data mining functions:
- Characterization/discrimination
- Association
- Classification/prediction
- Clustering

Background Knowledge
Typically in the form of concept hierarchies:
- Schema hierarchy. E.g., street < city < state < country
- Set-grouping hierarchy. E.g., {low, high} ⊂ all, {30..49} ⊂ low, {50..100} ⊂ high
- Operation-derived hierarchy. E.g., email address dmbook@cs.ttu.edu gives login-name < department < university < organization
- Rule-based hierarchy. E.g., 87 ≤ temperature < 90 → normal_temperature

Interestingness
Objective measures:
- Simplicity: simpler rules are easier to understand and likely to be interesting. E.g., (association) rule length, (decision) tree size
- Certainty: validity of the rule. Rule A => B has confidence P(B|A) = #(A and B) / #(A). Related measures: classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility: potential usefulness. Rule A => B has support #(A and B) / sample size. Related measure: noise threshold (description)
- Novelty: not previously known, surprising (used to remove redundant rules)
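The support and confidence measures for a rule A => B can be checked on a toy data set. The sketch below is only an illustration: the transactions and item names are made up, and the helper names `support` and `confidence` are assumptions, not part of any data mining system discussed in the slides.

```python
from typing import FrozenSet, List

# Hypothetical market-basket transactions (illustrative data only).
transactions: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

def count(itemset: FrozenSet[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Support of A => B: #(A and B) / sample size."""
    return count(a | b) / len(transactions)

def confidence(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Confidence of A => B: #(A and B) / #(A), an estimate of P(B|A)."""
    return count(a | b) / count(a)

A = frozenset({"bread"})
B = frozenset({"milk"})
print(support(A, B))     # 3 of 5 transactions contain both items
print(confidence(A, B))  # 3 of the 4 transactions containing bread also contain milk
```

A system would typically compare these two values against user-specified minimum thresholds before reporting the rule.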

Visualization of Discovered Patterns
- Specify the form in which to view the patterns. E.g., rules, tables, charts, decision trees, cubes, reports, etc.
- Specify operations for data exploration at multiple levels of abstraction. E.g., drill-down, roll-up, etc.

DMQL (data mining query language)
- A DMQL can provide the ability to support ad hoc and interactive data mining
- By providing a standardized language, the hope is to achieve an effect similar to the one SQL has had on relational databases:
  - Foundation for system development and evolution
  - Facilitate information exchange, technology transfer, commercialization and wide acceptance
- DMQL is designed with the primitives described earlier

Languages & Standardization Efforts
- Association rule language specifications
  - MSQL (Imielinski & Virmani 99)
  - MineRule (Meo, Psaila and Ceri 96)
  - Query flocks, based on Datalog syntax (Tsur et al. 98)
- OLEDB for DM (Microsoft 2000)
  - Based on OLE, OLE DB, OLE DB for OLAP
  - Integrating DBMS, data warehouse and data mining
- CRISP-DM (CRoss-Industry Standard Process for Data Mining)
  - Providing a platform and process structure for effective data mining
  - Emphasizing the deployment of data mining technology to solve business problems

Designing GUI based on DMQL
What tasks should be considered in the design of GUIs based on a data mining query language?
- Data collection and data mining query composition
- Presentation of discovered patterns
- Hierarchy specification and manipulation
- Manipulation of data mining primitives
- Interactive multilevel mining
- Other information

Architectures
Coupling a data mining system with a DB/DW system:
- No coupling: flat file processing; not recommended
- Loose coupling: fetching data from the DB/DW
- Semi-tight coupling: enhanced DM performance. Provide efficient implementations of a few data mining primitives in the DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
- Tight coupling: a uniform information processing environment. DM is smoothly integrated into the DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.

Concept Description

Review terms
Descriptive vs. predictive data mining:
- Descriptive: describes the data set in concise, summarative, informative, discriminative forms
- Predictive: constructs models representing the data set, and uses them to predict the behavior of unknown data
Concept description involves:
- Characterization: provides a concise and succinct summarization of the given collection of data
- Comparison (discrimination): provides descriptions comparing two or more collections of data

Concept Description vs. OLAP
- Concept description: can handle complex data types (e.g., text, image) of the attributes and their aggregations; a more automated process
- OLAP: restricted to a small number of dimension and measure data types; a user-controlled process

Characterization methods
One approach for characterization is to transform data from low conceptual levels to high ones (data generalization). E.g., daily sales → annual sales, Biology → Science
Two methods:
- Summarization, as in data cube's OLAP
- Hierarchical generalization: attribute-oriented induction

Summarization by OLAP
- Data are stored in data cubes
- Identify summarization computations, e.g., count(), sum(), average(), max()
- Perform the computations and store the results in data cubes
- Generalization and specialization can be performed on a data cube by roll-up and drill-down
- An efficient implementation of data generalization
- Limitations:
  - Can handle only simple non-numeric data types for dimensions
  - Can handle only summarization of numeric data
  - Does not guide users on which dimensions to explore or which levels to reach
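Roll-up on a data cube amounts to re-aggregating a measure after mapping a dimension to a higher conceptual level, as in the daily sales → annual sales example. The following is a minimal sketch in plain Python rather than an OLAP engine; the sales figures, the city dimension and the helper name `aggregate` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical base cells: (date "YYYY-MM-DD", city, sales amount).
daily_sales = [
    ("2004-01-15", "Richmond", 120.0),
    ("2004-01-15", "Burnaby", 80.0),
    ("2004-06-30", "Richmond", 200.0),
    ("2005-02-01", "Richmond", 50.0),
    ("2005-02-01", "Burnaby", 150.0),
]

def aggregate(rows, key_fn):
    """Sum the measure over all base cells that map to the same generalized key."""
    totals = defaultdict(float)
    for date, city, amount in rows:
        totals[key_fn(date, city)] += amount
    return dict(totals)

# Roll-up on the time dimension: day -> year, keeping the city dimension.
by_year_city = aggregate(daily_sales, lambda d, c: (d[:4], c))

# Roll-up further by dropping the city dimension altogether.
by_year = aggregate(daily_sales, lambda d, c: d[:4])

print(by_year_city)
print(by_year)
```

Drill-down is the reverse direction: going back from `by_year` to `by_year_city` requires the finer-grained cells, which is why cubes store or can recompute the lower levels.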

Attribute-Oriented Induction
- Proposed in 1989 (KDD'89 workshop)
- Not confined to categorical data nor to particular measures
How is it done?
- Collect the task-relevant data (initial relation) using a relational database query
- Perform generalization by attribute removal or attribute generalization
- Apply aggregation by merging identical generalized tuples and accumulating their respective counts
- Interactive presentation with users

Basic Elements
- Data focusing: select the task-relevant data, including dimensions; the result is the initial relation
- Attribute removal and attribute generalization, applied when attribute A has a large set of distinct values:
  - If there is no generalization operator on A, or A's higher-level concepts are expressed in terms of other attributes (giving redundancy): remove A
  - If there exists a set of generalization operators on A: select an operator to generalize A
- Generalization threshold controls:
  - Attribute generalization threshold: controls the number of distinct attribute values that triggers generalization or removal (~2-8, specified/default)
  - Relation generalization threshold: controls the final relation/rule size (~10-30)

General Steps
- InitialRel: query processing of the task-relevant data, deriving the initial relation
- PreGen: based on an analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: removal, or how high to generalize?
- PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a prime generalized relation, accumulating the counts
- Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations
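The steps above can be sketched on a toy relation. This is a simplified, single-level illustration and not the published algorithm in full: the rows, the one-level concept hierarchies, the threshold value and the function name `attribute_oriented_induction` are all assumptions made for the example, and the relation generalization threshold is not modeled.

```python
from collections import Counter

# Hypothetical one-level concept hierarchies (assumed for illustration).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science", "Civil Eng": "Eng"}
CITY_TO_REGION = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "Foreign"}
GENERALIZE = {"major": MAJOR_TO_FIELD.get, "birth_place": CITY_TO_REGION.get}

# Hypothetical initial relation (the result of the data-focusing query).
initial_relation = [
    {"name": "Jim", "gender": "M", "major": "CS", "birth_place": "Vancouver"},
    {"name": "Scott", "gender": "M", "major": "Physics", "birth_place": "Montreal"},
    {"name": "Laura", "gender": "F", "major": "Physics", "birth_place": "Seattle"},
    {"name": "Ann", "gender": "F", "major": "Civil Eng", "birth_place": "Seattle"},
    {"name": "Bob", "gender": "M", "major": "CS", "birth_place": "Vancouver"},
]

def attribute_oriented_induction(rows, threshold):
    """Return (kept attribute names, Counter of generalized tuples)."""
    plan = []  # PreGen: list of (attribute, generalization operator or None)
    for attr in rows[0]:
        distinct = {row[attr] for row in rows}
        if len(distinct) <= threshold:
            plan.append((attr, None))              # few values: retain as-is
        elif attr in GENERALIZE:
            plan.append((attr, GENERALIZE[attr]))  # generalize one level up
        # otherwise: many values and no operator, so the attribute is removed
    counts = Counter()  # PrimeGen: merge identical tuples, accumulate counts
    for row in rows:
        key = tuple(op(row[a]) if op else row[a] for a, op in plan)
        counts[key] += 1
    return [a for a, _ in plan], counts

attrs, prime_relation = attribute_oriented_induction(initial_relation, threshold=2)
print(attrs)
for generalized_tuple, n in prime_relation.items():
    print(generalized_tuple, n)
```

With a threshold of 2, `name` is removed (five distinct values, no operator), `gender` is retained, and `major` and `birth_place` are generalized before identical tuples are merged.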

Example
DMQL: Describe general characteristics of graduate students in the Big-University database

    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in "graduate"

Transform to the corresponding SQL statement:

    select name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in {"Msc", "MBA", "PhD"}

Example (cont.)

Initial Relation (the last row shows the generalization action taken on each attribute)

    Name            Gender    Major           Birth-Place             Birth_date    Residence                 Phone #   GPA
    Jim Woodman     M         CS              Vancouver, BC, Canada   8-12-76       3511 Main St., Richmond   687-4598  3.67
    Scott Lachance  M         CS              Montreal, Que, Canada   28-7-75       345 1st Ave., Richmond    253-9106  3.70
    Laura Lee       F         Physics         Seattle, WA, USA        25-8-70       125 Austin Ave., Burnaby  420-5232  3.83
    Removed         Retained  to Sci,Eng,Bus  to Country              to Age range  to City                   Removed   to Excl, VG,..

Prime Relation

    Gender  Major    Birth_country  Age_range  Residence  GPA        Count
    M       Science  Canada         20-25      Richmond   Very-good  16
    F       Science  Foreign        25-30      Burnaby    Excellent  22

Cross tabulation of Gender by Birth_Region

    Gender  Canada  Foreign  Total
    M       16      14       30
    F       10      22       32
    Total   26      36       62

Presentation of results
- Generalized relation: relations where some or all attributes are generalized, with counts or other aggregation values accumulated
- Cross tabulation: mapping results into cross tabulation
- Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms
- Quantitative characteristic rules: mapping the generalized result into characteristic rules with quantitative information associated with them (t = typical), e.g.,
    $\mathrm{grad}(x) \wedge \mathrm{male}(x) \Rightarrow \mathrm{birth\_region}(x) = \text{"Canada"}\ [t: 53\%] \vee \mathrm{birth\_region}(x) = \text{"foreign"}\ [t: 47\%]$
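The t-weights in the quantitative characteristic rule come straight from the cross tabulation: each cell is divided by its row total. A small sketch that reproduces the 53% and 47% figures for male graduate students follows; the dictionary layout and the function name `t_weights` are assumptions for illustration.

```python
# Cross tabulation of Gender by Birth_Region, as given in the example.
crosstab = {
    "M": {"Canada": 16, "Foreign": 14},
    "F": {"Canada": 10, "Foreign": 22},
}

def t_weights(row):
    """Fraction of the row's tuples that fall into each generalized value."""
    total = sum(row.values())
    return {value: count / total for value, count in row.items()}

for gender, row in crosstab.items():
    weights = t_weights(row)
    parts = [f'birth_region = "{v}" [t: {w:.0%}]' for v, w in weights.items()]
    print(f"grad(x) and gender(x) = {gender} => " + " or ".join(parts))
```

For the male row this gives 16/30 and 14/30, which round to the 53% and 47% in the rule above.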

Analysis of Attribute Relevance
- Purpose: to filter out statistically irrelevant attributes, or to rank attributes for mining
- Irrelevant attributes lead to inaccurate or unnecessarily complex patterns
- An attribute is highly relevant for classifying/predicting a class if it is likely that its values can be used to distinguish the class from others
- E.g., to describe cheap vs. expensive cars, is color a relevant attribute? What about using color to compare bananas and apples?

Methods
Idea: compute a measure that quantifies the relevance of an attribute with respect to a given class or concept
These measures can be:
- Information gain
- The Gini index
- Uncertainty
- Correlation coefficients

Example
Relevance measure: information gain
Review formulae:
- For a set S of data points, each labeled with a class in C, where $p_i$ is the probability that a point in S belongs to class i:
    $\mathrm{Ent}(S) = -\sum_{i \in C} p_i \log_2 p_i$
- Expected information needed to classify a sample if S is partitioned into subsets $S_i$, where $S_i$ holds the data points whose value of attribute A is i:
    $I(A) = \sum_{i \in \mathrm{dom}(A)} \frac{|S_i|}{|S|} \mathrm{Ent}(S_i)$
- Information gain:
    $\mathrm{Gain}(A) = \mathrm{Ent}(S) - I(A)$

Example (cont)
How relevant is attribute Major to the classification of graduate/undergraduate students?

    Gender  Major     Birth_country  Age_range  GPA        Count
    120 Graduates:
    M       Science   Canada         20-25      Very-good  16
    F       Science   Foreign        25-30      Excellent  22
    M       Eng       Foreign        ..         ..         18
    F       Science   Foreign        ..         ..         25
    M       Science   Canada         ..         ..         21
    F       Eng       Canada         ..         ..         18
    130 Undergraduates:
    M       Science   Foreign        ..         ..         18
    F       Business  Canada         ..         ..         20
    M       Business  Canada         ..         ..         22
    F       Science   Canada         ..         ..         24
    M       Eng       Foreign        ..         ..         22
    F       Eng       Canada         ..         ..         24

Dom(Major) = {Science, Eng, Business}
Partition the data into Sc, Eng, Bus, representing the sets of data points whose Major is Science, Eng and Business, respectively
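The formulae can be applied mechanically once the class counts per Major value are tallied from the Count column (Science: 84 graduates and 42 undergraduates; Eng: 36 and 46; Business: 0 and 42). The sketch below is one way to do the arithmetic; the function names are illustrative, and terms with a zero count are skipped, following the usual convention that 0 log 0 contributes nothing.

```python
from math import log2

def entropy(class_counts):
    """Ent(S) = -sum(p_i * log2(p_i)); zero counts contribute nothing."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def expected_info(partition):
    """I(A) = sum over values i of |S_i|/|S| * Ent(S_i)."""
    total = sum(sum(counts) for counts in partition.values())
    return sum(sum(counts) / total * entropy(counts) for counts in partition.values())

# [graduates, undergraduates] per value of Major, tallied from the relation.
major = {"Science": [84, 42], "Eng": [36, 46], "Business": [0, 42]}

class_totals = [sum(col) for col in zip(*major.values())]  # [120, 130]
ent_s = entropy(class_totals)
info_major = expected_info(major)
gain_major = ent_s - info_major

print(f"Ent(S)      = {ent_s:.4f}")
print(f"I(Major)    = {info_major:.4f}")
print(f"Gain(Major) = {gain_major:.4f}")
```

The same two functions work for Gender or Birth_country by tallying their class counts in the same way, which is what makes ranking attributes by gain straightforward.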

Example (cont)
Recall:
    $\mathrm{Ent}(S) = -\sum_{i \in C} p_i \log_2 p_i$
    $I(A) = \sum_{i \in \mathrm{dom}(A)} \frac{|S_i|}{|S|} \mathrm{Ent}(S_i)$

Class counts by Major, from the relation above:
- 120 Graduates: Science = 84 (= 16+22+25+21), Eng = 36, Business = 0
- 130 Undergraduates: Science = 42, Eng = 46, Business = 42

Class information captured from S:
    $\mathrm{Ent}(S) = -\frac{120}{250} \log_2 \frac{120}{250} - \frac{130}{250} \log_2 \frac{130}{250} = 0.9988$

Expected class information induced by attribute Major (taking $0 \log_2 0 = 0$):
    $\mathrm{Ent}(Sc) = -\frac{84}{126} \log_2 \frac{84}{126} - \frac{42}{126} \log_2 \frac{42}{126} = \ldots$
    $\mathrm{Ent}(Eng) = -\frac{36}{82} \log_2 \frac{36}{82} - \frac{46}{82} \log_2 \frac{46}{82} = \ldots$
    $\mathrm{Ent}(Bus) = -\frac{0}{42} \log_2 \frac{0}{42} - \frac{42}{42} \log_2 \frac{42}{42} = \ldots$
    $I(\mathrm{Major}) = \frac{126}{250} \mathrm{Ent}(Sc) + \frac{82}{250} \mathrm{Ent}(Eng) + \frac{42}{250} \mathrm{Ent}(Bus) = 0.7873$

    $\mathrm{Gain}(\mathrm{Major}) = \mathrm{Ent}(S) - I(\mathrm{Major}) = 0.9988 - 0.7873 = 0.2115$

Similarly, find Gain(Gender), Gain(Birth_country), Gain(Age_range), Gain(GPA)
- We can rank importance or degree of relevance by Gain values
- We can use a threshold to prune out attributes that are less relevant

Class comparison
- Goal: mine properties (or rules) to compare a target class with a contrasting class
- The two classes must be comparable. E.g., address and gender are not comparable; store_address and home_address are comparable; CS students and Eng students are comparable
- Comparable classes should be generalized to the same conceptual level
Approaches:
- Use attribute-oriented induction or data cubes to generalize the data for the two contrasting classes, and then compare the results
- Pattern recognition approach: approximate discriminating rules from a data set, and repeatedly fine-tune them until the errors are small enough

Descriptive statistical measures
Data characteristics that can be computed:
- Central tendency
  - mean: when is the mean not an appropriate measure?
  - median: for a very large data set, how do we compute the median?
- Dispersion
  - five-number summary: Min, Quartile1, Median, Quartile3, Max
  - variance, standard deviation: spread about the mean. What does var = 0 mean?
- Outliers
  - detected by a rule of thumb: values falling at least 1.5 × (Q3 - Q1) above Q3 or below Q1
- Useful displays
  - boxplots, quantile-quantile plot (q-q plot), scatter plot, loess curve

References
- E. Baralis and G. Psaila. Designing templates for mining association rules. Journal of Intelligent Information Systems, 9:7-32, 1997.
- Microsoft Corp. OLEDB for Data Mining, version 1.0. http://www.microsoft.com/data/oledb/dm, Aug. 2000.
- J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A data mining query language for relational databases. DMKD'96, Montreal, Canada, June 1996.
- T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
- R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, pages 122-133, Bombay, India, Sept. 1996.
- A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, Seattle, Washington, June 1998.
- D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.