Data Mining Primitives
CS 5331 by Rattikorn Hewett, Texas Tech University

Outline
- Motivation
- Data mining primitives
- Data mining query languages
- Designing GUIs for data mining systems
- Architectures

Motivations: Why primitives?
- Data mining systems uncover a large set of patterns, and not all of them are interesting.
- Data mining should therefore be an interactive process: the user directs what is to be mined.
- Users need data mining primitives to communicate with the data mining system, by incorporating them in a data mining query language.
- Benefits:
  - More flexible user interaction
  - A foundation for the design of graphical user interfaces
  - Standardization of the data mining industry and practice

Data mining primitives
Data mining tasks can be specified in the form of data mining queries by five data mining primitives:
1. Task-relevant data (input)
2. The kind of knowledge to be mined (function & output)
3. Background knowledge (interpretation)
4. Interestingness measures (evaluation)
5. Visualization of the discovered patterns (presentation)
Task-relevant data
- Specify the data to be mined: database, data warehouse, relation, or cube
- Conditions for selection and grouping
- Relevant attributes

Knowledge to be mined
- Specify the data mining function: characterization/discrimination, association, classification/prediction, clustering

Background knowledge
Typically in the form of concept hierarchies:
- Schema hierarchy, e.g., street < city < state < country
- Set-grouping hierarchy, e.g., {low, high} ⊂ all; {30..49} ⊂ low; {50..100} ⊂ high
- Operation-derived hierarchy, e.g., the email address dmbook@cs.ttu.edu decoded as login-name < department < university < organization
- Rule-based hierarchy, e.g., 87 ≤ temperature < 90 ⇒ normal_temperature

Interestingness
Objective measures:
- Simplicity: simpler rules are easier to understand and more likely to be interesting; e.g., (association) rule length, (decision) tree size
- Certainty: validity of the rule. Rule A ⇒ B has confidence P(B | A) = #(A and B) / #(A); related measures include classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility: potential usefulness. Rule A ⇒ B has support #(A and B) / sample size; a noise threshold (for descriptions)
- Novelty: not previously known, surprising (used to remove redundant rules)
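The support and confidence measures above can be computed directly from transaction data. A minimal sketch, using a hypothetical market-basket data set (the item names are illustrative only):

```python
# Hypothetical transactions; each is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(antecedent, consequent, transactions):
    """Support of A => B: fraction of transactions containing both A and B."""
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    return both / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of A => B: P(B | A) = #(A and B) / #(A)."""
    has_a = sum(1 for t in transactions if antecedent <= t)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    return both / has_a if has_a else 0.0

# Rule {bread} => {milk}: 4 transactions contain bread, 3 of those also contain milk.
s = support({"bread"}, {"milk"}, transactions)     # 3/5 = 0.6
c = confidence({"bread"}, {"milk"}, transactions)  # 3/4 = 0.75
```

In an interestingness filter, a rule is kept only if both values exceed user-specified support and confidence thresholds.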
Visualization of discovered patterns
- Specify the form in which to view the patterns, e.g., rules, tables, charts, decision trees, cubes, reports, etc.
- Specify operations for data exploration at multiple levels of abstraction, e.g., drill-down, roll-up, etc.

DMQL (data mining query language)
- A DMQL can provide the ability to support ad-hoc and interactive data mining
- By providing a standardized language, the hope is to achieve an effect similar to that of SQL on relational databases:
  - A foundation for system development and evolution
  - Facilitating information exchange, technology transfer, commercialization, and wide acceptance
- DMQL is designed around the five primitives described earlier

Languages & standardization efforts
- Association rule language specifications: MSQL (Imielinski & Virmani, 1999); MineRule (Meo, Psaila & Ceri, 1996); query flocks based on Datalog syntax (Tsur et al., 1998)
- OLEDB for DM (Microsoft, 2000): based on OLE, OLE DB, and OLE DB for OLAP; integrates DBMS, data warehouse, and data mining
- CRISP-DM (CRoss-Industry Standard Process for Data Mining): provides a platform and process structure for effective data mining, emphasizing the deployment of data mining technology to solve business problems

Designing GUIs based on DMQL
What tasks should be considered in designing GUIs based on a data mining query language?
- Data collection and data mining query composition
- Presentation of discovered patterns
- Hierarchy specification and manipulation
- Manipulation of data mining primitives
- Interactive multilevel mining
- Other information
Architectures
Coupling a data mining (DM) system with a DB/DW system:
- No coupling: flat-file processing; not recommended
- Loose coupling: fetching data from the DB/DW
- Semi-tight coupling: enhanced DM performance; the DB/DW system provides efficient implementations of a few data mining primitives, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some statistical functions
- Tight coupling: a uniform information processing environment; DM is smoothly integrated into the DB/DW system, and mining queries are optimized based on mining-query analysis, indexing, query processing methods, etc.

Concept Description

Review terms
Descriptive vs. predictive data mining:
- Descriptive: describes the data set in concise, summarative, informative, discriminative forms
- Predictive: constructs models representing the data set and uses them to predict the behavior of unknown data
Concept description involves:
- Characterization: provides a concise and succinct summarization of a given collection of data
- Comparison (discrimination): provides descriptions comparing two or more collections of data
Concept Description vs. OLAP
- Concept description: can handle complex data types (e.g., text, image) for attributes and their aggregations; a more automated process
- OLAP: restricted to a small number of dimension and measure data types; a user-controlled process

Characterization methods
- One approach to characterization is to transform data from low conceptual levels to high ones: data generalization, e.g., daily sales → annual sales, Biology → Science
- Two methods: summarization as in a data cube's OLAP, and hierarchical generalization by attribute-oriented induction

Summarization by OLAP
- Data are stored in data cubes
- Identify summarization computations, e.g., count(), sum(), average(), max()
- Perform the computations and store the results in data cubes
- Generalization and specialization can then be performed on a data cube by roll-up and drill-down
- An efficient implementation of data generalization
- Limitations:
  - Can handle only simple non-numeric data types for dimensions
  - Can handle only summarization of numeric data
  - Does not guide users as to which dimensions to explore or which levels to reach
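The roll-up operation described above amounts to mapping each dimension value to a higher level of its concept hierarchy and re-aggregating the cube cells. A minimal sketch, with hypothetical sales cells and hypothetical city→country and quarter→year hierarchies:

```python
# Cells of a sales "cube" keyed by (city, quarter), with sum() as the measure.
city_to_country = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "USA"}
quarter_to_year = {"Q1-2004": "2004", "Q2-2004": "2004", "Q1-2005": "2005"}

cells = {
    ("Vancouver", "Q1-2004"): 100,
    ("Montreal", "Q1-2004"): 80,
    ("Seattle", "Q2-2004"): 50,
    ("Vancouver", "Q1-2005"): 120,
}

def roll_up(cells, dim_maps):
    """Map each dimension to a higher concept level, then re-aggregate."""
    rolled = {}
    for key, measure in cells.items():
        new_key = tuple(m[v] for m, v in zip(dim_maps, key))
        rolled[new_key] = rolled.get(new_key, 0) + measure  # summarization = sum
    return rolled

rolled = roll_up(cells, [city_to_country, quarter_to_year])
# rolled == {("Canada", "2004"): 180, ("USA", "2004"): 50, ("Canada", "2005"): 120}
```

Drill-down is the inverse direction and, in a real cube, requires the finer-grained cells to still be stored; here it would simply mean returning to `cells`.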
Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop)
- Not confined to categorical data nor particular measures
How is it done?
- Collect the task-relevant data (initial relation) using a relational database query
- Perform generalization by attribute removal or attribute generalization
- Apply aggregation by merging identical generalized tuples and accumulating their respective counts
- Present the results interactively to users

Basic elements
- Data focusing: task-relevant data, including dimensions; the result is the initial relation
- Attribute removal and attribute generalization, for an attribute A with a large set of distinct values:
  - If there is no generalization operator on A, or A's higher-level concepts are expressed in terms of other attributes (giving redundancy), remove A
  - If there exists a set of generalization operators on A, select an operator and generalize A
- Generalization threshold controls:
  - Attribute generalization threshold: controls the number of distinct attribute values allowed before generalization or removal (~2-8, specified or default)
  - Relation generalization threshold: controls the final relation/rule size (~10-30)

General steps
1. InitialRel: query processing of the task-relevant data, deriving the initial relation
2. PreGen: based on the number of distinct values in each attribute, determine a generalization plan for each attribute: removal, or how high to generalize
3. PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a prime generalized relation, accumulating the counts
4. Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross-tabs, and visualization presentations
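The core of the generalization step can be sketched in a few lines: climb each attribute one level in its concept hierarchy (or drop the attribute when no hierarchy exists, as with names), then merge identical generalized tuples while accumulating counts. The hierarchies and tuples below are hypothetical:

```python
# One-level concept hierarchies (hypothetical); attributes without a
# hierarchy (e.g., "name") are removed rather than generalized.
hierarchies = {
    "birth_place": {"Vancouver,BC,Canada": "Canada", "Seattle,WA,USA": "USA"},
    "major": {"CS": "Science", "Physics": "Science"},
}

tuples = [
    {"name": "Jim", "major": "CS", "birth_place": "Vancouver,BC,Canada"},
    {"name": "Laura", "major": "Physics", "birth_place": "Seattle,WA,USA"},
    {"name": "Scott", "major": "CS", "birth_place": "Vancouver,BC,Canada"},
]

def generalize(tuples, hierarchies):
    """Attribute removal + attribute generalization + count accumulation."""
    counts = {}
    for t in tuples:
        # Keep only attributes with a hierarchy, mapped to their higher level.
        key = tuple(sorted((a, hierarchies[a][v]) for a, v in t.items()
                           if a in hierarchies))
        counts[key] = counts.get(key, 0) + 1
    return counts

prime = generalize(tuples, hierarchies)
# Two Canadian-born Science students merge into one generalized tuple (count 2).
```

A full AOI implementation would repeat this climb until each attribute satisfies its generalization threshold; this sketch shows a single pass.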
Example
DMQL: describe the general characteristics of graduate students in the Big-University database

use Big_University_DB
mine characteristics as Science_Students
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in "graduate"

Transformed into the corresponding SQL statement:

select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {"Msc", "MBA", "PhD"}

Example (cont.)
Initial relation (excerpt):

Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                | Phone#   | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83

Generalization plan: Name and Phone# are removed; Gender is retained; Major is generalized to {Sci, Eng, Bus}; Birth-Place to country; Birth_date to an age range; Residence to city; GPA to {Excl, VG, ...}.

Prime relation (excerpt):

Gender | Major   | Birth_country | Age_range | Residence | GPA       | Count
M      | Science | Canada        | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign       | 25-30     | Burnaby   | Excellent | 22

Cross tabulation of Gender by Birth_Region:

       | Canada | Foreign | Total
M      | 16     | 14      | 30
F      | 10     | 22      | 32
Total  | 26     | 36      | 62

Presentation of results
- Generalized relation: a relation where some or all attributes are generalized, with counts or other aggregation values accumulated
- Cross tabulation: mapping the results into a cross tabulation
- Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms
- Quantitative characteristic rules: mapping the generalized result into characteristic rules with associated quantitative information, e.g.,
  ∀x, grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t:53%] ∨ birth_region(x) = "foreign" [t:47%]
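The t-weights in the quantitative characteristic rule come straight from the counts in the cross tab: each disjunct's weight is its count divided by the class total. For the 30 male graduates:

```python
# 16 of the 30 male graduates were born in Canada, 14 abroad (from the cross tab).
counts = {"Canada": 16, "foreign": 14}
total = sum(counts.values())  # 30

# t-weight of each disjunct, as an integer percentage.
t_weights = {region: round(100 * n / total) for region, n in counts.items()}
# t_weights == {"Canada": 53, "foreign": 47}
```

This reproduces the [t:53%] and [t:47%] annotations in the rule above.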
Analysis of attribute relevance
- Goal: filter out statistically irrelevant attributes, or rank attributes for mining
- Irrelevant attributes lead to inaccurate or unnecessarily complex patterns
- An attribute is highly relevant for classifying/predicting a class if its values can be used to distinguish the class from others
- E.g., to describe cheap vs. expensive cars, is color a relevant attribute? What about using color to compare bananas and apples?

Methods
- Idea: compute a measure that quantifies the relevance of an attribute with respect to a given class or concept
- Such measures include: information gain, the Gini index, uncertainty, and correlation coefficients

Example
Relevance measure: information gain. Review of formulae: for a sample set S, each sample labeled with a class in C, where p_i is the probability that a sample in S belongs to class i, the class information captured from S is

  Ent(S) = -Σ_{i∈C} p_i log2 p_i

If attribute A partitions S into subsets S_i (the samples for which A takes value i), the expected information needed to classify a sample given the partition is

  I(A) = Σ_{i∈dom(A)} (|S_i| / |S|) Ent(S_i)

The information gain of A is then

  Gain(A) = Ent(S) - I(A)

Example (cont.)
How relevant is the attribute Major to the classification of graduate vs. undergraduate students?

Gender | Major    | Birth_country | Age_range | GPA       | Count
M      | Science  | Canada        | 20-25     | Very-good | 16
F      | Science  | Foreign       | 25-30     | Excellent | 22
M      | Eng      | Foreign       | ...       | ...       | 18
F      | Science  | Foreign       | ...       | ...       | 25
M      | Science  | Canada        | ...       | ...       | 21
F      | Eng      | Canada        | ...       | ...       | 18
M      | Science  | Foreign       | ...       | ...       | 18
F      | Business | Canada        | ...       | ...       | 20
M      | Business | Canada        | ...       | ...       | 22
F      | Science  | Canada        | ...       | ...       | 24
M      | Eng      | Foreign       | ...       | ...       | 22
F      | Eng      | Canada        | ...       | ...       | 24

dom(Major) = {Science, Eng, Business}; there are 120 graduates and 130 undergraduates.
Partition the data into S_Sc, S_Eng, and S_Bus: the sets of data points whose Major is Science, Eng, and Business, respectively.
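The entropy and information-gain formulae above can be checked numerically with the slide's class counts per Major value (84/42 for Science, 36/46 for Eng, 0/42 for Business, as graduate/undergraduate):

```python
import math

def ent(counts):
    """Entropy of a class distribution given absolute counts (0 log 0 = 0)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# (graduate, undergraduate) counts per value of Major, from the slide's data.
partitions = {"Science": (84, 42), "Eng": (36, 46), "Business": (0, 42)}

n = sum(g + u for g, u in partitions.values())        # 250 samples in total
ent_s = ent([sum(g for g, _ in partitions.values()),  # Ent(S): 120 graduates
             sum(u for _, u in partitions.values())]) # vs. 130 undergraduates

# I(Major): weighted entropy of the partition induced by Major.
i_major = sum((g + u) / n * ent([g, u]) for g, u in partitions.values())
gain_major = ent_s - i_major  # about 0.21, matching the slide's 0.9988 - 0.7873
```

Repeating this with the partitions induced by Gender, Birth_country, etc. would give the other gains for ranking the attributes.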
Example (cont.)
Using the counts above (120 graduates: Science = 84 (= 16+22+25+21), Eng = 36, Business = 0; 130 undergraduates: Science = 42, Eng = 46, Business = 42):

  Ent(S) = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988
  Ent(S_Sc) = -(84/126) log2(84/126) - (42/126) log2(42/126) = 0.9183
  Ent(S_Eng) = -(36/82) log2(36/82) - (46/82) log2(46/82) = 0.9892
  Ent(S_Bus) = -(0/42) log2(0/42) - (42/42) log2(42/42) = 0

  I(Major) = (126/250) Ent(S_Sc) + (82/250) Ent(S_Eng) + (42/250) Ent(S_Bus) = 0.7873

Here Ent(S) is the class information captured from S, and I(Major) is the expected class information induced by the attribute Major.
Example (cont.)

  Gain(Major) = Ent(S) - I(Major) = 0.9988 - 0.7873 = 0.2115

Similarly, we can find Gain(Gender), Gain(Birth_country), Gain(Age_range), and Gain(GPA).
- We can rank the importance or degree of relevance of the attributes by their Gain values
- We can use a threshold to prune out attributes that are less relevant

Class comparison
- Goal: mine properties (or rules) that compare a target class with a contrasting class
- The two classes must be comparable:
  - E.g., address and gender are not comparable
  - store_address and home_address are comparable
  - CS students and Eng students are comparable
- Comparable classes should be generalized to the same conceptual level
Approaches:
- Use attribute-oriented induction or data cubes to generalize the data for the two contrasting classes, then compare the results
- Pattern recognition approach: approximate discriminating rules from a data set and repeatedly fine-tune them until the errors are small enough
Descriptive statistical measures
Data characteristics that can be computed:
- Central tendency:
  - Mean (when is the mean not an appropriate measure?)
  - Median (for a very large data set, how do we compute the median?)
- Dispersion:
  - Five-number summary: Min, Quartile1, Median, Quartile3, Max
  - Variance and standard deviation: spread about the mean (what does var = 0 mean?)
- Outliers: detected by rules of thumb, e.g., values falling at least 1.5 × (Q3 - Q1) above Q3 or below Q1
- Useful displays: boxplots, quantile-quantile (q-q) plots, scatter plots, loess curves

References
- E. Baralis and G. Psaila. Designing templates for mining association rules. Journal of Intelligent Information Systems, 9:7-32, 1997.
- Microsoft Corp. OLEDB for Data Mining, version 1.0, http://www.microsoft.com/data/oledb/dm, Aug. 2000.
- J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A data mining query language for relational databases. DMKD'96, Montreal, Canada, June 1996.
- T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
- R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, pages 122-133, Bombay, India, Sept. 1996.
- A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, Seattle, Washington, June 1998.
- D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.