The KDD Process: Applying Nuno Cavalheiro Marques (nmm@di.fct.unl.pt) Spring Semester 2010/2011 MSc in Computer Science
Outline I
1 Knowledge Discovery in Data
    beyond the Computer
2 by Visualization
    Lift and ROC Charts
    Multidimensional Data visualization
3 Decision Trees for
    Representation
    Information Gain
    Hypothesis Space
    Issues
    SLIQ: A Fast Scalable Classifier for
4 with SOM
    SOM Training
    SOM Visualization and Clustering
    Parallel SOM [SM07]
Outline II 5 References
beyond the Computer The Tabulating Machine
beyond the Computer KDD is Interactive...
Figure: KDD process diagram. Labels: Input data (databases, texts), Data Warehousing, Selection and sampling, Target Data, Preprocessing and cleaning, Clean Data, aggregation, Models, Visualization, Knowledge.
Multidimensional Data visualization Information Visualization and Related Topics Please check PDF file InformationVisualization.
Representation Decision Tree for PlayTennis (in [M97])
Outlook
    Sunny: Humidity
        High: No
        Normal: Yes
    Overcast: Yes
    Rain: Wind
        Strong: No
        Weak: Yes
Information Gain Definition I
Gain(S, A) = expected reduction in entropy due to sorting on A
$$Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)$$
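The definition can be checked numerically with a short sketch. The functions `entropy` and `gain` below are illustrative names. The sample is built from the class counts of the Wind attribute in the PlayTennis example of [M97]: S = [9+, 5-], Weak = [6+, 2-], Strong = [3+, 3-].

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) in bits for a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def gain(values, labels):
    """Gain(S, A): entropy of S minus the weighted entropy of each subset S_v."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Wind attribute: Weak covers [6+, 2-], Strong covers [3+, 3-].
wind = ["Weak"] * 8 + ["Strong"] * 6
play = ["Yes"] * 6 + ["No"] * 2 + ["Yes"] * 3 + ["No"] * 3
print(round(entropy(play), 3), round(gain(wind, play), 3))
```

With these counts the entropy of S comes out at about 0.940 bits and Gain(S, Wind) at about 0.048.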
Hypothesis Space Search by ID3
Figure: ID3's search through the space of decision trees, growing candidate trees by adding tests on attributes A1, A2, A3, A4, ...
Hypothesis Space Properties of ID3
Hypothesis space is complete!
    Target function surely in there...
Outputs a single hypothesis (which one?)
    Can't play 20 questions...
No backtracking
    Local minima...
Statistically-based search choices
    Robust to noisy data...
Inductive bias: approximately "prefer shortest tree"
Hypothesis Space Inductive Bias in ID3
Note H is the power set of instances X
Unbiased? Not really...
    Preference for short trees, and for those with high information gain attributes near the root
    Bias is a preference for some hypotheses, rather than a restriction of hypothesis space H
    Occam's razor: prefer the shortest hypothesis that fits the data
Issues Gini Index or Entropy?
$$gini(t) = 1 - \sum_j p_j^2$$
Figure: plot comparing the Gini index with the entropy H(X).
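A small sketch can stand in for the comparison between the two impurity measures. It assumes that p_j is the relative frequency of class j at node t and that entropy is measured in bits (base-2 logarithm); the scale used in the original plot may differ. The function names are illustrative.

```python
from math import log2


def gini(probs):
    """gini(t) = 1 - sum_j p_j^2 for a class distribution at node t."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """H(X) in bits; classes with zero probability contribute nothing."""
    return sum(p * log2(1.0 / p) for p in probs if p > 0)


# Two-class node where the first class has probability p.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are zero for a pure node and peak at the uniform distribution, so they tend to rank candidate splits similarly; Gini avoids the logarithm.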
Issues Continuous Valued Attributes I
Create a discrete attribute to test a continuous one:
    Temperature = 82.5
    (Temperature > 72.3) = t, f

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
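One common way to pick the threshold, sketched here as an assumption rather than as the slide's own procedure, is to sort the examples by the continuous attribute, take the midpoints where the class label changes as candidates, and keep the candidate with the highest information gain. The data are the six Temperature/PlayTennis pairs above; the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels (0 for an empty list)."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [
        (a + b) / 2
        for (a, label_a), (b, label_b) in zip(pairs, pairs[1:])
        if label_a != label_b
    ]


def threshold_gain(values, labels, threshold):
    """Information gain of the boolean test (value > threshold)."""
    below = [y for x, y in zip(values, labels) if x <= threshold]
    above = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    remainder = (len(below) / total) * entropy(below)
    remainder += (len(above) / total) * entropy(above)
    return entropy(labels) - remainder


temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
candidates = candidate_thresholds(temperature, play_tennis)
best = max(candidates, key=lambda t: threshold_gain(temperature, play_tennis, t))
print(candidates, best)
```

For this data the candidates are the midpoints 54 and 85, and the test Temperature > 54 has the larger gain.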
Issues Attributes with Many Values I
Problem: If attribute has many values, Gain will select it
    Imagine using Date = Jun 3 1996 as attribute
One approach: use GainRatio instead
$$GainRatio(S, A) \equiv \frac{Gain(S, A)}{SplitInformation(S, A)}$$
$$SplitInformation(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$
where S_i is the subset of S for which A has value v_i
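The effect of GainRatio can be illustrated with a tiny, hypothetical four-example sample in which a Date-like attribute takes a different value for every example. Both attributes below reach the same Gain, but SplitInformation penalises the many-valued one. The data, the function names, and the guard against a zero SplitInformation are assumptions made for this sketch.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Entropy in bits of the empirical distribution of a list of items."""
    total = len(items)
    return sum((n / total) * log2(total / n) for n in Counter(items).values())


def gain(values, labels):
    """Gain(S, A) for attribute values aligned with class labels."""
    total = len(labels)
    remainder = sum(
        (count / total) * entropy([y for x, y in zip(values, labels) if x == v])
        for v, count in Counter(values).items()
    )
    return entropy(labels) - remainder


def split_information(values):
    """SplitInformation(S, A): entropy of S with respect to the values of A."""
    return entropy(values)


def gain_ratio(values, labels):
    """GainRatio(S, A); returns 0.0 when A has a single value (assumed guard)."""
    split = split_information(values)
    return gain(values, labels) / split if split > 0 else 0.0


labels = ["Yes", "Yes", "No", "No"]
date = ["d1", "d2", "d3", "d4"]              # hypothetical: unique per example
wind = ["Weak", "Weak", "Strong", "Strong"]  # hypothetical two-valued attribute
for name, values in (("Date", date), ("Wind", wind)):
    print(name, gain(values, labels), gain_ratio(values, labels))
```

Here Date and Wind both have Gain 1.0, but Date's SplitInformation is 2 bits against 1 bit for Wind, so GainRatio prefers Wind.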
Issues Unknown Attribute Values
What if some examples are missing values of A?
Use the training example anyway, sort it through the tree. If node n tests A:
    assign the most common value of A among other examples sorted to node n, or
    assign the most common value of A among other examples with the same target value, or
    assign probability p_i to each possible value v_i of A
        assign fraction p_i of the example to each descendant in the tree
Classify new examples in the same fashion
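The three options can be sketched on a handful of hypothetical examples that have reached a node testing Humidity, one of which has no Humidity value. The data and the helper names `most_common_value` and `value_probabilities` are assumptions for illustration only.

```python
from collections import Counter


def most_common_value(examples, attribute, target=None, target_value=None):
    """Most common known value of `attribute`, optionally within one class."""
    known = [
        e[attribute]
        for e in examples
        if e[attribute] is not None
        and (target is None or e[target] == target_value)
    ]
    return Counter(known).most_common(1)[0][0]


def value_probabilities(examples, attribute):
    """Estimate p_i for each value v_i from the examples where it is known."""
    known = [e[attribute] for e in examples if e[attribute] is not None]
    return {v: n / len(known) for v, n in Counter(known).items()}


# Hypothetical examples sorted to a node that tests Humidity.
examples = [
    {"Humidity": "High", "Play": "No"},
    {"Humidity": "High", "Play": "No"},
    {"Humidity": "High", "Play": "Yes"},
    {"Humidity": "Normal", "Play": "Yes"},
    {"Humidity": "Normal", "Play": "Yes"},
    {"Humidity": None, "Play": "Yes"},  # value of A is missing
]

print(most_common_value(examples, "Humidity"))                 # option 1
print(most_common_value(examples, "Humidity", "Play", "Yes"))  # option 2
print(value_probabilities(examples, "Humidity"))               # option 3
```

Under the third option the incomplete example would be passed down every branch with weight p_i, so it contributes fractional counts to later gain computations.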
SLIQ: A Fast Scalable Classifier for SLIQ: A Fast Scalable Classifier for Please check PDF file for SLIQ presentation...
Basic Model for Self-Organizing Maps (SOM) Basic equations
$$c = \arg\min_i \| x - m_i \|$$
$$m_i(t + 1) = m_i(t) + h_{ci}(t) \, [x(t) - m_i(t)]$$
h_{ci}(t): function for creating the (usually 2D) map effect, relating nearby neurons
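The two equations translate almost directly into code. The sketch below performs a single training step on a tiny 2x2 map. The Gaussian form of h_ci, the parameters `alpha` and `sigma`, the initial weights, and all function names are assumptions made for illustration; the slides do not fix a particular neighbourhood function.

```python
import math


def best_matching_unit(weights, x):
    """c = argmin_i ||x - m_i|| over a dict {grid position: weight vector}."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def neighbourhood(c, i, alpha, sigma):
    """Assumed Gaussian h_ci: learning rate damped by grid distance to c."""
    d2 = (c[0] - i[0]) ** 2 + (c[1] - i[1]) ** 2
    return alpha * math.exp(-d2 / (2 * sigma ** 2))


def som_step(weights, x, alpha=0.5, sigma=1.0):
    """Apply m_i(t+1) = m_i(t) + h_ci(t) [x(t) - m_i(t)] to every unit."""
    c = best_matching_unit(weights, x)
    for pos, m in weights.items():
        h = neighbourhood(c, pos, alpha, sigma)
        weights[pos] = [mk + h * (xk - mk) for mk, xk in zip(m, x)]
    return c


# 2x2 map with hypothetical initial weights at the corners of the unit square.
weights = {
    (0, 0): [0.0, 0.0],
    (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0],
    (1, 1): [1.0, 1.0],
}
x = [0.9, 0.8]
c = som_step(weights, x)
print(c, weights[c])
```

The winning unit moves most, and its grid neighbours move by a smaller amount that decays with grid distance, which is what produces the map effect.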
SOM Training Competitive Learning Web effect ([?])
SOM Visualization and Clustering Data Visualization and Clustering in SOM UCI's Credit [?] UCI's Adult [SM07]
Parallel SOM[SM07] Training SOM with two phases
Topological order: small number of epochs
Convergence: big number of epochs
Basic idea: exploit the two-phase behaviour
Figure: SOM train example from [?]
Parallel SOM[SM07] Hybrid Algorithm [SM07]
Merge advantages of Network-Partition and Data-Partition algorithms
Take advantage of the two phases while training the SOM
Algorithm
    During Topological order: simple data-partition method
    During Convergence: hybrid mode for segmenting patterns and map every X epochs
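As a rough illustration of the data-partition idea only, the sketch below runs one epoch of a batch SOM in which each partition of the patterns yields partial sums that are merged before the weights are updated. This is a generic scheme and not the implementation of [SM07]: the batch update rule, the 1D map, the Gaussian neighbourhood, and all names are assumptions, and the hybrid segmentation of map and patterns is not shown.

```python
import math


def partial_sums(weights, patterns, sigma):
    """One worker's share of the epoch, computed on its own slice of patterns."""
    num = [[0.0] * len(weights[0]) for _ in weights]
    den = [0.0] * len(weights)
    for x in patterns:
        # Best matching unit for this pattern.
        c = min(range(len(weights)), key=lambda i: math.dist(weights[i], x))
        for i in range(len(weights)):
            # Assumed Gaussian neighbourhood on a 1D map.
            h = math.exp(-((c - i) ** 2) / (2 * sigma ** 2))
            den[i] += h
            for k, xk in enumerate(x):
                num[i][k] += h * xk
    return num, den


def merge_and_update(parts):
    """Add up the partial sums of all workers and compute the new weights."""
    n_units = len(parts[0][1])
    dim = len(parts[0][0][0])
    new_weights = []
    for i in range(n_units):
        den = sum(d[i] for _, d in parts)
        new_weights.append(
            [sum(n[i][k] for n, _ in parts) / den for k in range(dim)]
        )
    return new_weights


def batch_epoch(weights, partitions, sigma=1.0):
    """One epoch; each partial_sums call could run on a separate worker."""
    return merge_and_update([partial_sums(weights, p, sigma) for p in partitions])


# Hypothetical 3-unit map and four 2D patterns split across two workers.
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
data = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.6], [0.2, 0.1]]
print(batch_epoch(weights, [data[:2], data[2:]]))
```

Because the weights are only read during an epoch, splitting the patterns across workers gives the same result as a sequential pass, up to floating-point rounding.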
Parallel SOM[SM07] Segmenting the hybrid algorithm Segmenting the map and patterns with histogram Need to measure segment sample migration Figure: Asymmetrical segmentation example
Parallel SOM[SM07] Goal: Qualitative validation of topological information Figure: DS Chainlink and U-Matrix Figure: U-Matrices with hybrid algorithm
Main References I [M97] Mitchell, T.M.: Machine Learning. McGraw-Hill (March 1997). Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A Fast Scalable Classifier for Data Mining. In: Advances in Database Technology, LNCS, vol. 1057 (1996).
Main References II [SM07] Silva, B., Marques, N.: A hybrid parallel SOM algorithm for large maps in data-mining. In: Neves, J., Santos, M.F., Machado, J. (eds.) New Trends in Artificial Intelligence, Guimarães, Portugal, December 2007. Associação Portuguesa para a Inteligência Artificial (APPIA).