Data Mining Tools
Jean-Gabriel Ganascia
LIP6, University Pierre et Marie Curie
4, place Jussieu, 75252 Paris Cedex 05
Jean-Gabriel.Ganascia@lip6.fr
[Figure: the data mining process, starting from the databases (DB). Stage labels: Extraction, Selection, Pre-treatment, Reformulation, Data mining, Interpretation/Visualization, Evaluation. Other labels: domain knowledge, dimension reduction; supervised, non-supervised; symbolic, sequences; graphs, rules, 3D, AR, VR, ...; SQL/OQL, ad hoc, Google, Yahoo, AltaVista, ... Systems named: Wspot, ID3, C4.5, CHARADE, Cobweb, FLEXPAT, FOIL, REMO, COING, ... Logo: Equipe ACASA, LIP6, UPMC, Sorbonne Universités.]
Free Tools
  R project: statistical library
  TANAGRA, Sipina (Lyon): http://eric.univ-lyon2.fr/~ricco/tanagra/fr/tanagra.html
  Weka: New Zealand (Java language)
  Orange: Slovenia (Python language)
  RapidMiner (formerly YALE)
  AlphaMiner
  Mallet: Machine Learning for Language Toolkit (Java language), University of Massachusetts, http://mallet.cs.umass.edu
What do those tools contain?
  Input file
  File formats: .tab, .arff, etc.
Input type .tab
  Line 1: attribute names
  Line 2: attribute types
  Line 3: class
  Separator: tab

  Example file lenses.tab

  age         prescription  astigmatic  tear_rate  lenses
  discrete    discrete      discrete    discrete   discrete
                                                   class
  young       myope         no          reduced    none
  young       myope         no          normal     soft
  presbyopic  hypermetrope  yes         normal     none
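The three-line header described above can be read with a few lines of standard-library Python. This is only a minimal sketch, assuming that line 3 carries the word `class` under the target column, as in the lenses.tab excerpt; the function name `read_tab` and the embedded sample string are illustrative, not part of any tool.

```python
import csv
import io

# Sample content mirroring the lenses.tab excerpt (tab-separated).
LENSES_TAB = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
    "\t\t\t\tclass\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
    "presbyopic\thypermetrope\tyes\tnormal\tnone\n"
)


def read_tab(stream):
    """Parse a three-line-header .tab file into (names, types, class_name, rows)."""
    reader = csv.reader(stream, delimiter="\t")
    names = next(reader)   # line 1: attribute names
    types = next(reader)   # line 2: attribute types
    flags = next(reader)   # line 3: "class" marks the target column
    # Pad the flag line in case trailing empty cells were trimmed.
    flags += [""] * (len(names) - len(flags))
    class_name = next(
        (n for n, f in zip(names, flags) if f.strip() == "class"), None
    )
    rows = [dict(zip(names, row)) for row in reader if row]
    return names, types, class_name, rows


names, types, class_name, rows = read_tab(io.StringIO(LENSES_TAB))
print(class_name)          # name of the column flagged as the class
print(rows[1]["lenses"])   # class value of the second example
```

Each example becomes a dictionary keyed by attribute name, which keeps the sketch independent of column order.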
«ARFF» input: Attribute-Relation File Format
  Header
    Comments are preceded by %
    @RELATION <relation name> (1 line)
    @ATTRIBUTE <attribute name> <attribute type> (list of all attributes, 1 per line)
  @DATA
    <val A1>, <val A2>, ... (list of all examples, 1 per line)
  Types:
    Numeric
    <nominal-specification>: set of values
    String: within quotes if the string contains blanks
    Date [<date format>]
Example ARFF Header
  % 1. Title: Iris Plants Database
  %
  % 2. Sources:
  %      (A) Creator: R.A. Fisher
  %      (B) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
  %      (C) Date: July, 1988
  %
  @RELATION iris
  @ATTRIBUTE sepallength NUMERIC
  @ATTRIBUTE sepalwidth NUMERIC
  @ATTRIBUTE petallength NUMERIC
  @ATTRIBUTE petalwidth NUMERIC
  @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Example ARFF Data
  @DATA
  5.1,3.5,1.4,0.2,Iris-setosa
  4.9,3.0,1.4,0.2,Iris-setosa
  4.7,3.2,1.3,0.2,Iris-setosa
  4.6,3.1,1.5,0.2,Iris-setosa
  5.0,3.6,1.4,0.2,Iris-setosa
  5.4,3.9,1.7,0.4,Iris-setosa
  4.6,3.4,1.4,0.3,Iris-setosa
  5.0,3.4,1.5,0.2,Iris-setosa
  4.4,2.9,1.4,0.2,Iris-setosa
  4.9,3.1,1.5,0.1,Iris-setosa
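To make the header/data split concrete, here is a small sketch of an ARFF reader in plain Python. It assumes the standard `@RELATION`, `@ATTRIBUTE` and `@DATA` keywords and deliberately ignores quoted strings, dates, sparse rows and numeric type names other than `NUMERIC`; the function name `parse_arff` is hypothetical, and the real tools ship their own, far more complete loaders.

```python
ARFF_TEXT = """\
% Iris excerpt
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""


def parse_arff(text):
    """Very small ARFF reader: no quoting, no sparse rows, no dates."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                # Convert numeric attributes; keep nominal values and "?" as text.
                numeric = kind == "NUMERIC" and value != "?"
                row.append(float(value) if numeric else value)
            data.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.upper()  # ARFF keywords are case-insensitive
        if keyword == "@RELATION":
            relation = rest.strip()
        elif keyword == "@ATTRIBUTE":
            name, _, kind = rest.strip().partition(" ")
            kind = kind.strip()
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.upper()
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, data


relation, attributes, data = parse_arff(ARFF_TEXT)
print(relation, len(attributes), data[0])
```

The parser returns the attribute list in declaration order, so the last column of each data row lines up with the nominal class attribute.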
Sparse ARFF
  For data with many null (0) values
  Same header as ARFF; only the data section differs
  Non-null attributes are identified by their rank, counted from 0

  Example ARFF
    @data
    0, X, 0, Y, "class A"
    0, 0, W, 0, "class B"

  Example Sparse ARFF
    @data
    {1 X, 3 Y, 4 "class A"}
    {2 W, 4 "class B"}

  Remark: absent values correspond to 0; missing values are identified with ?
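The correspondence between the two notations can be sketched as a conversion from one sparse row to its dense form. The helper name `sparse_to_dense` is hypothetical, and the sketch assumes that values contain no commas; ranks are counted from 0, as in the example above.

```python
def sparse_to_dense(line, n_attributes):
    """Expand one sparse ARFF row, e.g. '{1 X, 3 Y, 4 "class A"}', to a dense list."""
    dense = ["0"] * n_attributes          # omitted entries mean 0, not "missing"
    body = line.strip().strip("{}").strip()
    if not body:
        return dense                      # "{}" is a row made only of zeros
    for pair in body.split(","):
        # Each pair is "<rank> <value>"; the value may be quoted.
        index, _, value = pair.strip().partition(" ")
        dense[int(index)] = value.strip().strip('"')
    return dense


row_a = sparse_to_dense('{1 X, 3 Y, 4 "class A"}', 5)
row_b = sparse_to_dense('{2 W, 4 "class B"}', 5)
print(row_a)
print(row_b)
```

A `?` stored at some rank is kept as `?`, so an explicitly missing value stays distinct from an omitted (zero) one.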
Other steps
  Data preparation
    Feature selection
    Data selection
    Digitalization
    Sampling
    Outliers
    File merging (join)
    Concatenation
  Data visualization
  Classification
  Regression
  Evaluation
  Non-supervised learning
  Association rules
  Text mining
Data visualization
  Exploratory Data Analysis
  Distributions
  Linear projection
  Attribute statistics
  Correspondence analysis
  Mosaic diagrams
Classification
  Bayesian classification
  Logistic regression
  K nearest neighbors
  Trees
  C4.5
  CN2
  SVM
  Visualization of the classification
    Trees
    CN2 rules
Non-supervised learning
  Distance matrix between examples
  Distance matrix between attributes
  Dendrograms
  K-means
Evaluation of supervised learning
  Separation
    Random
    Leave one out
    Cross validation
  Indices
    Precision-recall
    ROC
  Test: training set / test set
  Confusion matrix
  ROC analysis
  Prediction
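As an illustration of the separation schemes and indices listed above, the following sketch builds k-fold train/test splits (leave one out is the special case where k equals the number of examples) and derives precision and recall from a confusion matrix. The function names and the toy labels are made up for the example; the tools presented here provide their own evaluation components.

```python
import random
from collections import Counter


def k_fold_indices(n_examples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)        # random separation
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def precision_recall(matrix, positive):
    """Precision and recall for one class taken as the positive label."""
    tp = matrix[(positive, positive)]
    fp = sum(c for (a, p), c in matrix.items() if p == positive and a != positive)
    fn = sum(c for (a, p), c in matrix.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels, invented for the illustration.
actual = ["soft", "none", "none", "soft"]
predicted = ["soft", "none", "soft", "soft"]
matrix = confusion_matrix(actual, predicted)
print(precision_recall(matrix, "soft"))

splits = list(k_fold_indices(10, 5))       # 5-fold cross validation
loo = list(k_fold_indices(4, 4))           # leave one out on 4 examples
```

Every example appears in exactly one test fold, so averaging the per-fold scores uses each example for testing once.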
Association rules
  Extraction of association rules
  Visualization of association rules
  Frequent sets
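The link between frequent sets and association rules can be sketched with a brute-force enumeration: a set is frequent when its support reaches a threshold, and the confidence of a rule A -> B is support(A and B) / support(A). This is an illustrative sketch on invented transactions, not the algorithm used by any of the tools above, which rely on much more efficient search strategies.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of itemsets whose support >= min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            # Support: fraction of transactions containing the whole candidate.
            support = sum(set(candidate) <= t for t in transactions) / n
            if support >= min_support:
                frequent[frozenset(candidate)] = support
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    return frequent


def confidence(frequent, antecedent, consequent):
    """confidence(A -> B) = support(A and B) / support(A)."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Toy transactions, invented for the illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
print(sorted(sorted(s) for s in frequent))
print(confidence(frequent, {"butter"}, {"bread"}))
```

The early exit relies on the fact that every subset of a frequent set is itself frequent, which is the same property that level-wise algorithms exploit to prune candidates.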
Specialized applications
  Bioinformatics
    Genome databases
    Gene selection
    Profiles
  Text mining
    Text file
    Preprocessing (TF.IDF, lemmatization, stemming, ...)
    Bags of words
    N-grams of characters
    N-grams of words
    Feature extraction
    Distance
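The text representations named above can be illustrated in a few lines. The sketch below assumes whitespace tokenisation, raw counts for the term frequency and idf = log(N / df); many other TF.IDF weightings exist, and the function names and sample documents are invented for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All overlapping word n-grams, using whitespace tokenisation."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def tf_idf(documents):
    """One {term: weight} dict per document; tf = raw count, idf = log(N / df)."""
    bags = [Counter(doc.lower().split()) for doc in documents]  # bags of words
    df = Counter(term for bag in bags for term in bag)          # document frequency
    n_docs = len(documents)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in bag.items()}
        for bag in bags
    ]


docs = ["data mining tools", "text mining"]   # invented sample documents
weights = tf_idf(docs)
print(char_ngrams("weka", 2))
print(word_ngrams(docs[0], 2))
print(weights[0])
```

With this weighting a term present in every document gets weight 0, which is why "mining" carries no information for telling the two sample documents apart.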
Weka Written in Java
Weka http://www.cs.waikato.ac.nz/ml/weka/
Orange
  University of Ljubljana, Slovenia
  Programmed with Python
  http://www.ailab.si/orange/