



Data Mining Tools
Jean-Gabriel Ganascia
LIP6, University Pierre et Marie Curie
4, place Jussieu, 75252 Paris, Cedex 05
Jean-Gabriel.Ganascia@lip6.fr

[Figure: the data mining process, from data bases (DB) to knowledge. Stage labels: Selection, Pre-treatment, Data mining / Extraction, Interpretation / Visualization, Evaluation. Other labels: SQL / OQL, ad hoc; Google, Yahoo, AltaVista, ...; reformulation, domain knowledge, reducing dimensions; supervised, non-supervised; symbolic, sequences; graphs, rules, 3D, AR, VR, ...; Wspot, ID3, C4.5, CHARADE, Cobweb, FLEXPAT, FOIL, REMO, COING, ... Équipe ACASA, LIP6, UPMC, Sorbonne Universités.]

Free Tools
- R project: statistical library
- TANAGRA, Sipina (Lyon): http://eric.univ-lyon2.fr/~ricco/tanagra/fr/tanagra.html
- Weka, New Zealand (Java language)
- Orange, Slovenia (Python language)
- RapidMiner (formerly YALE)
- AlphaMiner
- Mallet, Machine Learning for Language Toolkit (Java language), University of Massachusetts: http://mallet.cs.umass.edu

What do those tools contain?
- Input file
- File formats: .tab, ARFF, etc.

Input type .tab
- Line 1: attribute names
- Line 2: attribute types
- Line 3: class (the word "class" is placed under the class attribute)
- Separator: tab

Example file lenses.tab (columns are separated by tabs):

    age         prescription  astigmatic  tear_rate  lenses
    discrete    discrete      discrete    discrete   discrete
                                                     class
    young       myope         no          reduced    none
    young       myope         no          normal     soft
    presbyopic  hypermetrope  yes         normal     none
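The three-line header described above can be read with a few lines of standard-library Python. This is only a sketch: `read_tab` is a hypothetical helper written for this illustration, not a function of Orange or any other tool, and it assumes that the third line carries the word "class" under the class attribute, as in the lenses.tab excerpt.

```python
import csv
import io

# Excerpt of lenses.tab from the slide; fields are separated by tabs.
LENSES_TAB = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
    "\t\t\t\tclass\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
    "presbyopic\thypermetrope\tyes\tnormal\tnone\n"
)


def read_tab(text):
    """Hypothetical helper: split a .tab file into header parts and examples."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    names, types, flags = rows[0], rows[1], rows[2]
    # Pad the flag row in case trailing empty fields were dropped.
    flags = flags + [""] * (len(names) - len(flags))
    class_name = names[flags.index("class")]
    examples = [dict(zip(names, row)) for row in rows[3:]]
    return names, types, class_name, examples


names, types, class_name, examples = read_tab(LENSES_TAB)
print(class_name)             # lenses
print(examples[1]["lenses"])  # soft
```

Each example comes back as a dictionary keyed by attribute name, which keeps the sketch independent of the column order.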

Input «ARFF»: Attribute-Relation File Format
Header
- Comments are preceded by %
- @RELATION <relation name> (1 line)
- @ATTRIBUTE <attribute name> <attribute type> (list of all attributes, 1 per line)
- @DATA <val A1>, <val A2>, ... (list of all examples, 1 per line)
Types
- Numeric
- <nominal-specification>: set of values
- String: in quotes if the string contains blanks
- Date [<date format>]

Example ARFF Header

    % 1. Title: Iris Plants Database
    %
    % 2. Sources:
    %    (A) Creator: R.A. Fisher
    %    (B) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    %    (C) Date: July, 1988
    %
    @RELATION iris
    @ATTRIBUTE sepallength NUMERIC
    @ATTRIBUTE sepalwidth NUMERIC
    @ATTRIBUTE petallength NUMERIC
    @ATTRIBUTE petalwidth NUMERIC
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Example ARFF Data

    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
    4.6,3.4,1.4,0.3,Iris-setosa
    5.0,3.4,1.5,0.2,Iris-setosa
    4.4,2.9,1.4,0.2,Iris-setosa
    4.9,3.1,1.5,0.1,Iris-setosa
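To make the layout of header and data concrete, here is a minimal reader for the ARFF subset used in the Iris example. It is a sketch under stated assumptions, not Weka's parser: `parse_arff` is a hypothetical name, and it handles only comments, @RELATION, unquoted @ATTRIBUTE names, numeric and nominal types, and dense comma-separated data rows.

```python
IRIS_ARFF = """\
% 1. Title: Iris Plants Database
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

NUMERIC_TYPES = ("numeric", "real", "integer")


def parse_arff(text):
    """Hypothetical helper: return (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for value, (_, kind) in zip(values, attributes):
                if value == "?":
                    row.append(None)  # missing value
                elif kind in NUMERIC_TYPES:
                    row.append(float(value))
                else:
                    row.append(value)
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()  # ARFF keywords are case-insensitive
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(IRIS_ARFF)
print(relation)  # iris
print(rows[0])   # [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
```

Quoted names, string and date attributes, and values containing commas are deliberately left out to keep the sketch short.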

Sparse ARFF
- Useful when there are many null (0) values
- Same format as ARFF, except for the data section
- Non-null attributes are identified by their rank (counting from 0), followed by their value

Example ARFF

    @data
    0, X, 0, Y, "class A"
    0, 0, W, 0, "class B"

Example Sparse ARFF

    @data
    {1 X, 3 Y, 4 "class A"}
    {2 W, 4 "class B"}

Remark: absent values correspond to 0, not to missing values; missing values are identified with ?
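The correspondence between the two notations can be checked with a short conversion routine. This is an illustrative sketch: `sparse_to_dense` is a hypothetical helper, it assumes zero-based ranks as in the example, string values with blanks written in double quotes, and no commas inside values.

```python
def sparse_to_dense(line, n_attributes, default="0"):
    """Hypothetical helper: expand one sparse ARFF row into a dense list."""
    body = line.strip().strip("{}")
    dense = [default] * n_attributes  # absent entries stand for 0
    for entry in body.split(","):
        entry = entry.strip()
        if not entry:
            continue
        index, _, value = entry.partition(" ")
        dense[int(index)] = value.strip().strip('"')
    return dense


print(sparse_to_dense('{1 X, 3 Y, 4 "class A"}', 5))
# ['0', 'X', '0', 'Y', 'class A']
print(sparse_to_dense('{2 W, 4 "class B"}', 5))
# ['0', '0', 'W', '0', 'class B']
```

A value written as `?` is copied through unchanged, so missing values stay distinguishable from the implicit zeros.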

Other steps
- Data preparation
  - Feature selection
  - Data selection
  - Digitalization
  - Sampling
  - Outliers
  - File merging (join)
  - Concatenation
- Data visualization
- Classification
- Regression
- Evaluation
- Non-supervised learning
- Association rules
- Text mining

Data visualization
- Exploratory Data Analysis
- Distributions
- Linear projection
- Attribute statistics
- Correspondence analysis
- Mosaic diagrams

Classification
- Bayesian classification
- Logistic regression
- K nearest neighbors
- Trees
- C4.5
- CN2
- SVM
- Visualization of the classification
  - Trees
  - CN2 rules
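Of the classifiers listed, k nearest neighbors is the simplest to write out in full. The sketch below uses only the Python standard library; the function name `knn_predict` and the toy training points are assumptions made for illustration and do not come from any of the tools above.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set: (feature vector, label).
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))  # a
print(knn_predict(train, (5.0, 5.2)))  # b
```

Euclidean distance is used here; real tools usually let the user choose the distance and the value of k.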

Non-supervised learning
- Distance matrix between examples
- Distance matrix between attributes
- Dendrograms
- K-means
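A distance matrix between examples and a basic k-means loop can be sketched as follows. The code is a simplified illustration: the function names are hypothetical, the initial centroids are naively taken as the first k points, and the toy points are invented for the example.

```python
import math


def distance_matrix(points):
    """Euclidean distance between every pair of examples."""
    return [[math.dist(a, b) for b in points] for a in points]


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a naive initialization (first k points)."""
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The same distance matrix is the usual input for building a dendrogram by hierarchical clustering.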

Evaluation of supervised learning
- Separation of the data: random, leave-one-out, cross-validation
- Indices: precision-recall, ROC
- Test: training set / test set
- Confusion matrix
- ROC analysis
- Prediction
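The separation schemes and indices listed above can be illustrated with a small standard-library sketch. The function names and the label lists are hypothetical; setting the number of folds equal to the number of examples gives leave-one-out. ROC analysis is not covered here.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair of labels."""
    return Counter(zip(actual, predicted))


def precision_recall(matrix, positive):
    """Precision and recall of one class, read from the confusion matrix."""
    tp = matrix[(positive, positive)]
    fp = sum(n for (a, p), n in matrix.items() if p == positive and a != positive)
    fn = sum(n for (a, p), n in matrix.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true labels and classifier outputs.
actual = ["soft", "none", "none", "soft", "none", "soft"]
predicted = ["soft", "none", "soft", "soft", "none", "none"]

matrix = confusion_matrix(actual, predicted)
precision, recall = precision_recall(matrix, "soft")
print(matrix[("soft", "soft")], matrix[("none", "soft")])  # 2 1
print(list(k_fold_indices(6, 3))[0])  # ([1, 2, 4, 5], [0, 3])
```

In this toy case two of the three "soft" predictions are correct and two of the three actual "soft" examples are found, so precision and recall are both 2/3.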

Association rules
- Extraction of association rules
- Visualization of association rules
- Frequent sets
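Frequent sets and the rules derived from them can be computed on a toy example with a level-wise search in the spirit of Apriori. This is a sketch, not the implementation used by any of the tools: the function names, the transactions, and the thresholds are all assumptions chosen for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for all frequent itemsets."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while candidates:
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        size += 1
        # Join step: unions of frequent sets that contain exactly `size` items.
        candidates = list({a | b for a in level for b in level if len(a | b) == size})
    return frequent


def association_rules(frequent, min_confidence):
    """Derive rules lhs -> rhs with confidence = support(itemset) / support(lhs)."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), confidence))
    return rules


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.8)
print(rules)  # [({'butter'}, {'bread'}, 1.0)]
```

Every subset of a frequent itemset is itself frequent, which is why the lookup `frequent[lhs]` in the rule step always succeeds.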

Specialized applications
- Bioinformatics
  - Genome data bases
  - Gene selection
  - Profiles
- Text mining
  - Text file
  - Preprocessing (TF.IDF, lemmatization, stemming, ...)
  - Bags of words
  - N-grams of characters
  - N-grams of words
  - Feature extraction
  - Distance
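The text-mining representations named above (bags of words, n-grams of characters and of words, TF.IDF) fit in a few lines of standard-library Python. The sketch below uses one common TF.IDF variant, relative term frequency multiplied by ln(N / document frequency); other weightings exist. The function names and the three tiny documents are assumptions for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All overlapping word n-grams, after lowercasing and whitespace splitting."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def tf_idf(documents):
    """One TF.IDF weight dictionary per document (bag-of-words model)."""
    bags = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(bags)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bag.items()
        })
    return weights


# Hypothetical mini-corpus.
documents = ["data mining tools", "text mining tools", "data visualization"]
weights = tf_idf(documents)

print(char_ngrams("weka", 2))                # ['we', 'ek', 'ka']
print(word_ngrams("Data Mining Tools", 2))   # ['data mining', 'mining tools']
print(weights[1]["text"] > weights[1]["mining"])  # True
```

In the second document, "text" occurs in no other document and therefore gets a higher weight than "mining", which is shared with the first document.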

Weka Written in Java

Weka http://www.cs.waikato.ac.nz/ml/weka/

Orange
University of Ljubljana, Slovenia
Programmed in Python
http://www.ailab.si/orange/