A Comparison of Leading Data Mining Tools



Similar documents
Enterprise Miner (EM) Intelligent Miner for Data (IM) Pattern Recognition Workbench (PRW)

Introduction to Data Mining

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Make Better Decisions Through Predictive Intelligence

Polynomial Neural Network Discovery Client User Guide

Leveraging Ensemble Models in SAS Enterprise Miner

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Data Mining: Overview. What is Data Mining?

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

Predictive Analytics Powered by SAP HANA. Cary Bourgeois Principal Solution Advisor Platform and Analytics

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

Knowledge Discovery and Data Mining

Data Mining. SPSS Clementine Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Data Mining Practical Machine Learning Tools and Techniques

IBM SPSS Modeler Professional

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Course Syllabus. Purposes of Course:

Data Mining + Business Intelligence. Integration, Design and Implementation

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

IBM SPSS Modeler Professional

Data Mining mit der JMSL Numerical Library for Java Applications

Prerequisites. Course Outline

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

KnowledgeSEEKER Marketing Edition

2015 Workshops for Professors

CUSTOMER Presentation of SAP Predictive Analytics

Data Mining. Nonlinear Classification

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

A fast, powerful data mining workbench designed for small to midsize organizations

Application of SAS! Enterprise Miner in Credit Risk Analytics. Presented by Minakshi Srivastava, VP, Bank of America

DATA MINING - SELECTED TOPICS

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

How to Optimize Your Data Mining Environment

Knowledge Discovery from patents using KMX Text Analytics

An Overview and Evaluation of Decision Tree Methodology

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

An In-Depth Look at In-Memory Predictive Analytics for Developers

APPROACHABLE ANALYTICS MAKING SENSE OF DATA

L25: Ensemble learning

Data Mining for Fun and Profit

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

Ensembles and PMML in KNIME

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION

CART 6.0 Feature Matrix

Knowledge Discovery and Data Mining

The Prophecy-Prototype of Prediction modeling tool

Principles of Data Mining by Hand&Mannila&Smyth

CS590D: Data Mining Chris Clifton

An Introduction to Data Mining

A THREE-TIERED WEB BASED EXPLORATION AND REPORTING TOOL FOR DATA MINING

IBM SPSS Modeler 15 In-Database Mining Guide

Why Ensembles Win Data Mining Competitions

Chapter 6. The stacking ensemble approach

How To Perform An Ensemble Analysis

Get to Know the IBM SPSS Product Portfolio

A Review of Data Mining Techniques

Fast and Easy Delivery of Data Mining Insights to Reporting Systems

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

A Visualization is Worth a Thousand Tables: How IBM Business Analytics Lets Users See Big Data

Supervised Learning (Big Data Analytics)

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page Analytics and Data Mining 1

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

A New Approach for Evaluation of Data Mining Techniques

Database Marketing, Business Intelligence and Knowledge Discovery

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC

Decision Trees from large Databases: SLIQ

Data Mining Methods: Applications for Institutional Research

not possible or was possible at a high cost for collecting the data.

An Overview of Knowledge Discovery Database and Data mining Techniques

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

IBM SPSS Modeler 14.2 In-Database Mining Guide

Advanced analytics at your hands

Classification of Bad Accounts in Credit Card Industry

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

IBM Cognos Business Intelligence Scorecarding

Data Mining and Visualization

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining - Evaluation of Classifiers

Gerry Hobbs, Department of Statistics, West Virginia University

Introduction Predictive Analytics Tools: Weka

Making confident decisions with the full spectrum of analysis capabilities

Transcription:

A Comparison of Leading Data Mining Tools John F. Elder IV & Dean W. Abbott Elder Research Fourth International Conference on Knowledge Discovery & Data Mining Friday, August 28, 1998 New York, New York

Copyright 1998, John F. Elder IV and Dean W. Abbott All rights reserved Manufactured in the United States of America 1998 Elder Research T8-2

Contacting Elder Research http://www.datamininglab.com Dr. John F. Elder IV 1006 Wildmere Place Charlottesville, VA 22901 Dean W. Abbott 3443 Villanova Avenue San Diego, CA 92122-2310 elder@datamininglab.com 804-973-7673 Fax: 804-995-0064 dean@datamininglab.com 619-450-0313 1998 Elder Research T8-3

Tutorial Goals Compare and Summarize Data Mining Tools which: Offer multiple modeling and classification algorithms Support project stages surrounding model construction Stand alone Are general-purpose Cost a lot We could get our hands on Include some (focused) Desktop Tools Other Reports: Two Crows, Aberdeen Group, Elder Research (forthcoming ), Data Mining Journal 1998 Elder Research T8-4

Topics Products covered Review of algorithms Comparative tables of properties Screen shots exemplifying qualities Summary of distinctives 1998 Elder Research T8-5

Caveats We don t know every tool well (and are sure to have missed some!) Level of exposure noted for each tool Our background (biasing our perspective) Very technical, early adopters Emphasize solving real-world applications More classification than estimation Field of tools is quite dynamic New versions appear regularly 1998 Elder Research T8-6

Data Mining Products PRW Model 1 1998 Elder Research T8-7

Tools Evaluated Product Company URL Version Tested Our Experience Clementine Integral Solutions, Integral Ltd. Solutions, Ltd. http://www.isl.co.uk/clem.html 4 Moderate Darwin Thinking Machines, Corp. Corp. http://www.think.com/html/products/products.htm 3.0.1 Moderate DataCruncher DataMind http://www.datamindcorp.com 2.1.1 High Enterprise Miner SAS Institute http://www.sas.com/software/components/miner.html Beta Moderate GainSmarts Urban Science http://www.urbanscience.com/main/gainpage.htm 4.0.3 Low Intelligent Miner IBM http://www.software.ibm.com/data/iminer/ 2 Low MineSet Silicon Graphics, Inc. Inc. http://www.sgi.com/products/software/mineset/ 2.5 Low Model 1 Group 11/Unica Technologies http://www.unica-usa.com/model1.htm 3.1 Moderate ModelQuest AbTech Corp. http://www.abtech.com 1 Moderate PRW Unica Technologies, Inc. http://www.unica-usa.com/prodinfo.htm 2.1 High CART Salford Systems http://www.salford-systems.com 3.5 Moderate NeuroShell Ward Systems Group, Inc. http://www.wardsystems.com/neuroshe.htm 3 Moderate OLPARS PAR Government Systems mailto://olpars@partech.com 8.1 High Scenario Cognos http://www.cognos.com/busintell/products/index.html 2 Moderate See5 RuleQuest Research http://www.rulequest.com/see5-info.html 1.07 Moderate S-Plus MathSoft http://www.mathsoft.com/splus/ 4 High WizWhy WizSoft http://www.wizsoft.com/why.html 1.1 Moderate 1998 Elder Research T8-8

Categories for Comparisons Platforms Supported Algorithms Included Decision Trees Neural Networks Other Data Input and Model Output Options Usability Ratings Visualization Capabilities Modeling Automation Methods 1998 Elder Research T8-9

Platforms PC Standalone (95/NT) Unix Standalone Unix Server / PC Client 1998 Elder Research T8-10 NT Server / PC Client Database Connectivity Clementine + Darwin DataCruncher Enterprise Miner + GainSmarts Intelligent Miner MineSet Model 1 ModelQuest PRW CART + Scenario NeuroShell OLPARS See5 + S-Plus WizWhy Key blank no capability some capability good capability + excellent capability

Tool Groupings Desktop PC (standalone) Flat Files One or Two Algorithms High End Multiple Platforms, Client- Server Flat Files or Direct Database Access Multiple Algorithm Types Data Fits into RAM Large Databases 1998 Elder Research T8-11

End User Perspectives Business Intuitive Interface Clear steps in data mining process Non-technical terminology Familiar environment Descriptive Reporting Domain terminology Graphical representations Technical Algorithm Options Knobs to enhance model performance Model Automation Simplify model design cycle Documentation of steps used in generating models (repeatability) 1998 Elder Research T8-12

OLPARS: Interface 1998 Elder Research T8-13

MineSet: Interface 1998 Elder Research T8-14

PRW: Experiment Manager 1998 Elder Research T8-15

Intelligent Miner: Explorer Interface 1998 Elder Research T8-16

Enterprise Miner: Visual Interface 1998 Elder Research T8-17

Clementine: Visual Interface 1998 Elder Research T8-18

DataCruncher: Process Flow Diagram 1998 Elder Research T8-19

Data Input & Model Output Automatic Header Save Data Format ODBC Native Database Drivers Summary Reports Output Source Code Clementine Darwin DataCruncher Enterprise Miner GainSmarts Intelligent Miner MineSet Model 1 ModelQuest PRW CART Scenario NeuroShell OLPARS See5 S-Plus WizWhy 1998 Elder Research T8-20

PRW: Row Splitting 1998 Elder Research T8-21

Model1: Variable Selection 1998 Elder Research T8-22

Model1: Data Definitions 1998 Elder Research T8-23

Scenario : Special Data Types 1998 Elder Research T8-24

Decision Trees n a > 4 y n b>3.5 y n b > 2 y n a > 1 y 2 2 3 1 0 1998 Elder Research T8-25

Polynomial Networks Z 17 = 3.1 + 0.4a -.15b 2 + 0.9bc - 0.62abc + 0.5c 3 Layer 0 (Normalizers) Layer 1 a N1 z1 z9 Double 16 k N9 z6 Single 14 f N6 d N4 z4 Triple 17 z16 z14 z17 Layer 2 Multi- Linear 15 z15 Layer 3 Triple 21 z21 Unitizers U2 Y1 h N8 z8 Double 20 z20 Double 19 z19 U7 Y2 e N5 z5 1998 Elder Research T8-26

Consensus Models Parametrically Summarize Data Points orders, terms Regression Polynomial Networks (e.g. GMDH, ASPN) Decision Trees (e.g., CART, CHAID, C5) Logistic or Sigmoidal Networks (ANNs) Hinging Hyperplanes, MARS 1998 Elder Research T8-27

Consensus Models (continued) orientation, bin width Histogram function Radial Basis Function family, order Wavelets 1998 Elder Research T8-28

Contributory Models retain data points; each potentially affects estimate at new point shape, spread Kernels k, distance metric k-nearest Neighbor Goal, iterations Delaunay Planes Spread, index Projection Pursuit Regression 1998 Elder Research T8-29

Properties of Algorithms Algorithm Accurate Scalable Interpretable Useable Robust Versatile Fast Hot Classical (LR, LDA) Neural Networks Visualization Decision Trees Polynomial Networks K-Nearest Neighbors Kernels Key good neutral bad 1998 Elder Research T8-30

Algorithms Decision Trees Linear/Statistical Multi-layer Perceptrons Nearest Neighbor Radial Basis Functions Bayes Rule Induction Polynomial Networks Generalized Linear Models Time Series Sequential Discovery K Means Association Rules Kohonen Clementine Darwin Datamind Enterprise Miner GainSmarts + Intelligent Miner + MineSet Model 1 + ModelQuest PRW + CART Cognos NeuroShell + OLPARS See5 SPlus + WizWhy 1998 Elder Research T8-31

Multi-Layer Perceptrons Learning Rate Learning Rate Decay Momentum Multiple Activation Functions Multiple Stop Criteria Cross-Validation Normalize Inputs Advanced Learning Alg. Other Cost functions Automatic Model Selection Network Visual Parameter Summary Clementine Darwin Enterprise Miner Intelligent Miner Model 1 PRW NeuroShell OLPARS 1998 Elder Research T8-32

Enterprise Miner: Neural Network Training 1998 Elder Research T8-33

PRW: Neural Network Training 1998 Elder Research T8-34

Decision Trees "CART" C5 or C4.5 CHAID Other Priors Classification Costs Missing Data Pruning Severity Visual Trees Clementine Darwin Enterprise Miner + GainSmarts Intelligent Miner MineSet Model 1 ModelQuest CART + Scenario S-Plus See5 + 1998 Elder Research T8-35

CART: Tree Browsing 1998 Elder Research T8-36

Scenario: Tree Browsing 1998 Elder Research T8-37

Enterprise Miner: Tree Browsing 1998 Elder Research T8-38

Enterprise Miner: Tree Results 1998 Elder Research T8-39

MineSet: Tree Browser 1998 Elder Research T8-40

Regression / Stats Linear Logistic Clementine Complexity Penalty Cross- Validation Input Selection Factor Analysis Clementine Y Enterprise Miner + + GainSmarts + + Intelligent Miner MineSet Model 1 + ModelQuest Enterprise PRW S-Plus + + S-Plus + + Scenario 1998 Elder Research T8-41

MineSet: Bayes Distributions 1998 Elder Research T8-42

Enterprise Miner: Clustering Results 1998 Elder Research T8-43

Intelligent Miner: Clustering Results 1998 Elder Research T8-44

Intelligent Miner: Cluster Explode 1998 Elder Research T8-45

Usability Data Loading and Manipulation Model Building Model Understanding Technical Support Overall Clementine + + + + + Darwin + DataCruncher + + Enterprise Miner GainSmarts + Intelligent Miner MineSet + + + Model 1 + + + + + ModelQuest Enterprise + + + + PRW + + + + + CART Scenario + + + NeuroShell OLPARS See5 S-Plus + WizWhy + 1998 Elder Research T8-46

Intelligent Miner: Statistics Report 1998 Elder Research T8-47

Scenario: Result Reporting 1998 Elder Research T8-48

WizWhy: Reporting 1998 Elder Research T8-49

DataCruncher: Output Sensitivities 1998 Elder Research T8-50

Darwin: Predictions 1998 Elder Research T8-51

Darwin: Lift Chart (Excel) 1998 Elder Research T8-52

CART: Gains Chart 1998 Elder Research T8-53

Visualization Histograms Pie Charts Scatter/ Line Plots Rotating Scatter Conditional Plots Classification Decision Regions Correlation Plots Clementine Darwin DataCruncher Enterprise Miner GainSmarts Intelligent Miner MineSet Model 1 ModelQuest Enterprise PRW CART Scenario NeuroShell OLPARS See5 S-Plus WizWhy 1998 Elder Research T8-54

Clementine: Visualization User-created sub-region 1998 Elder Research T8-55

OLPARS: Visualization 1998 Elder Research T8-56

OLPARS: Decision Space 1998 Elder Research T8-57

Model 1: Target Sensitivities 1998 Elder Research T8-58

MineSet: Geographical Visualization 1998 Elder Research T8-59

Automation Method of Automation Free Text Annotation of Steps Clementine Visual Programming, Programming Language Darwin Programming Language DataCruncher (Task manager) Enterprise Miner Visual Programming, Programming Language GainSmarts Macro Language, Wizards Intelligent Miner (Wizards) MineSet Data History, Log Model 1 Model Wizard ModelQuest Batch Agenda PRW Experiement Manager; Macros CART Scenario NeuroShell OLPARS See5 S-Plus WizWhy Built-in Basic Scripting Scripting (S); C/C++ 1998 Elder Research T8-60

PRW: Experiment Manager 1998 Elder Research T8-61

Model 1: Model Summary 1998 Elder Research T8-62

A Recent Breakthrough: Bundling 1) Construct varied models, and 2) Combine their estimates Generate component models by varying: Case Weights Data Values Guiding Parameters Variable Subsets Combine estimates using: Estimator Weights Voting Advisor Perceptrons Partitions of Design Space 1998 Elder Research T8-63

Example Bundling Techniques Bayes: sum estimates of possible models, weighted by priors GMDH (Ivakhenko 68) -- multiple layers of quadratic polynomials, using two inputs each, fit by LR Stacking (Wolpert 92) -- train a 2nd-level (LR) model using leave-1-out estimates of 1st-level (neural net) models Bagging (Breiman 96) (bootstrap aggregating) -- bootstrap data (to build trees mostly); take majority vote or average Bumping (Tibshirani 97) -- bootstrap, select single best Boosting (Freund & Shapire 96) -- weight error cases by βτ = (1-e(t))/e(t), iteratively re-model; weight model t by ln(βτ) Crumpling (Anderson & Elder 98) -- average cross-validations Born-Again (Breiman 98) -- invent new X data... 1998 Elder Research T8-64

Distinctives Strengths Weaknesses Clementine visual interface; algorithm breadth scalability Darwin efficient client-server; intuitive interface options no unsupervised; limited visualization DataCruncher ease of use single algorithm Enterprise Miner depth of algorithms; visual interface harder to use; new product issues GainSmarts data transformations, built on SAS; algorithm option depth no unsupervised; limited visualization Intelligent Miner algorithm breadth; graphical tree/cluster output few algorithm options; no automation MineSet data visualization few algorithms; no model export Model 1 ease of use; automated model discovery really a vertical tool ModelQuest breadth of algorithms some non-intuitive interface options PRW extensive algorithms; automated model selection limited visualization CART depth of tree options difficult file I/O; limited visualization Scenario ease of use narrow analysis path NeuroShell multiple neural network architectures unorthodox interface; only neural networks OLPARS multiple statistical algorithms; class-based visualization dated interface; difficult file I/O See5 depth of tree options limited visualization; few data options S-Plus depth of algorithms; visualization; programable/extendable limited inductive methods; steep learning curve WizWhy ease of use; ease of model understanding limited visualization 1998 Elder Research T8-65

Closing Observations Data Mining Tools Can: Enhance inference process Speed up design cycle Data Mining Tools Can Not: Substitute for statistical and domain expertise Users are advised to: Get training on tools Be alert for product upgrades 1998 Elder Research T8-66

Index to Tools (Page numbers in italics refer to screen captures) Clementine 7, 8, 10, 18, 20, 31, 32, 35, 41, 46, 54, 55, 60, 65 Darwin 7, 8, 10, 20, 31, 32, 35, 46, 51, 52, 54, 60, 65 DataCruncher 7, 8, 10, 19, 20, 31, 46, 50, 54, 60, 65 Enterprise Miner 7, 8, 10, 17, 20, 31, 32, 33, 35, 38, 39, 41, 43, 46, 54, 60, 65 Gainsmarts 7, 8, 10, 20, 31, 35, 41, 46, 54, 60, 65 Intelligent Miner 7, 8, 10, 16, 20, 31, 32, 35, 41, 44, 45, 46, 47, 54, 60, 65 MineSet 7, 8, 10, 14, 20, 31, 35, 40, 41, 42, 46, 54, 59, 60, 65 Model1 7, 8, 10, 20, 22, 23, 31, 32, 35, 41, 46, 54, 58, 60, 62, 65 ModelQuest 7, 8, 10, 20, 31, 35, 41, 46, 54, 60, 65 PRW 7, 8, 10, 15, 20, 21, 31, 32, 34, 41, 46, 54, 60, 61, 65 CART 7, 8, 10, 20, 31, 35, 36, 46, 53, 54, 60, 65 Neuroshell 7, 8, 10, 20, 31, 32, 46, 54, 60, 65 pcolpars 7, 8, 10, 13, 20, 31, 32, 46, 54, 56, 57, 60, 65 Scenario 7, 8, 10, 20, 24, 31, 35, 37, 41, 46, 48, 54, 60, 65 See5 7, 8, 10, 20, 31, 35, 46, 54, 60, 65 S-Plus 7, 8, 10, 20, 31, 35, 41, 46, 54, 60, 65 WizWhy 7, 8, 10, 20, 31, 46, 49, 54, 60, 65 1998 Elder Research T8-67

Forthcoming Report Report provides detailed comparison of high-end data mining tools, including capabilities, ease of use, and practical tips. Available for $695 from Elder Research (http://www.datamininglab.com), Q4 1998. Purchasers receive brief free consulting session to explore report findings in more detail, if desired. Note: The analyses and reviews were performed completely independently, and were made possible by the cooperation of the vendors, for which Elder Research is very grateful. The companies, however, provided no financial support, and had no influence on its editorial content. 1998 Elder Research T8-68