Introduction Predictive Analytics Tools: Weka




Introduction Predictive Analytics Tools: Weka
Predictive Analytics Center of Excellence
San Diego Supercomputer Center
University of California, San Diego

Tools Landscape
Considerations:
- Scale
- User interface: GUI, command line; interactive, batch, workflow
- Scope: data manipulation; methods and packages available (statistics, data mining, predictive analytics)
- Scoring
- Development

(Some) Available Data Mining Tools
COTS:
- IBM Intelligent Miner
- SAS Enterprise Miner
- Oracle ODM
- MicroStrategy
- Microsoft DBMiner
- Pentaho
- Matlab
- Teradata
Open source:
- WEKA
- KNIME
- Orange
- RapidMiner
- NLTK
- R
- Rattle
- h2o
- Mahout

Agenda
- WEKA intro and background
- Data preparation
- Creating models / applying algorithms
- Evaluating results

WEKA: Walkthrough with Iris
As presented by PACE

Download and Install WEKA
Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html

Content
- Intro and background
- Exploring WEKA
- Data preparation
- Creating models / applying algorithms
- Evaluating results

What is WEKA?
- Waikato Environment for Knowledge Analysis
- WEKA is a data mining/machine learning application developed by the Department of Computer Science, University of Waikato, New Zealand
- WEKA is open source software written in Java, issued under the GNU General Public License
- WEKA is a collection of tools for data pre-processing, classification, regression, clustering, association, and visualization
- WEKA is a collection of machine learning algorithms for data mining tasks
- WEKA is well-suited for developing new machine learning schemes
- The weka is also a bird found only in New Zealand

Advantages of Weka
- Free availability: released under the GNU General Public License
- Portability: fully implemented in the Java programming language, so it runs on almost any modern computing platform (Windows, Mac OS X and Linux)
- Comprehensive collection of data preprocessing and modeling techniques: supports the standard data mining tasks of data preprocessing, clustering, classification, regression, visualization, and feature selection
- Easy to use: GUI and command line
- Access to SQL databases: connects through Java Database Connectivity (JDBC) and can process the result returned by a database query

Disadvantages of Weka
- Sequence modeling is not covered by the algorithms included in the Weka distribution
- Not capable of multi-relational data mining
- Memory bound

KDD Process: How does WEKA fit in?
[Diagram] Data Selection / Transformation -> Data Preparation -> Training Data -> Data Mining -> Model, Patterns -> Evaluation / Verification

WEKA Walk Through: Main GUI
Three graphical user interfaces:
- The Explorer (exploratory data analysis): pre-process data, build classifiers, cluster data, find associations, attribute selection, data visualization
- The Experimenter (experimental environment): used to compare performance of different learning schemes
- The KnowledgeFlow (new process-model-inspired interface): Java-Beans-based interface for setting up and running machine learning experiments
Command line interface (Simple CLI)
More at: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html


WEKA Explorer: Preprocess
Importing data
- Data format: uses flat text files to describe the data
- Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary
- Data can also be read from a URL or from an SQL database (using JDBC)

WEKA: ARFF file format

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

A more thorough description is available at http://www.cs.waikato.ac.nz/~ml/weka/arff.html
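To make the ARFF layout above concrete, here is a minimal sketch of a reader for it, written in plain Python rather than with WEKA itself. The function name `parse_arff` and the shortened sample are illustrative assumptions. It handles only the simple dense case shown on the slide (numeric and nominal attributes, `?` for missing values), not the full ARFF specification.

```python
def parse_arff(text):
    """Parse a small, dense ARFF document into (relation, attributes, rows).

    Illustrative only: no quoting, no sparse format, no date/string types.
    """
    relation = None
    attributes = []   # (name, kind): kind is "numeric" or a list of nominal values
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for value, (_, kind) in zip(values, attributes):
                if value == "?":               # "?" marks a missing value
                    row.append(None)
                elif kind == "numeric":
                    row.append(float(value))
                else:
                    row.append(value)
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):           # nominal: {a, b, c}
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Shortened version of the heart-disease example from the slide.
SAMPLE = """\
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute cholesterol numeric
@attribute class { present, not_present}
@data
63,male,233,not_present
38,female,?,not_present
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(rows)
```

The header declares each attribute once, so every data row can be converted by position: numeric columns become floats, nominal columns stay as strings, and `?` becomes `None`.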

Open Iris Dataset (University of Waikato)


WEKA Explorer: Preprocess
Preprocessing data
- Visualization
- Filtering algorithms: filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria
- Removing noisy data
- Adding additional attributes
- Removing attributes


WEKA Explorer: Preprocess
- Used to define filters that transform the data
- WEKA contains filters for discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.
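As a rough illustration of what two of these filters do, the sketch below applies min-max normalization and equal-width discretization to one numeric attribute. It is plain Python, not WEKA's filter code. The function names and the cholesterol-like values are made up for the example, and WEKA's own filters offer more options than shown here.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins intervals of equal width.

    Returns the bin index (0 .. n_bins - 1) of each value.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


cholesterol = [200.0, 250.0, 300.0]   # hypothetical values
print(min_max_normalize(cholesterol))
print(equal_width_bins(cholesterol, 4))
```

Normalization keeps the attribute numeric but puts it on a common scale. Discretization turns it into a nominal attribute, which schemes that work only with discrete data can then use.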


WEKA Explorer: Building classifiers
- Classifiers in WEKA are models for predicting nominal or numeric quantities
- Implemented learning schemes include: decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, ...
- Meta-classifiers include: bagging, boosting, stacking, error-correcting output codes, locally weighted learning, ...
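To give a feel for one of the scheme families listed above, here is a minimal sketch of an instance-based (k-nearest-neighbour) classifier with a leave-one-out accuracy estimate. It is plain Python and not WEKA's implementation. The function names and the iris-like measurements are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a nominal class by majority vote among the k nearest training instances.

    train: list of (feature_tuple, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k=3):
    """Hold out each instance in turn, predict it from the rest, and report accuracy."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) == label:
            correct += 1
    return correct / len(data)


# Made-up (petal length, petal width) measurements, loosely iris-like.
data = [
    ((1.4, 0.2), "setosa"),
    ((1.3, 0.2), "setosa"),
    ((1.5, 0.3), "setosa"),
    ((4.5, 1.5), "versicolor"),
    ((4.7, 1.4), "versicolor"),
    ((4.4, 1.3), "versicolor"),
]

print(knn_predict(data, (1.6, 0.2)))
print(leave_one_out_accuracy(data))
```

The classifier stores the training instances and defers all work to prediction time. The evaluation step follows the same idea as testing on held-out data: every prediction is made for an instance the model did not see.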


WEKA Explorer: Clustering
- WEKA contains clusterers for finding groups of similar instances in a dataset
- Implemented schemes are: k-means, EM, Cobweb, X-means, FarthestFirst
- Clusters can be visualized and compared to "true" clusters (if given)
- Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
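Of the schemes listed, k-means is the simplest to sketch. The version below is plain Python, not WEKA's code. It assumes a naive deterministic initialization (the first k points become the starting centroids) and a fixed number of iterations so that the example is reproducible. The points are made up.

```python
import math


def k_means(points, k, iterations=10):
    """Basic k-means: alternate between assigning points and recomputing centroids.

    Assumption for this sketch: the first k points are used as initial centroids.
    Returns (centroids, assignment), where assignment[i] is the cluster of points[i].
    """
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignment


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignment = k_means(points, k=2)
print(assignment)
print(centroids)
```

When true class labels are available, the `assignment` list can be compared against them, which is the comparison to "true" clusters mentioned above.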

Explorer: Finding associations
- WEKA contains an implementation of the Apriori algorithm for learning association rules
- Works only with discrete data
- Can identify statistical dependencies between groups of attributes: milk, butter => bread, eggs (with confidence 0.9 and support 2000)
- Apriori can compute all rules that have a given minimum support and exceed a given confidence
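The two measures named above can be computed directly. The sketch below is plain Python with made-up transactions, and it is not the Apriori search itself. It only shows how the support count and confidence of a single rule such as milk, butter => bread, eggs are defined.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    ante = support_count(transactions, antecedent)
    both = support_count(transactions, antecedent | consequent)
    return both / ante if ante else 0.0


# Hypothetical market-basket data.
transactions = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter"},
    {"bread"},
    {"milk", "butter", "bread", "eggs"},
]

lhs = {"milk", "butter"}
rhs = {"bread", "eggs"}
print(support_count(transactions, lhs | rhs))   # transactions containing all four items
print(confidence(transactions, lhs, rhs))
```

Apriori uses the minimum-support threshold to prune the search: an itemset can only be frequent if all of its subsets are frequent, so rules are generated only from itemsets that pass that test.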

Explorer: Attribute Selection
- Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
- Attribute selection methods contain two parts:
  - A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  - An evaluation method: correlation-based, wrapper, information gain, chi-squared, ...
- Very flexible: WEKA allows (almost) arbitrary combinations of these two
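As one concrete pairing of the two parts, the sketch below uses information gain as the evaluation method and plain ranking as the search method. It is plain Python on a tiny invented dataset, and the function and attribute names are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical nominal attributes and class labels.
labels = ["yes", "yes", "no", "no"]
columns = {
    "attr_a": ["x", "x", "y", "y"],   # separates the classes perfectly
    "attr_b": ["x", "y", "x", "y"],   # carries no information about the class
}

# "Ranking" search: score each attribute independently and sort by score.
ranking = sorted(columns, key=lambda name: information_gain(columns[name], labels), reverse=True)
print(ranking)
```

Ranking scores attributes one at a time, so it is cheap but cannot detect attributes that are only useful in combination. Subset searches such as best-first address that at higher cost.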

Explorer: Visualize
- Visualization is very useful in practice: e.g., it helps to determine the difficulty of the learning problem
- WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
- To do: rotating 3-d visualizations (Xgobi-style)
- Color-coded class values
- Jitter option to deal with nominal attributes (and to detect "hidden" data points)
- Zoom-in function


Using Weka on Gordon
- Use GUI and command line
- Use the GUI on login nodes to create the command line
- Use the command line to run interactive or batch jobs on production nodes

Weka GUI
To launch the Weka GUI:
- From a Windows machine: running software on a remote machine with a GUI requires a secure shell with X forwarding enabled to establish the remote connection, and an X server to handle the local display. Suggested software: PuTTY and Xming
  http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
  http://www.straightrunning.com/xmingnotes/
- Linux and Mac OS X support X forwarding; Mac users need to run Applications > Utilities > Xterm
Connect to the login node:
  ssh -Y username@gordon-login2.sdsc.edu
Load the weka module:
  module load weka
At the command prompt:
  weka

Serial/Single-node script
  java -jar weka.jar
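For batch work, a classifier is typically run by naming its class on the Java classpath instead of launching the GUI jar. The sketch below only assembles such a command line in Python. The choice of J48, the file name `iris.arff`, the jar location and the helper name `weka_command` are assumptions for illustration, and the exact command for a given job is best copied from the GUI as described above. `-t` names the training file and `-x` the number of cross-validation folds.

```python
import shlex


def weka_command(classifier, train_file, folds=10, jar="weka.jar"):
    """Build the argument list for running a WEKA classifier from the command line.

    Assumptions: weka.jar is in the working directory and `java` is on the PATH.
    """
    return ["java", "-cp", jar, classifier, "-t", train_file, "-x", str(folds)]


cmd = weka_command("weka.classifiers.trees.J48", "iris.arff")
print(shlex.join(cmd))

# To actually run it inside a job script, something like the following could be used:
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list of arguments avoids shell-quoting problems when file names contain spaces.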

References and Resources
References:
- WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
- WEKA Tutorial:
  - Machine Learning with WEKA: a presentation demonstrating all graphical user interfaces (GUI) in Weka
  - A presentation which explains how to use Weka for exploratory data mining
- WEKA Data Mining Book: Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
- WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/main_page

the end!