Machine Learning and Data Mining An Introduction with WEKA

AHPCRC Workshop - 8/18/10 - Dr. Martin
Based on slides by Gregory Piatetsky-Shapiro from KDnuggets: http://www.kdnuggets.com/data_mining_course/

Some review
- What are we doing?
- Data mining
- And a really brief intro to machine learning

Finding patterns
- Goal: programs that detect patterns and regularities in the data
- Strong patterns lead to good predictions
  - Problem 1: most patterns are not interesting
  - Problem 2: patterns may be inexact (or spurious)
  - Problem 3: data may be garbled or missing

Machine learning techniques
- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly
  - Can be used to predict the outcome in a new situation
  - Can be used to understand and explain how a prediction is derived (may be even more important)
- Methods originate from artificial intelligence, statistics, and research on databases
witten&eibe

Can machines really learn?
- Definitions of "learning" from the dictionary:
  - To get knowledge of by study, experience, or being taught (difficult to measure)
  - To become aware by information or from observation (difficult to measure)
  - To commit to memory (trivial for computers)
  - To be informed of, ascertain; to receive instruction (trivial for computers)
- Operational definition:
  - Things learn when they change their behavior in a way that makes them perform better in the future
- Does a slipper learn?
- Does learning imply intention?

Classification
- Learn a method for predicting the instance class from pre-labeled (classified) instances
- Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ...
- Given a set of points from classes, what is the class of a new point?

Classification: Linear Regression
- Linear regression: w_0 + w_1 x + w_2 y >= 0
- Regression computes the w_i from data to minimize squared error to fit the data
- Not flexible enough
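As a rough illustration of the slide above, the following sketch fits w_0 + w_1 x + w_2 y to class labels coded as +1 and -1 by least squares, then classifies a point by the sign of the fitted value. The toy points, the +1/-1 coding, and the function names are assumptions made for this example; they are not taken from the slides.

```python
def solve(a, b):
    """Solve the linear system a*w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(points, labels):
    """Return [w0, w1, w2] minimizing sum((w0 + w1*x + w2*y - label)^2)."""
    rows = [[1.0, x, y] for x, y in points]
    # Normal equations: (A^T A) w = A^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, labels)) for i in range(3)]
    return solve(ata, atb)


def linear_classify(w, x, y):
    """Class is decided by which side of the line w0 + w1*x + w2*y = 0 the point is on."""
    return "blue" if w[0] + w[1] * x + w[2] * y >= 0 else "green"


# Hypothetical toy data: green points (-1) lower left, blue points (+1) upper right.
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
labels = [-1, -1, -1, 1, 1, 1]
w = fit(points, labels)
print(linear_classify(w, 6, 6), linear_classify(w, 0, 0))
```

A single straight line separates this toy data, but it cannot describe regions such as the nested rectangles a decision tree produces, which is the sense in which the method is "not flexible enough".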

Classification: Decision Trees
  if X > 5 then blue
  else if Y > 3 then blue
  else if X > 2 then green
  else blue
[Figure: X-Y plane with splits at X = 2, X = 5, and Y = 3]
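The nested tests on this slide translate directly into code. A minimal sketch follows; the function name is made up for this example.

```python
def tree_classify(x, y):
    """Walk the decision tree from the slide: each test splits the X-Y plane."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"


# One sample point per leaf of the tree.
for point in [(6, 0), (3, 4), (3, 1), (1, 1)]:
    print(point, tree_classify(*point))
```

The only green region is the rectangle 2 < X <= 5, Y <= 3; every other point is blue.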

Classification: Neural Nets
- Can select more complex regions
- Can be more accurate
- Can also overfit the data: find patterns in random noise

Built-in Data Sets
- Weka comes with some built-in data sets
- Described in chapter 1
- We'll start with the Weather Problem
  - Toy (very small)
  - Data is entirely fictitious

But First
- Components of the input:
  - Concepts: kinds of things that can be learned
    - Aim: intelligible and operational concept description
  - Instances: the individual, independent examples of a concept
    - Note: more complicated forms of input are possible
  - Attributes: measuring aspects of an instance
    - We will focus on nominal and numeric ones

What's in an attribute?
- Each instance is described by a fixed predefined set of features, its "attributes"
- But: the number of attributes may vary in practice
  - Possible solution: "irrelevant value" flag
- Related problem: the existence of an attribute may depend on the value of another one
- Possible attribute types ("levels of measurement"):
  - Nominal, ordinal, interval and ratio

What's a concept?
- Data Mining Tasks (styles of learning):
  - Classification learning: predicting a discrete class
  - Association learning: detecting associations between features
  - Clustering: grouping similar instances into clusters
  - Numeric prediction: predicting a numeric quantity
- Concept: thing to be learned
- Concept description: output of learning scheme

The weather problem

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     mild         normal    false  yes
rainy     mild         normal    true   no
overcast  mild         normal    true   yes
sunny     mild         high      false  no
sunny     mild         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Given past data, can you come up with the rules for Play/Not Play?
What is the game?

The weather problem
Given this data, what are the rules for play/not play?

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes

The weather problem
Conditions for playing

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
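The rules on this slide form an ordered list: the first rule whose condition holds decides the outcome. The sketch below applies them to the 14 weather instances. It assumes the entries that are blank in this transcript are "false" (windy) and "yes" (play), as in the standard Weka weather data; the function and variable names are made up for the example.

```python
# (outlook, humidity, windy, play) for the 14 weather instances.
# Temperature is omitted because none of the rules test it.
WEATHER = [
    ("sunny", "high", False, "no"),
    ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"),
    ("rainy", "high", False, "yes"),
    ("rainy", "normal", False, "yes"),
    ("rainy", "normal", True, "no"),
    ("overcast", "normal", True, "yes"),
    ("sunny", "high", False, "no"),
    ("sunny", "normal", False, "yes"),
    ("rainy", "normal", False, "yes"),
    ("sunny", "normal", True, "yes"),
    ("overcast", "high", True, "yes"),
    ("overcast", "normal", False, "yes"),
    ("rainy", "high", True, "no"),
]


def play_rule(outlook, humidity, windy):
    """Apply the rules in order; the first rule that fires decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above


errors = [row for row in WEATHER if play_rule(*row[:3]) != row[3]]
print(len(errors), "of", len(WEATHER), "instances misclassified")
```

Under these assumptions the rule list reproduces every instance in the table. Order matters: read on its own, the rule "humidity = normal then play = yes" would misclassify the rainy, windy instance with normal humidity.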

Weather data with mixed attributes

Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

Weather data with mixed attributes
How will the rules change when some attributes have numeric values?

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes

Weather data with mixed attributes
Rules with mixed attributes

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
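One way to see where a numeric test such as "humidity > 83" can come from is to try cut points between neighbouring humidity values and count the mistakes each one makes. The sketch below does this for the five sunny instances of the numeric weather data. It assumes the blank Play entries in this transcript are "yes"; the midpoint heuristic and the function names are illustrative, not a description of how any particular Weka learner chooses thresholds.

```python
# (humidity, play) for the five sunny instances of the numeric weather data.
SUNNY = [(85, "no"), (90, "no"), (95, "no"), (70, "yes"), (70, "yes")]


def rule_errors(threshold, rows):
    """Count mistakes of the rule: humidity > threshold -> no, otherwise yes."""
    return sum(("no" if h > threshold else "yes") != play for h, play in rows)


def best_threshold(rows):
    """Return the candidate cut point with the fewest mistakes."""
    values = sorted({h for h, _ in rows})
    # Candidate cut points: midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: rule_errors(t, rows))


t = best_threshold(SUNNY)
print("best cut:", t, "errors:", rule_errors(t, SUNNY))
print("errors with the slide's cut of 83:", rule_errors(83, SUNNY))
```

On this data any threshold from 70 up to (but not including) 85 separates the sunny instances perfectly, so the slide's value of 83 and the midpoint found here give the same predictions.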

Some fun with WEKA
- Open WEKA, preferably in Linux
- We need to find the data file:
    find . -name \*arff -ls
- May want to copy it into an easier place to get to:
    gunzip *.gz
- Take a look at the file format

The ARFF format

%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
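To make the file layout concrete, here is a minimal sketch of a reader for files like the one on this slide. It handles only comment lines, @relation, nominal and numeric @attribute declarations, and comma-separated @data rows; quoting, missing values, and the other features of the real format are deliberately left out. The function name and the sample text (with "false" and "yes" filled in where this transcript is blank) are assumptions for the example.

```python
def parse_arff(text):
    """Parse a small ARFF file into (relation, attributes, rows).

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of the allowed nominal values.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([
                float(v) if kind == "numeric" else v
                for v, (_, kind) in zip(values, attributes)
            ])
        elif line.lower().startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif line.lower().startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif line.lower().startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, [name for name, _ in attributes], rows[0])
```

The header declares each attribute once, with its type or its set of allowed values, and every line after @data is one instance with values in the same order as the declarations.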

Open Weka Explorer
- Open file
- Choose weather.arff
- Note that if you have a file in .csv format (e.g. from Excel), it can be opened and will be automatically converted to .arff format
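The conversion Weka performs when it opens a .csv file is not described in these slides. The sketch below only illustrates the general idea under simple assumptions: the first CSV row holds attribute names, a column whose values all parse as numbers becomes numeric, and any other column becomes nominal with its distinct values listed. It is not Weka's converter, and the function names are made up for the example.

```python
import csv
import io


def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text whose first row holds the attribute names."""
    rows = [r for r in csv.reader(io.StringIO(csv_text), skipinitialspace=True) if r]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal: list the distinct values in order of first appearance.
            values = list(dict.fromkeys(column))
            lines.append("@attribute %s {%s}" % (name, ", ".join(values)))
    lines.append("@data")
    lines.extend(", ".join(row) for row in data)
    return "\n".join(lines) + "\n"


SAMPLE_CSV = """\
outlook,temperature,humidity,windy,play
sunny,85,85,false,no
sunny,80,90,true,no
overcast,83,86,false,yes
"""

arff = csv_to_arff(SAMPLE_CSV, "weather")
print(arff)
```

Because a CSV file carries no type information, the nominal value sets come only from the values that happen to appear; "rainy" is missing from the outlook declaration here because the three sample rows never use it.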

Weka

Classifying Weather Data
- Click on Classify
- Choose bayes -> NaïveBayesSimple
- Choose trees -> J48
- Try some more
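To give a feel for what a naive Bayes classifier computes on this data, the sketch below estimates the class prior and the per-attribute value frequencies from the 14 weather instances, then multiplies them to score each class. It uses plain frequency counts with no smoothing, so it is an illustration of the idea and not Weka's implementation. Temperature is left out, the blank entries in this transcript are assumed to be "false" and "yes", and all names are made up for the example.

```python
from collections import Counter

# outlook, humidity, windy, play
ROWS = [line.split() for line in """
sunny high false no
sunny high true no
overcast high false yes
rainy high false yes
rainy normal false yes
rainy normal true no
overcast normal true yes
sunny high false no
sunny normal false yes
rainy normal false yes
sunny normal true yes
overcast high true yes
overcast normal false yes
rainy high true no
""".strip().splitlines()]


def train(rows):
    """Count classes and (attribute index, value, class) combinations."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = Counter(
        (i, v, row[-1]) for row in rows for i, v in enumerate(row[:-1])
    )
    return class_counts, value_counts


def scores(model, instance):
    """Return P(class) * product of P(value | class) for each class."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    result = {}
    for cls, n in class_counts.items():
        p = n / total  # class prior
        for i, v in enumerate(instance):
            p *= value_counts[(i, v, cls)] / n  # unsmoothed P(value | class)
        result[cls] = p
    return result


def predict(model, instance):
    s = scores(model, instance)
    return max(s, key=s.get)


model = train(ROWS)
print(scores(model, ("sunny", "high", "true")))
print(predict(model, ("sunny", "high", "true")))
```

For a sunny, humid, windy day the score for "no" (about 0.103) is well above the score for "yes" (about 0.016), so the sketch predicts no play. Without smoothing, any attribute value never seen with a class forces that class's score to zero, which is why overcast days can only be predicted as "yes" here.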

Keep Exploring
- Try the iris data set
- Does it work better?