Data Mining in Weka: Bringing It All Together


Predictive Analytics Center of Excellence (PACE)
San Diego Super Computer Center, UCSD
Data Mining Boot Camp

1 Introduction

The project assignment demonstrates an end-to-end data mining process, supported by the Weka software, for building supervised and unsupervised models for analysis. The sick.arff data set is used to illustrate the steps and actions through the process. Several versions of this file are available in the .arff format. More details and the data set description can be found in the Appendix. We will be using the Explorer component of Weka for this project.

Part 1 Data Exploration

1. Data and descriptors

The dataset for this project contains 30 attributes describing patient information related to thyroid diagnoses. The data were obtained from the Garvan Institute and consist of 9172 records from 1984 to early [...].

2. Files

The following file is supplied for the project: sick.arff (descriptor and activity values). This file can be found at: [...]

3. Exercise 1: Preprocess the Data

It is important to understand and properly preprocess the data. Some of the key factors to consider are the total number of instances, the number of attributes, the number of continuous and/or discrete attributes, the number of missing values, etc.

Step by step instructions

In the starting interface of Weka, click on the Explorer button.
In the Preprocess tab, click on the Open File button.
In the file selection interface, select the file sick.arff.

The dataset is characterized in the Current relation frame: the name, the number of instances, and the number of attributes (descriptors + class). We see in this frame that the number of instances is 3772, whereas the number of attributes is 30. The Attributes frame allows the user to modify the set of attributes using the select and remove options. Information about the selected attribute is given in the Selected attribute frame, in which a histogram depicts the attribute distribution. For example, with the descriptor T4U selected, the histogram shows the distribution of the T4U values in the dataset. Take note of the number of missing, unique and distinct values.

Select the last attribute, class, in the Attributes frame.

One can read from the Selected attribute frame that there are 3541 negative and 231 sick class examples in the dataset (3772 in total). Negative instances are depicted in blue and sick instances in red in the histogram. Note the ratio of the number of instances in each class. Does it seem balanced?

Visualization

Click on the Visualize All button at the lower right to look at the class distribution across the entire set of attributes.

Examine each one of the variables. Did you notice anything? Are there any variables you think should be removed, discretized, or manipulated in any way? Are there any duplicates? Are any strongly correlated?

Attribute Removal

Note that attribute number 28, named TBG, has 100% missing values. This attribute, together with the related attribute 27, should be removed by adding a checkmark in front of each attribute name and clicking on the Remove button below. Consider also attribute 29, the referral source, which seems irrelevant, although that may depend on the nature of the disease.

Discretization

Apply the Discretize filter to the sick dataset.

Discretize task 1 (DT1): Browse the attribute information for the sick.arff file. How many of the attributes are numeric? Write down their attribute numbers.

Discretize task 2 (DT2): In the Preprocess panel, choose the supervised Discretize filter (filters.supervised.attribute.Discretize) and apply it using the default settings. Browse the details of the attributes you wrote down in DT1. How are they different? How many distinct ranges have been created for each attribute?

Discretize task 3 (DT3): Undo the Discretize filter. Change the filter to the unsupervised Discretize filter (filters.unsupervised.attribute.Discretize) and set the bins setting to 5 (filter settings are found by right-clicking on the box to the right of the Choose button and selecting the Show properties option). Leave the other settings at their defaults and click Apply. Have a look at the attributes that you wrote down in DT1. Undo the filter and redo it with the bins set to 10. What do you think the bins setting affects?
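To make the bins setting concrete, here is a minimal sketch of equal-width binning, which is the idea behind the unsupervised filter when its other settings are left at their defaults. It is not Weka's code: the function name `equal_width_bins` and the sample values are made up for illustration.

```python
def equal_width_bins(values, bins):
    """Map each value to the index of one of `bins` equal-width intervals
    spanning the range [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values are identical, so everything falls into a single bin.
        return [0] * len(values)
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value would land one past the last bin; clamp it.
        labels.append(min(idx, bins - 1))
    return labels


# Hypothetical numeric attribute values (not taken from sick.arff).
sample = [0, 12, 25, 37, 49, 50, 63, 75, 88, 100]

five = equal_width_bins(sample, 5)   # each interval is 20 units wide
ten = equal_width_bins(sample, 10)   # each interval is 10 units wide

print("5 bins :", five)
print("10 bins:", ten)
```

With more bins, each interval covers a narrower range of the original values, so fewer distinct values are merged into the same category.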

Undo the Discretize filter and go to the Classify panel. We will start with model building in the next section, but will come back to discretization later to check how different discretization filters might influence the models produced.

Building the Clustering (Simple k-means) Model

In this exercise, we will create simple k-means models for predicting the thyroid disease outcome.

In the Clustering frame (Cluster tab), click Choose, then select the SimpleKMeans method.

Click on the Start button to build the simple k-means model. Notice that there are no sick clusters.

Click with the right mouse button on the word SimpleKMeans in the Clustering frame. The window for setting options for the k-means method pops up. Change the option numClusters to a larger number in order to create at least one cluster in which the sick class is in the majority.
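For readers who want to see what the k-means procedure does with the number of clusters and its starting points, the following is a minimal sketch of the standard assign-and-update loop. It is an illustration only, not Weka's SimpleKMeans implementation; the toy points and the starting centroids are invented, and in Weka the starting centroids are derived from the seed option rather than passed in by hand.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means: repeat (assign to nearest centroid, recompute means)
    until the assignment stops changing."""
    centroids = list(initial_centroids)
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return assignment, centroids


# Two well-separated toy groups (hypothetical data, not from sick.arff).
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]

assignment, centroids = kmeans(points, initial_centroids=[(1, 1), (9, 9)])
print(assignment)
print(centroids)
```

Because the result depends on where the centroids start, changing the seed in Weka can change the clusters that are produced, which is worth keeping in mind when comparing runs.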

How many clusters does it take? You can change other parameters as well: distance metric, seed, etc. How does that influence the clusters produced?

Exercise 2: Model Building

Building the ZeroR model

In this exercise, we will build the trivial model ZeroR, in which all instances are classified as negative (the majority class). The goal is to demonstrate that accuracy is not a suitable measure of classification performance for unbalanced datasets, in which the number of negative diagnoses is much larger than the number of sick ones.

Click on the Classify tab. The ZeroR method is already selected by default. For assessing the predictive performance of all models to be built, the 10-fold cross-validation method has also been specified by default.
Click on the Start button to build a model.

The predictive performance of the model is characterized in the right-hand Classifier output frame. The Confusion Matrix for the model is presented at the bottom of the Classifier output window. It shows that all instances have been classified as negative. Clearly, such a trivial model is unusable: it cannot be used for discovering sick patients. However, it is worth noticing that the accuracy (Correctly Classified Instances) of this trivial model is very high: [...]%. This fact clearly indicates that accuracy cannot be used for assessing the usefulness of classification models built using unbalanced datasets. For this purpose a good choice is the Kappa statistic, which is zero in this case. The Kappa statistic is an analog of the correlation coefficient. Its value is zero when there is no relation and approaches one for a very strong statistical relation between the class label and the attributes of the instances, i.e. between the classes negative or sick and the values of their descriptors. Another useful statistical characteristic is the ROC Area, for which a value near 0.5 means the lack of any statistical dependence.

Building the Naïve Bayesian Model

In this exercise, we build a Naïve Bayesian model for predicting the thyroid disease outcome. The goal is to demonstrate the ability of Weka to build statistically significant classification models for predicting the class outcome, as well as to show different ways of assessing the statistical significance and usefulness of classification models.

In the Classifier frame, click Choose, and then select the NaiveBayes method from the bayes submenu.
Click on the Start button to build a model.
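The relationship between accuracy and the Kappa statistic can be checked by hand from a confusion matrix. The sketch below uses the usual definition of Cohen's kappa (observed agreement corrected for the agreement expected by chance). The ZeroR matrix is an assumption derived from the class counts (3541 negative, and 3772 - 3541 = 231 sick, all predicted negative); the second matrix uses the Naïve Bayes counts reported in this handout, laid out with rows as actual classes and columns as predicted classes. The function name is hypothetical.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, kappa) for a square confusion matrix whose rows
    are actual classes and whose columns are predicted classes."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Class order: [negative, sick].
zero_r = [[3541, 0],      # every negative instance predicted negative
          [231, 0]]       # every sick instance also predicted negative
naive_bayes = [[3384, 157],
               [58, 173]]

zr_acc, zr_kappa = accuracy_and_kappa(zero_r)
nb_acc, nb_kappa = accuracy_and_kappa(naive_bayes)

print(f"ZeroR:       accuracy={zr_acc:.4f} kappa={zr_kappa:.3f}")
print(f"Naive Bayes: accuracy={nb_acc:.4f} kappa={nb_kappa:.3f}")
```

The two accuracies differ only slightly, while the kappa values differ sharply, which is the point the ZeroR exercise is meant to make.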

Not only did the accuracy of the model increase ([...]% to [...]%), but its real statistical significance also became much stronger. This follows from the value of the Kappa statistic, 0.58, which indicates the existence of a moderate statistical dependence. It can be analyzed using the Confusion Matrix at the bottom of the Classifier output window. Taking sick as the positive class, there are 173 true positive, 3384 true negative, 157 false positive, and 58 false negative examples. The model exhibits an excellent ROC Area value of 0.96 for the negative class and has significantly improved the ROC Area for the sick class as well. This indicates that this Naïve Bayesian model could very advantageously be used for predicting thyroid patient outcomes. This can clearly be shown by analyzing the ROC and Cost/Benefit plots.

The Naïve Bayes method provides probabilistic outputs. This means that Naïve Bayes models can assess the probability (varying from 0 to 1) that a given patient with particular characteristics is predicted as negative or sick. By moving the threshold from 0 to 1, and predicting an outcome as sick if the corresponding probability exceeds the current threshold, one can build the ROC (Receiver Operating Characteristic) curve.

Extra exercise for additional practice with the ROC Curve:

Visualize the ROC curve by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Visualize threshold curve / sick.

The ROC curve is shown in the Plot frame of the window. The X axis corresponds to the false positive rate, whereas the Y axis corresponds to the true positive rate. The color depicts the value of the threshold. A colder color (closer to blue) corresponds to a lower threshold value. All outcomes whose probability of being sick exceeds the current threshold are predicted as sick. If such a prediction is correct, the corresponding outcome is a true positive; otherwise it is a false positive.

If for some values of the threshold the true positive rate greatly exceeds the false positive rate (which is indicated by the angle A being close to 90 degrees), then the classification model with such a threshold can be used to selectively extract sick outcomes from a mixture with a large number of negative ones.
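The threshold sweep described above can be sketched in a few lines. This is an illustration under stated assumptions rather than Weka's implementation: the scores and labels are invented, an instance is counted as predicted sick when its score is greater than or equal to the threshold, and the area under the curve is approximated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate), one per
    distinct threshold, starting from (0, 0).

    scores: predicted probability of the positive (sick) class
    labels: True for actual positives, False for actual negatives
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so more instances are predicted sick.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predicted probabilities of "sick" and the true classes.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, False, False, True, False]

curve = roc_points(scores, labels)
auc = area_under_curve(curve)
print(curve)
print("ROC area:", auc)
```

A model that ranks instances no better than chance would trace the diagonal and give an area near 0.5, which matches the interpretation of ROC Area given earlier.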

In order to find the optimal value of the threshold (or the optimal fraction of patients to be predicted and diagnosed with thyroid disease), one can perform the cost/benefit analysis.

Close the window with the ROC curve.
Open the window for the cost/benefit analysis by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Cost/Benefit analysis / sick.
Click on the Minimize Cost/Benefit button at the bottom right corner of the window.
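The idea behind minimizing the cost can be sketched as a search over candidate thresholds: for each one, count the false positives and false negatives, weight them by their costs, and keep the cheapest. The code below is a simplified illustration, not what Weka runs internally. The scores, labels, function name and cost values are assumptions, and correct decisions are taken to cost nothing.

```python
def min_cost_threshold(scores, labels, cost_fp=1.0, cost_fn=1.0):
    """Return (threshold, cost) with the lowest total misclassification cost.

    An instance is predicted sick when its score >= threshold.
    cost_fp: cost of predicting sick for a patient who is actually negative
    cost_fn: cost of predicting negative for a patient who is actually sick
    """
    best_threshold, best_cost = None, None
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        cost = cost_fp * fp + cost_fn * fn
        # Strict comparison keeps the lowest threshold among equal costs.
        if best_cost is None or cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold, best_cost


# Hypothetical predicted probabilities of "sick" and the true classes.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, False, False, True, False]

equal_costs = min_cost_threshold(scores, labels)
costly_misses = min_cost_threshold(scores, labels, cost_fn=5.0)
print("equal costs:        ", equal_costs)
print("missed sick costs 5:", costly_misses)
```

In this toy data, making a missed sick patient more expensive pushes the chosen threshold down, so more patients are flagged as sick. This is why the confusion matrix at the cost-minimizing threshold can look quite different from the one obtained at the default threshold.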

Examine the Cost/Benefit Analysis window carefully. It consists of several panels. The left part of the window contains the Plot: ThresholdCurve frame with the Threshold curve (also called the Lift curve). The Threshold curve looks very similar to the ROC curve. In both of them the Y axis corresponds to the true positive rate. However, in contrast to the ROC curve, the X axis of the Threshold curve corresponds to the fraction of selected instances (the Sample Size). In other words, the Threshold curve depicts how the fraction of diseased patients retrieved depends on the fraction of instances selected from the whole dataset (i.e., only those for which the estimated probability of having thyroid disease exceeds the chosen threshold). The value of the threshold can be modified interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window. The confusion matrix for the current value of the threshold is shown in the Confusion Matrix frame at the bottom left corner of the window.

Notice that the confusion matrix for the current value of the threshold differs sharply from the one obtained previously. Why is this happening?

To answer this question, let us take a look at the right side of the window. Its bottom right corner contains the Cost Matrix frame. The left part of the frame contains the Cost matrix itself. Its four entries indicate the cost one should pay for decisions taken on the basis of the classification model. The cost values are expressed in abstract units; in case studies, however, they can be considered on a money scale, for example in US dollars. The bottom left cell of the Cost matrix defines the cost of false positives. Its default value is 1 unit. In the case of thyroid disease, this corresponds to the mean price one should pay in order to treat a patient wrongly predicted by the model as sick.
The top right cell of the Cost matrix defines the cost of false negatives. Its default value is 1 unit. In the case of thyroid disease, this corresponds to the mean price one should pay for missing a sick patient and losing a successful treatment because of a wrong prediction made by the classification model. By default, it is also assumed that there is no cost for a correct decision made using the classification model. All these settings can be changed to match the real situation of the problem at hand.

In order to find the threshold corresponding to the minimum cost, it is sufficient to press the Minimize Cost/Benefit button. This explains the aforementioned difference in confusion matrices. The initial confusion matrix corresponds to the threshold 0.5, whereas the second confusion matrix results from the value of the threshold found by minimizing the cost function.

The program compares the current value of the cost with the cost of selecting the same number of instances at random. The difference between the cost of the random selection and the current value of the cost is called Gain, indicated at the right side of the frame. In the context of thyroid disease, the Gain can be interpreted as the benefit obtained by using the classification model instead of a random selection of the same number of patients. Unfortunately, the current version of the Weka software does not provide a means of automatically maximizing the Gain function. However, this can easily be done interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window.

Close the window with the Cost/Benefit Analysis.

Extra exercise for additional practice with the Discretize filter:

Set the classifier to Naive Bayes. Select Cross-validation from the test options, run the test and record the accuracy.

Discretize task 5 (DT5): Run the same test as in task DT2, but using the unsupervised Discretize filter. Run this test three times, changing the bins setting from 5 to 10 to 20. Record the accuracy of each test result.

Discretize task 6 (DT6): Compare all 5 accuracy readings. Which test had the highest accuracy rate? What can you say about discretization in general? About supervised versus unsupervised discretization? What difference does the number of bins make when using unsupervised discretization?

Building a Classification Tree Model

In this exercise, we build a classification tree model (using the decision tree method named J48 in Weka) for predicting the thyroid disease diagnosis of the patients. The goal is to learn the possibilities offered by the Weka software for building and visualizing classification trees.

In the Classifier frame, click Choose, then select the J48 method from the trees submenu.
Click on the Start button.

The statistical parameters of the J48 model appear quite high in this case. An added bonus of individual classification trees is their interpretability. In order to view the classification tree in text mode, scroll up the text field in the Classifier output frame.

In order to obtain a more usual representation of the same tree, do the following.

Click the right mouse button on the model type trees.J48 in the Result list frame and select the menu item Visualize tree.
Resize the new window with the graphical representation of the tree.

Click with the right mouse button on the empty space in this window, and in the popup menu select the item Fit to screen.

The Tree View graphical diagram can be used to visualize decision trees. It contains two types of nodes, ovals and rectangles. Each oval contains a test on one attribute of the patient record. The branch whose label matches the attribute value of the current instance leads to the node that is queried next. The top node of the tree is queried first. The leaves of the tree, depicted by rectangles, contain the final decision: whether the current instance is classified as negative or sick.

Extra Exercise: Build the ROC curve and perform the Cost/Benefit analysis of the J48 model.
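Reading a tree from the top node down to a leaf can be sketched as a simple loop. The tree below is entirely hypothetical: its structure and thresholds are invented for illustration, are not the output of J48 on sick.arff, and carry no clinical meaning. Only the attribute names T3 and TSH come from the data set description.

```python
# Internal nodes (the ovals) test one numeric attribute against a threshold;
# leaves (the rectangles) hold the predicted class.
# NOTE: invented example tree, not the model Weka builds.
tree = {
    "attribute": "T3",
    "threshold": 1.0,
    "low": {
        "attribute": "TSH",
        "threshold": 0.5,
        "low": {"leaf": "negative"},
        "high": {"leaf": "sick"},
    },
    "high": {"leaf": "negative"},
}


def classify(node, instance):
    """Walk from the root to a leaf, following the branch that matches the
    instance's attribute value at each internal node."""
    while "leaf" not in node:
        value = instance[node["attribute"]]
        branch = "low" if value <= node["threshold"] else "high"
        node = node[branch]
    return node["leaf"]


# Two made-up patient records containing only the attributes the tree uses.
first = classify(tree, {"T3": 0.6, "TSH": 8.0})
second = classify(tree, {"T3": 2.0, "TSH": 1.5})
print(first, second)
```

Each prediction corresponds to exactly one root-to-leaf path, which is what makes a single tree easy to explain to a domain expert.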

Final Exercise: Compare the evaluations of all of the different methods you ran in this exercise. Which one would you use for this particular problem and why?

APPENDIX

UCI Machine Learning Repository: Thyroid Disease Data Set

Abstract: 10 separate databases from Garvan Institute
Data Set Characteristics: Multivariate, Domain-Theory
Attribute Characteristics: Categorical, Real
Associated Tasks: Classification
Number of Instances: 7200
Number of Attributes: 21
Missing Values? N/A
Area: Life

Source: Thyroid disease records supplied by the Garvan Institute and J. Ross Quinlan, New South Wales Institute, Sydney, Australia.

Data Set Information:

# From Garvan Institute
# Documentation: as given by Ross Quinlan
# 6 databases from the Garvan Institute in Sydney, Australia
# Approximately the following for each database:
** 2800 training (data) instances and 972 test instances
** Plenty of missing data
** 29 or so attributes, either Boolean or continuously-valued
# 2 additional databases, also from Ross Quinlan, are also here
** Hypothyroid.data and sick-euthyroid.data
** Quinlan believes that these databases have been corrupted
** Their format is highly similar to the other databases
# 1 more database of 9172 instances that cover 20 classes, and a related domain theory
# Another thyroid database from Stefan Aeberhard
** 3 classes, 215 instances, 5 attributes
** No missing values
# A Thyroid database suited for training ANNs
** 3 classes
** 3772 training instances, 3428 testing instances
** Includes cost data (donated by Peter Turney)

Attribute Information:

Classes: sick, negative.

age: continuous.
sex: M, F.
on thyroxine: f, t.
query on thyroxine: f, t.
on antithyroid medication: f, t.
sick: f, t.
pregnant: f, t.
thyroid surgery: f, t.
I131 treatment: f, t.
query hypothyroid: f, t.
query hyperthyroid: f, t.
lithium: f, t.
goitre: f, t.
tumor: f, t.
hypopituitary: f, t.
psych: f, t.
TSH measured: f, t.
TSH: continuous.
T3 measured: f, t.
T3: continuous.
TT4 measured: f, t.
TT4: continuous.
T4U measured: f, t.
T4U: continuous.
FTI measured: f, t.
FTI: continuous.
TBG measured: f, t.
TBG: continuous.
referral source: WEST, STMW, SVHC, SVI, SVHD, other.

Num Instances: 3772
Num Attributes: 30
Num Continuous: 7 (Int 1 / Real 6)
Num Discrete: 23
Missing values: 6064 / 5.4%

Relevant Papers:

Quinlan, J.R., Compton, P.J., Horn, K.A., & Lazarus, L. (1986). Inductive knowledge acquisition: A case study. In Proceedings of the Second Australian Conference on Applications of Expert Systems. Sydney, Australia.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1.


More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

from Larson Text By Susan Miertschin

from Larson Text By Susan Miertschin Decision Tree Data Mining Example from Larson Text By Susan Miertschin 1 Problem The Maximum Miniatures Marketing Department wants to do a targeted mailing gpromoting the Mythic World line of figurines.

More information

ModEco Tutorial In this tutorial you will learn how to use the basic features of the ModEco Software.

ModEco Tutorial In this tutorial you will learn how to use the basic features of the ModEco Software. ModEco Tutorial In this tutorial you will learn how to use the basic features of the ModEco Software. Contents: Getting Started Page 1 Section 1: File and Data Management Page 1 o 1.1: Loading Single Environmental

More information

Tutorial Exercises for the Weka Explorer

Tutorial Exercises for the Weka Explorer CHAPTER Tutorial Exercises for the Weka Explorer 17 The best way to learn about the Explorer interface is simply to use it. This chapter presents a series of tutorial exercises that will help you learn

More information

Snagit 10. Getting Started Guide. March 2010. 2010 TechSmith Corporation. All rights reserved.

Snagit 10. Getting Started Guide. March 2010. 2010 TechSmith Corporation. All rights reserved. Snagit 10 Getting Started Guide March 2010 2010 TechSmith Corporation. All rights reserved. Introduction If you have just a few minutes or want to know just the basics, this is the place to start. This

More information

Drawing a histogram using Excel

Drawing a histogram using Excel Drawing a histogram using Excel STEP 1: Examine the data to decide how many class intervals you need and what the class boundaries should be. (In an assignment you may be told what class boundaries to

More information

2 Decision tree + Cross-validation with R (package rpart)

2 Decision tree + Cross-validation with R (package rpart) 1 Subject Using cross-validation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of cross-validation for assessing

More information

paragraph(s). The bottom mark is for all following lines in that paragraph. The rectangle below the marks moves both marks at the same time.

paragraph(s). The bottom mark is for all following lines in that paragraph. The rectangle below the marks moves both marks at the same time. MS Word, Part 3 & 4 Office 2007 Line Numbering Sometimes it can be helpful to have every line numbered. That way, if someone else is reviewing your document they can tell you exactly which lines they have

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

SQL Server 2014 BI. Lab 04. Enhancing an E-Commerce Web Application with Analysis Services Data Mining in SQL Server 2014. Jump to the Lab Overview

SQL Server 2014 BI. Lab 04. Enhancing an E-Commerce Web Application with Analysis Services Data Mining in SQL Server 2014. Jump to the Lab Overview SQL Server 2014 BI Lab 04 Enhancing an E-Commerce Web Application with Analysis Services Data Mining in SQL Server 2014 Jump to the Lab Overview Terms of Use 2014 Microsoft Corporation. All rights reserved.

More information

PURPOSE OF GRAPHS YOU ARE ABOUT TO BUILD. To explore for a relationship between the categories of two discrete variables

PURPOSE OF GRAPHS YOU ARE ABOUT TO BUILD. To explore for a relationship between the categories of two discrete variables 3 Stacked Bar Graph PURPOSE OF GRAPHS YOU ARE ABOUT TO BUILD To explore for a relationship between the categories of two discrete variables 3.1 Introduction to the Stacked Bar Graph «As with the simple

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

How To Understand How Weka Works

How To Understand How Weka Works More Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz More Data Mining with Weka a practical course

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Integrated Accounting System for Mac OS X

Integrated Accounting System for Mac OS X Integrated Accounting System for Mac OS X Program version: 6.3 110401 2011 HansaWorld Ireland Limited, Dublin, Ireland Preface Standard Accounts is a powerful accounting system for Mac OS X. Text in square

More information

Predictive Data modeling for health care: Comparative performance study of different prediction models

Predictive Data modeling for health care: Comparative performance study of different prediction models Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath [email protected] National Institute of Industrial Engineering (NITIE) Vihar

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Accountable Care Organization Quality Explorer. Quick Start Guide

Accountable Care Organization Quality Explorer. Quick Start Guide Accountable Care Organization Quality Explorer Quick Start Guide 1 P age Background HealthLandscape (a division of the American Academy of Family Physicians [AAFP]) and the Robert Graham Center for Policy

More information

Contents WEKA Microsoft SQL Database

Contents WEKA Microsoft SQL Database WEKA User Manual Contents WEKA Introduction 3 Background information. 3 Installation. 3 Where to get WEKA... 3 Downloading Information... 3 Opening the program.. 4 Chooser Menu. 4-6 Preprocessing... 6-7

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Systems Dynamics Using Vensim Personal Learning Edition (PLE) Download Vensim PLE at http://vensim.com/freedownload.html

Systems Dynamics Using Vensim Personal Learning Edition (PLE) Download Vensim PLE at http://vensim.com/freedownload.html Systems Dynamics Using Personal Learning Edition (PLE) Download PLE at http://vensim.com/freedownload.html Quick Start Tutorial Preliminaries PLE is software designed for modeling one or more quantities

More information

Quick Help Guide (via SRX-Pro Remote)

Quick Help Guide (via SRX-Pro Remote) Quick Help Guide (via SRX-Pro Remote) 2012 i³ International Inc. The contents of this user manual are protected under copyright and computer program laws. Page 2 SRX-Pro Remote - Quick Help Guide Logging

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

IT462 Lab 5: Clustering with MS SQL Server

IT462 Lab 5: Clustering with MS SQL Server IT462 Lab 5: Clustering with MS SQL Server This lab should give you the chance to practice some of the data mining techniques you've learned in class. Preliminaries: For this lab, you will use the SQL

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Didacticiel Études de cas

Didacticiel Études de cas 1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see

More information

Using the SAS Enterprise Guide (Version 4.2)

Using the SAS Enterprise Guide (Version 4.2) 2011-2012 Using the SAS Enterprise Guide (Version 4.2) Table of Contents Overview of the User Interface... 1 Navigating the Initial Contents of the Workspace... 3 Useful Pull-Down Menus... 3 Working with

More information

SECTION 5: Finalizing Your Workbook

SECTION 5: Finalizing Your Workbook SECTION 5: Finalizing Your Workbook In this section you will learn how to: Protect a workbook Protect a sheet Protect Excel files Unlock cells Use the document inspector Use the compatibility checker Mark

More information

After you complete the survey, compare what you saw on the survey to the actual questions listed below:

After you complete the survey, compare what you saw on the survey to the actual questions listed below: Creating a Basic Survey Using Qualtrics Clayton State University has purchased a campus license to Qualtrics. Both faculty and students can use Qualtrics to create surveys that contain many different types

More information

ENVI Classic Tutorial: Classification Methods

ENVI Classic Tutorial: Classification Methods ENVI Classic Tutorial: Classification Methods Classification Methods 2 Files Used in this Tutorial 2 Examining a Landsat TM Color Image 3 Reviewing Image Colors 3 Using the Cursor Location/Value 4 Examining

More information

Integrated Invoicing and Debt Management System for Mac OS X

Integrated Invoicing and Debt Management System for Mac OS X Integrated Invoicing and Debt Management System for Mac OS X Program version: 6.3 110401 2011 HansaWorld Ireland Limited, Dublin, Ireland Preface Standard Invoicing is a powerful invoicing and debt management

More information

Authorware Install Directions for IE in Windows Vista, Windows 7, and Windows 8

Authorware Install Directions for IE in Windows Vista, Windows 7, and Windows 8 Authorware Install Directions for IE in Windows Vista, Windows 7, and Windows 8 1. Read entire document before continuing. 2. Close all browser windows. There should be no websites open. If you are using

More information

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques. International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant

More information

IBM SPSS Statistics 20 Part 1: Descriptive Statistics

IBM SPSS Statistics 20 Part 1: Descriptive Statistics CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 1: Descriptive Statistics Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the

More information

1 Topic. 2 Scilab. 2.1 What is Scilab?

1 Topic. 2 Scilab. 2.1 What is Scilab? 1 Topic Data Mining with Scilab. I know the name "Scilab" for a long time (http://www.scilab.org/en). For me, it is a tool for numerical analysis. It seemed not interesting in the context of the statistical

More information

FirstClass FAQ's An item is missing from my FirstClass desktop

FirstClass FAQ's An item is missing from my FirstClass desktop FirstClass FAQ's An item is missing from my FirstClass desktop Deleted item: If you put a item on your desktop, you can delete it. To determine what kind of item (conference-original, conference-alias,

More information

MetroBoston DataCommon Training

MetroBoston DataCommon Training MetroBoston DataCommon Training Whether you are a data novice or an expert researcher, the MetroBoston DataCommon can help you get the information you need to learn more about your community, understand

More information

Visualization of Phylogenetic Trees and Metadata

Visualization of Phylogenetic Trees and Metadata Visualization of Phylogenetic Trees and Metadata November 27, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com [email protected]

More information

Filtering Email with Microsoft Outlook

Filtering Email with Microsoft Outlook Filtering Email with Microsoft Outlook Microsoft Outlook is an email client that can retrieve and send email from various types of mail servers. It includes some advanced functionality that allows you

More information

UCO_SECURE Wireless Connection Guide: Windows 8

UCO_SECURE Wireless Connection Guide: Windows 8 1 The UCO_SECURE wireless network uses 802.1x encryption to ensure that your data is secure when it is transmitted wirelessly. This security is not enabled by default on Windows computers. In order to

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Gestation Period as a function of Lifespan

Gestation Period as a function of Lifespan This document will show a number of tricks that can be done in Minitab to make attractive graphs. We work first with the file X:\SOR\24\M\ANIMALS.MTP. This first picture was obtained through Graph Plot.

More information

Smart Connection 9 Element Labels

Smart Connection 9 Element Labels 08 Smart Connection 9 Element Labels This document is part of the documentation for Smart Connection 9 and is an extract from the former Smart Connection 9 User Guide for InDesign. For more information

More information

USER MANUAL SlimComputer

USER MANUAL SlimComputer USER MANUAL SlimComputer 1 Contents Contents...2 What is SlimComputer?...2 Introduction...3 The Rating System...3 Buttons on the Main Interface...5 Running the Main Scan...8 Restore...11 Optimizer...14

More information

Internet Explorer 7. Getting Started The Internet Explorer Window. Tabs NEW! Working with the Tab Row. Microsoft QUICK Source

Internet Explorer 7. Getting Started The Internet Explorer Window. Tabs NEW! Working with the Tab Row. Microsoft QUICK Source Microsoft QUICK Source Internet Explorer 7 Getting Started The Internet Explorer Window u v w x y { Using the Command Bar The Command Bar contains shortcut buttons for Internet Explorer tools. To expand

More information

Supervised and unsupervised learning - 1

Supervised and unsupervised learning - 1 Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

More information

DeCyder Extended Data Analysis module Version 1.0

DeCyder Extended Data Analysis module Version 1.0 GE Healthcare DeCyder Extended Data Analysis module Version 1.0 Module for DeCyder 2D version 6.5 User Manual Contents 1 Introduction 1.1 Introduction... 7 1.2 The DeCyder EDA User Manual... 9 1.3 Getting

More information

Using Excel as a Management Reporting Tool with your Minotaur Data. Exercise 1 Customer Item Profitability Reporting Tool for Management

Using Excel as a Management Reporting Tool with your Minotaur Data. Exercise 1 Customer Item Profitability Reporting Tool for Management Using Excel as a Management Reporting Tool with your Minotaur Data with Judith Kirkness These instruction sheets will help you learn: 1. How to export reports from Minotaur to Excel (these instructions

More information

Topographic Change Detection Using CloudCompare Version 1.0

Topographic Change Detection Using CloudCompare Version 1.0 Topographic Change Detection Using CloudCompare Version 1.0 Emily Kleber, Arizona State University Edwin Nissen, Colorado School of Mines J Ramón Arrowsmith, Arizona State University Introduction CloudCompare

More information

Supervised DNA barcodes species classification: analysis, comparisons and results. Tutorial. Citations

Supervised DNA barcodes species classification: analysis, comparisons and results. Tutorial. Citations Supervised DNA barcodes species classification: analysis, comparisons and results Emanuel Weitschek, Giulia Fiscon, and Giovanni Felici Citations If you use this procedure please cite: Weitschek E, Fiscon

More information

The VB development environment

The VB development environment 2 The VB development environment This chapter explains: l how to create a VB project; l how to manipulate controls and their properties at design-time; l how to run a program; l how to handle a button-click

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

Supervised Classification workflow in ENVI 4.8 using WorldView-2 imagery

Supervised Classification workflow in ENVI 4.8 using WorldView-2 imagery Supervised Classification workflow in ENVI 4.8 using WorldView-2 imagery WorldView-2 is the first commercial high-resolution satellite to provide eight spectral sensors in the visible to near-infrared

More information

ATLAS.ti for Mac OS X Getting Started

ATLAS.ti for Mac OS X Getting Started ATLAS.ti for Mac OS X Getting Started 2 ATLAS.ti for Mac OS X Getting Started Copyright 2014 by ATLAS.ti Scientific Software Development GmbH, Berlin. All rights reserved. Manual Version: 5.20140918. Updated

More information