Data Mining in Weka: Bringing It All Together
Predictive Analytics Center of Excellence (PACE)
San Diego Super Computer Center, UCSD
Data Mining Boot Camp

Introduction

The project assignment demonstrates an end-to-end data mining process supported by the Weka software to build supervised and unsupervised models for analysis. The sick.arff data set will be used to illustrate the steps and actions through the process. Several versions of this file are available in the .arff format. More details and the data set description can be found in Appendix A. We will be using the Explorer component of Weka for this project.

Part 1 Data Exploration

1. Data and descriptors

The dataset for this project contains 30 attributes describing patient information regarding thyroid diagnoses obtained from the Garvan Institute, consisting of 9172 records from 1984 to early ...

2. Files

The following file is supplied for the project:
sick.arff (descriptor and class values)
This file can be found at:

3. Exercise 1: Preprocess the Data

It is important to understand and properly preprocess the data. Some of the key factors to consider are the total number of instances, the number of attributes, the number of continuous and/or discrete attributes, the number of missing values, etc.

Step by step instructions
In the starting interface of Weka, click on the Explorer button.
In the Preprocess tab, click on the Open File button.
In the file selection interface, select the file sick.arff.
The dataset is characterized in the Current relation frame: the name, the number of instances, and the number of attributes (descriptors + class). We see in this frame that the number of instances is 3772 and the number of attributes is 30. The Attributes frame allows the user to modify the set of attributes using the select and remove options. Information about the selected attribute is given in the Selected attribute frame, in which a histogram depicts the attribute distribution. With the descriptor T4U selected, for example, the histogram shows the distribution of T4U values in the dataset. Take note of the number of missing, unique and distinct values.

Select the last attribute, class, in the Attributes frame.
One can read from the Selected attribute frame that there are 3541 negative and 231 sick class examples in the dataset (3772 in total). Negative instances are depicted in blue and sick instances in red in the histogram. Note the ratio of the number of instances in each class. Does it seem balanced?

Visualization

Click on the Visualize all button on the lower right to look at the class distribution across the entire set of attributes.
Examine each of the variables. Did you notice anything? Are there any variables you think should be removed, discretized, or manipulated in any way? Are there any duplicates? Any strongly correlated variables?

Attribute Removal

Note that attribute number 28, named TBG, has 100% missing values. This attribute, together with the related attribute 27, should be removed by adding a checkmark in front of the attribute name and clicking on the Remove button below. Consider also attribute 29, the referral source, which seems irrelevant, although that may depend on the nature of the disease.
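The check behind this removal step can be sketched outside Weka. The following is a minimal Python sketch, not part of the original exercise: the attribute names come from the data set description, but the rows are invented toy values, and the helper names (missing_fraction, fully_missing, constant_attributes) are hypothetical. It flags attributes with no observed values and attributes that carry a single observed value, which is the situation described for TBG and its companion "measured" flag.

```python
# Toy rows standing in for a few sick.arff columns (values invented for
# illustration). "?" is the marker ARFF files use for a missing value.
ATTRIBUTES = ["age", "TSH", "TBG measured", "TBG"]
ROWS = [
    ["41", "1.3", "f", "?"],
    ["23", "4.1", "f", "?"],
    ["?", "0.98", "f", "?"],
    ["70", "?", "f", "?"],
]

def missing_fraction(rows, column_index):
    """Fraction of rows whose value in the given column is missing."""
    missing = sum(1 for row in rows if row[column_index] == "?")
    return missing / len(rows)

def fully_missing(attributes, rows):
    """Names of attributes that have no observed value at all."""
    return [name for i, name in enumerate(attributes)
            if missing_fraction(rows, i) == 1.0]

def constant_attributes(attributes, rows):
    """Names of attributes with exactly one distinct observed value."""
    result = []
    for i, name in enumerate(attributes):
        observed = {row[i] for row in rows if row[i] != "?"}
        if len(observed) == 1:
            result.append(name)
    return result

if __name__ == "__main__":
    for i, name in enumerate(ATTRIBUTES):
        print(f"{name}: {missing_fraction(ROWS, i):.0%} missing")
    print("No observed values:", fully_missing(ATTRIBUTES, ROWS))
    print("Single observed value:", constant_attributes(ATTRIBUTES, ROWS))
```

An attribute in either list carries no information that a model could use, which is why both TBG and its flag are candidates for removal.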
Discretization

Apply the Discretize filter to the sick dataset.

Discretize task 1 (DT1): Browse the attribute information details of the sick.arff file. How many of the attributes are numeric? Write down their attribute numbers.

Discretize task 2 (DT2): In the Preprocess panel, choose the supervised discretization filter (filters.supervised.attribute.Discretize) and apply it using the default settings. Browse the details of the attributes you wrote down in DT1. How are they different? How many distinct ranges have been created for each attribute?
Discretize task 3 (DT3): Undo the Discretize filter. Change the filter to the unsupervised discretization filter (filters.unsupervised.attribute.Discretize) and set the bins setting to 5 (the filter settings are found by right-clicking on the box to the right of the Choose button and selecting the Show properties option). Leave the other settings at their defaults and click Apply. Have a look at the attributes that you wrote down in DT1. Undo the filter and redo it with bins set to 10. What do you think the bins setting affects?
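To get a feel for what the bins setting does before answering, here is a small Python sketch of equal-width binning, which is what the unsupervised filter does with its default settings. It is an illustration only: the function names and the sample values are made up, and the convention that a value sitting exactly on a cut point falls into the lower bin is an assumption of this sketch.

```python
def equal_width_edges(values, bins):
    """Interior cut points that split [min, max] into `bins` equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def assign_bin(value, edges):
    """Index of the range a value falls into; a value on a cut point goes low."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)

def discretize(values, bins):
    """Replace every numeric value by the index of its equal-width range."""
    edges = equal_width_edges(values, bins)
    return [assign_bin(v, edges) for v in values]

if __name__ == "__main__":
    sample = [0.0, 1.0, 2.5, 5.0, 7.5, 10.0]  # invented values
    for bins in (5, 10):
        print(bins, "bins ->", discretize(sample, bins))
```

With more bins each range is narrower, so fewer values share a range; with fewer bins more values are merged into the same nominal label.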
Undo the Discretize filter and go to the Classify panel. We will start with model building in the next section, but will come back to discretization later to check how different discretization filters influence the produced models.

Building the Clustering (Simple k-means) Model

In this exercise, we will create simple k-means models for predicting the thyroid disease outcome.

In the Clustering frame, click Choose, then select the Simple K-Means method.
Click on the Start button to build the simple k-means model. Notice that there are no sick clusters.

Click with the right mouse button on the word SimpleKMeans in the Clustering frame. The window for setting options for the k-means method pops up. Change the option numClusters to a larger number in order to create at least one cluster in which the majority of instances belong to the sick class.
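Whether a cluster counts as a "sick cluster" depends on which class holds the majority inside it. As a rough sketch of that bookkeeping, assuming the cluster assignments and class labels are available as two parallel lists (the data and the function name below are invented for illustration, not Weka output):

```python
from collections import Counter, defaultdict

def majority_class_per_cluster(cluster_ids, class_labels):
    """Map each cluster id to the class label that occurs most often in it."""
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        counts[cluster][label] += 1
    return {cluster: tally.most_common(1)[0][0]
            for cluster, tally in counts.items()}

if __name__ == "__main__":
    # Invented assignments for eight instances in three clusters.
    clusters = [0, 0, 0, 1, 1, 2, 2, 2]
    labels = ["negative", "negative", "sick", "negative",
              "negative", "sick", "sick", "negative"]
    print(majority_class_per_cluster(clusters, labels))
```

Because sick instances are rare, a cluster only gets a sick majority once the clusters are small enough to isolate them, which is why the number of clusters has to grow.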
How many clusters does it take? You can change other parameters as well (distance metric, seed, etc.). How does that influence the produced clusters?

Exercise 2: Model Building

Building the ZeroR model

In this exercise, we will build the trivial model ZeroR, in which all instances are classified as negative. The goal is to demonstrate that accuracy is not a suitable measure of classification performance for unbalanced datasets, in which the number of negative diagnoses is much larger than the number of sick ones.

Click on the Classify tab. The ZeroR method is already selected by default. For assessing the predictive performance of all models to be built, the 10-fold cross-validation method has also been specified by default.

Click on the Start button to build a model.
The predictive performance of the model is characterized in the right-hand Classifier output frame. The Confusion Matrix for the model is presented at the bottom of the Classifier output window. It shows that all instances have been classified as negative. Such a trivial model is clearly unusable for discovering sick patients. However, it is worth noticing that the accuracy (Correctly Classified Instances) of this trivial model is very high: %. This clearly indicates that accuracy cannot be used for assessing the usefulness of classification models built using unbalanced datasets. For this purpose a good choice is the Kappa statistic, which is zero in this case. The Kappa statistic is an analog of the correlation coefficient. Its value is zero when there is no relation and approaches one for a very strong statistical relation between the class label and the attributes of instances, i.e. between the classes negative or sick and the values of their descriptors. Another useful statistical characteristic is ROC Area, for which a value near 0.5 means the lack of any statistical dependence.

Building the Naïve Bayesian Model

In this exercise, we build a Naïve Bayesian model for predicting the thyroid disease outcome. The goal is to demonstrate the ability of Weka to build statistically significant classification models for predicting the class outcome, as well as to show different ways of assessing the statistical significance and usefulness of classification models.

In the classifier frame, click Choose, and then select the NaiveBayes method from the bayes submenu.

Click on the Start button to build a model.
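The contrast between accuracy and Kappa for the trivial model can be checked by hand. The Python sketch below computes both from a confusion matrix using the standard Cohen's kappa formula. The matrix is an assumption reconstructed from figures stated earlier: 3541 negative instances, with the sick count taken as the remainder of the 3772 instances (3772 - 3541 = 231), and every instance predicted negative. Function names are illustrative.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def cohen_kappa(matrix):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    row_totals = [sum(row) for row in matrix]        # actual class counts
    col_totals = [sum(col) for col in zip(*matrix)]  # predicted class counts
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:  # degenerate case: a single class everywhere
        return 0.0
    return (observed - expected) / (1.0 - expected)

# Rows are actual classes (negative, sick); columns are predicted classes.
# Assumed ZeroR outcome: everything is predicted as negative.
ZERO_R = [[3541, 0],
          [231, 0]]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(ZERO_R):.4f}")
    print(f"kappa    = {cohen_kappa(ZERO_R):.4f}")
```

Because the model always predicts the majority class, the agreement expected by chance equals the observed agreement, so Kappa is zero even though accuracy is close to 94%.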
Not only did the accuracy of the model increase ( % to %), its real statistical significance became much stronger. This follows from the value of the Kappa statistic of 0.58, which indicates the existence of moderate statistical dependence. It can be analyzed using the Confusion Matrix at the bottom of the Classifier output window. Treating sick as the positive class, there are 173 true positive, 3384 true negative, 157 false positive, and 58 false negative examples. The model exhibits an excellent ROC Area value of 0.96 for the negative class and has significantly improved the ROC Area for the sick class as well. This indicates that this Naïve Bayesian model could be used to good advantage for predicting the outcome for thyroid patients. This can be shown clearly by analyzing ROC and Cost/Benefit plots.

The Naïve Bayes method provides probabilistic outputs. This means that Naïve Bayes models can estimate the probability (varying from 0 to 1) that a given patient with particular characteristics is negative or sick. By moving the threshold from 0 to 1 and predicting an outcome as sick if the corresponding probability exceeds the current threshold, one can build the ROC (Receiver Operating Characteristic) curve.

Extra exercise for additional practice with the ROC Curve:

Visualize the ROC curve by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Visualize threshold curve / sick.

The ROC curve is shown in the Plot frame of the window. The X axis corresponds to the false positive rate, whereas the Y axis corresponds to the true positive rate. The color depicts the value of the threshold. A colder color (closer to blue) corresponds to a lower threshold value. All outcomes whose probability of being sick exceeds the current threshold are predicted as sick. If such a prediction is correct, the corresponding outcome is a true positive; otherwise it is a false positive.

If for some values of the threshold the true positive rate greatly exceeds the false positive rate (which is indicated by the angle A close to 90 degrees), then the classification model with such a threshold can be used to selectively extract sick outcomes from a mixture with a large number of negative ones.
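The threshold sweep that produces the ROC curve can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Weka's implementation: the scores and labels are invented, the function names are hypothetical, an instance is predicted sick when its score is at or above the threshold, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels, positive="sick"):
    """(false positive rate, true positive rate) pairs for a threshold sweep."""
    n_pos = sum(1 for label in labels if label == positive)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted sick
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels)
                 if s >= threshold and l == positive)
        fp = sum(1 for s, l in zip(scores, labels)
                 if s >= threshold and l != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

if __name__ == "__main__":
    # Invented predicted probabilities of "sick" for six patients.
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    labels = ["sick", "sick", "negative", "sick", "negative", "negative"]
    curve = roc_points(scores, labels)
    for fpr, tpr in curve:
        print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
    print(f"ROC area = {area_under_curve(curve):.3f}")
```

A curve that climbs steeply before moving right, as in this toy case, corresponds to thresholds at which many sick instances are retrieved with few false alarms.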
In order to find the optimal value of the threshold (or the optimal fraction of patients to be predicted as having thyroid disease), one can perform a cost/benefit analysis.

Close the window with the ROC curve.
Open the window for the cost/benefit analysis by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Cost/Benefit analysis / sick.
Click on the Minimize Cost/Benefit button at the bottom right corner of the window.
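Conceptually, minimizing the cost amounts to trying each candidate threshold and keeping the one with the lowest total misclassification cost. The Python sketch below illustrates that idea; it is not Weka's code. The scores, labels, and function names are invented, both error costs are assumed to be 1 unit unless changed, and correct decisions are assumed to cost nothing.

```python
def total_cost(scores, labels, threshold, cost_fp=1.0, cost_fn=1.0,
               positive="sick"):
    """Sum of misclassification costs when predicting sick for score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted_sick = score >= threshold
        if predicted_sick and label != positive:
            cost += cost_fp      # negative patient flagged as sick
        elif not predicted_sick and label == positive:
            cost += cost_fn      # sick patient missed
    return cost

def min_cost_threshold(scores, labels, cost_fp=1.0, cost_fn=1.0):
    """Candidate threshold with the lowest total cost (first one on ties)."""
    candidates = sorted(set(scores)) + [float("inf")]  # inf: predict nobody sick
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))

if __name__ == "__main__":
    # Invented predicted probabilities of "sick" for six patients.
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    labels = ["sick", "sick", "negative", "sick", "negative", "negative"]
    print("missed cases cost 5x :", min_cost_threshold(scores, labels, cost_fn=5.0))
    print("false alarms cost 5x:", min_cost_threshold(scores, labels, cost_fp=5.0))
```

Making missed sick patients more expensive pulls the chosen threshold down, while making false alarms more expensive pushes it up, so the selected threshold generally differs from 0.5.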
Look carefully at the window for the Cost/Benefit Analysis. It consists of several panels. The left part of the window contains the Plot: ThresholdCurve frame with the Threshold curve (also called the Lift curve). The Threshold curve looks very similar to the ROC curve. In both of them the Y axis corresponds to the true positive rate. However, in contrast to the ROC curve, the X axis of the Threshold curve corresponds to the fraction of selected instances (the Sample Size). In other words, the Threshold curve depicts how the fraction of sick patients retrieved depends on the fraction of instances selected from the whole dataset (i.e. only those for which the estimated probability of having thyroid disease exceeds the chosen threshold). The value of the threshold can be modified interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window. The confusion matrix for the current value of the threshold is shown in the Confusion Matrix frame at the bottom left corner of the window.

Note that the confusion matrix for the current value of the threshold differs sharply from the one obtained previously. Why is this happening?

To answer this question, let us take a look at the right side of the window. Its bottom right corner contains the Cost Matrix frame.

The left part of the frame contains the Cost matrix itself. Its four entries indicate the cost one pays for decisions taken on the basis of the classification model. The cost values are expressed in abstract units; in case studies they can be put on a monetary scale, for example US dollars. The bottom left cell of the Cost matrix defines the cost of false positives. Its default value is 1 unit. In the case of thyroid disease this corresponds to the mean price one pays to treat a patient wrongly predicted by the model as sick.
The top right cell of the Cost matrix defines the cost of false negatives. Its default value is 1 unit. In the case of thyroid disease this corresponds to the mean price one pays for missing a sick patient and losing a successful treatment because of a wrong prediction made by the classification model. By default, correct decisions made using the classification model carry no cost. All these settings can be changed to match the real situation of the problem at hand.

To find the threshold corresponding to the minimum cost, it is sufficient to press the Minimize Cost/Benefit button. This explains the aforementioned difference in confusion matrices. The initial confusion matrix corresponds to the threshold 0.5, whereas the second confusion matrix results from the value of the threshold found by minimizing the cost function.

The program compares the current value of the cost with the cost of selecting the same number of instances at random. The difference between the cost of random selection and the current value of the cost is called Gain, indicated at the right side of the frame. In the context of thyroid disease, the Gain can be interpreted as the benefit obtained by using the classification model instead of random selection of the same number of patients.

Unfortunately, the current version of the Weka software does not provide a means of automatically maximizing the Gain function. However, this can easily be done interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window.

Close the window with the Cost/Benefit Analysis.

Extra exercise for additional practice with the Discretize filter:

Set the classifier to Naive Bayes. Select Cross-Validation from the test options, run the test and record the accuracy.

Discretize task 5 (DT5): Run the same test as in the DT2 task, but using the unsupervised Discretize filter. Run this test three times, changing the bins setting from 5 to 10 to 20. Record the accuracy of each test result.

Discretize task 6 (DT6): Compare all 5 accuracy readings. Which test had the highest accuracy rate? What can you say about discretization in general? About supervised versus unsupervised discretization? What difference does the number of bins make in unsupervised discretization?

Building a Classification Tree Model

In this exercise, we build a classification tree model (using the decision tree method named J48 in Weka) for predicting the thyroid disease diagnosis of the patients. The goal is to learn the possibilities offered by the Weka software to build and visualize classification trees.

In the classifier frame, click Choose, then select the J48 method from the trees submenu.

Click on the Start button.
The statistical parameters of the J48 model appear quite high in this case. An added strength of individual classification trees is that they are easy to interpret.

To view the classification tree in text mode, scroll up in the text field of the Classifier output frame.

To obtain a more familiar representation of the same tree, do the following.

Click the right mouse button on the model type trees.J48 in the Result list frame and select the menu item Visualize tree.
Resize the new window with the graphical representation of the tree.
Click with the right mouse button on an empty area of this window, and in the popup menu select the item Fit to screen.

The Tree View graphical diagram can be used to visualize decision trees. It contains two types of nodes, ovals and rectangles. Each oval contains a query on one attribute of the instance. The branch whose label matches the instance's value for that attribute leads to the node that is queried next. The top node of the tree is queried first. The leaves of the tree, depicted by rectangles, contain the final decision on whether the current patient is negative or sick.

Extra Exercise:

Build the ROC curve and perform the Cost/Benefit analysis of the J48 model.
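The top-down querying described above can be mimicked with a small Python sketch. The tree here is hypothetical: the attribute names are taken from the data set description, but the structure, the split point, and the leaf labels are invented for illustration and are neither J48 output nor clinical guidance. Inner nodes play the role of ovals and leaves the role of rectangles.

```python
# Hypothetical tree: inner nodes test one attribute, leaves hold a decision.
TREE = {
    "attribute": "TSH",
    "threshold": 6.0,                    # invented split point
    "low": {"leaf": "negative"},         # branch for values <= threshold
    "high": {                            # branch for values > threshold
        "attribute": "on thyroxine",
        "branches": {
            "t": {"leaf": "negative"},
            "f": {"leaf": "sick"},
        },
    },
}

def classify(node, instance):
    """Follow branches from the top node until a leaf is reached."""
    while "leaf" not in node:
        value = instance[node["attribute"]]
        if "threshold" in node:          # numeric test
            node = node["low"] if value <= node["threshold"] else node["high"]
        else:                            # nominal test: one branch per value
            node = node["branches"][value]
    return node["leaf"]

if __name__ == "__main__":
    patient = {"TSH": 9.0, "on thyroxine": "f"}  # invented instance
    print(classify(TREE, patient))
```

Each prediction is explained by the path of tests that led to its leaf, which is what makes a single tree easy to read compared with the probabilistic models built earlier.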
Final Exercise:

Compare the evaluations of all of the different methods you ran in this exercise. Which one would you use for this particular problem and why?
APPENDIX

UCI Machine Learning Repository
Thyroid Disease Data Set

Abstract: 10 separate databases from Garavan Institute

Data Set Characteristics: Multivariate, Domain-Theory
Attribute Characteristics: Categorical, Real
Associated Tasks: Classification
Number of Instances: 7200
Number of Attributes: 21
Missing Values? N/A
Area: Life

Source:
Thyroid disease records supplied by the Garavan Institute and J. Ross Quinlan, New South Wales Institute, Sydney, Australia

Data Set Information:
# From Garavan Institute
# Documentation: as given by Ross Quinlan
# 6 databases from the Garavan Institute in Sydney, Australia
# Approximately the following for each database:
** 2800 training (data) instances and 972 test instances
** Plenty of missing data
** 29 or so attributes, either Boolean or continuously-valued
# 2 additional databases, also from Ross Quinlan, are also here
** Hypothyroid.data and sick-euthyroid.data
** Quinlan believes that these databases have been corrupted
** Their format is highly similar to the other databases
# 1 more database of 9172 instances that cover 20 classes, and a related domain theory
# Another thyroid database from Stefan Aeberhard
** 3 classes, 215 instances, 5 attributes
** No missing values
# A Thyroid database suited for training ANNs
** 3 classes
** 3772 training instances, 3428 testing instances
** Includes cost data (donated by Peter Turney)

Attribute Information:
classes: sick, negative.
age: continuous.
sex: M, F.
on thyroxine: f, t.
query on thyroxine: f, t.
on antithyroid medication: f, t.
sick: f, t.
pregnant: f, t.
thyroid surgery: f, t.
I131 treatment: f, t.
query hypothyroid: f, t.
query hyperthyroid: f, t.
lithium: f, t.
goitre: f, t.
tumor: f, t.
hypopituitary: f, t.
psych: f, t.
TSH measured: f, t.
TSH: continuous.
T3 measured: f, t.
T3: continuous.
TT4 measured: f, t.
TT4: continuous.
T4U measured: f, t.
T4U: continuous.
FTI measured: f, t.
FTI: continuous.
TBG measured: f, t.
TBG: continuous.
referral source: WEST, STMW, SVHC, SVI, SVHD, other.

Num Instances: 3772
Num Attributes: 30
Num Continuous: 7 (Int 1 / Real 6)
Num Discrete: 23
Missing values: 6064 / 5.4%

Relevant Papers:
Quinlan, J.R., Compton, P.J., Horn, K.A., & Lazurus, L. (1986). Inductive knowledge acquisition: A case study. In Proceedings of the Second Australian Conference on Applications of Expert Systems. Sydney, Australia.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1.
WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
How To Understand How Weka Works
More Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz More Data Mining with Weka a practical course
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
Integrated Accounting System for Mac OS X
Integrated Accounting System for Mac OS X Program version: 6.3 110401 2011 HansaWorld Ireland Limited, Dublin, Ireland Preface Standard Accounts is a powerful accounting system for Mac OS X. Text in square
Predictive Data modeling for health care: Comparative performance study of different prediction models
Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath [email protected] National Institute of Industrial Engineering (NITIE) Vihar
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
Accountable Care Organization Quality Explorer. Quick Start Guide
Accountable Care Organization Quality Explorer Quick Start Guide 1 P age Background HealthLandscape (a division of the American Academy of Family Physicians [AAFP]) and the Robert Graham Center for Policy
Contents WEKA Microsoft SQL Database
WEKA User Manual Contents WEKA Introduction 3 Background information. 3 Installation. 3 Where to get WEKA... 3 Downloading Information... 3 Opening the program.. 4 Chooser Menu. 4-6 Preprocessing... 6-7
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
Systems Dynamics Using Vensim Personal Learning Edition (PLE) Download Vensim PLE at http://vensim.com/freedownload.html
Systems Dynamics Using Personal Learning Edition (PLE) Download PLE at http://vensim.com/freedownload.html Quick Start Tutorial Preliminaries PLE is software designed for modeling one or more quantities
Quick Help Guide (via SRX-Pro Remote)
Quick Help Guide (via SRX-Pro Remote) 2012 i³ International Inc. The contents of this user manual are protected under copyright and computer program laws. Page 2 SRX-Pro Remote - Quick Help Guide Logging
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation
IT462 Lab 5: Clustering with MS SQL Server
IT462 Lab 5: Clustering with MS SQL Server This lab should give you the chance to practice some of the data mining techniques you've learned in class. Preliminaries: For this lab, you will use the SQL
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Didacticiel Études de cas
1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see
Using the SAS Enterprise Guide (Version 4.2)
2011-2012 Using the SAS Enterprise Guide (Version 4.2) Table of Contents Overview of the User Interface... 1 Navigating the Initial Contents of the Workspace... 3 Useful Pull-Down Menus... 3 Working with
SECTION 5: Finalizing Your Workbook
SECTION 5: Finalizing Your Workbook In this section you will learn how to: Protect a workbook Protect a sheet Protect Excel files Unlock cells Use the document inspector Use the compatibility checker Mark
After you complete the survey, compare what you saw on the survey to the actual questions listed below:
Creating a Basic Survey Using Qualtrics Clayton State University has purchased a campus license to Qualtrics. Both faculty and students can use Qualtrics to create surveys that contain many different types
ENVI Classic Tutorial: Classification Methods
ENVI Classic Tutorial: Classification Methods Classification Methods 2 Files Used in this Tutorial 2 Examining a Landsat TM Color Image 3 Reviewing Image Colors 3 Using the Cursor Location/Value 4 Examining
Integrated Invoicing and Debt Management System for Mac OS X
Integrated Invoicing and Debt Management System for Mac OS X Program version: 6.3 110401 2011 HansaWorld Ireland Limited, Dublin, Ireland Preface Standard Invoicing is a powerful invoicing and debt management
Authorware Install Directions for IE in Windows Vista, Windows 7, and Windows 8
Authorware Install Directions for IE in Windows Vista, Windows 7, and Windows 8 1. Read entire document before continuing. 2. Close all browser windows. There should be no websites open. If you are using
Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.
International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant
IBM SPSS Statistics 20 Part 1: Descriptive Statistics
CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 1: Descriptive Statistics Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the
1 Topic. 2 Scilab. 2.1 What is Scilab?
1 Topic Data Mining with Scilab. I know the name "Scilab" for a long time (http://www.scilab.org/en). For me, it is a tool for numerical analysis. It seemed not interesting in the context of the statistical
FirstClass FAQ's An item is missing from my FirstClass desktop
FirstClass FAQ's An item is missing from my FirstClass desktop Deleted item: If you put a item on your desktop, you can delete it. To determine what kind of item (conference-original, conference-alias,
MetroBoston DataCommon Training
MetroBoston DataCommon Training Whether you are a data novice or an expert researcher, the MetroBoston DataCommon can help you get the information you need to learn more about your community, understand
Visualization of Phylogenetic Trees and Metadata
Visualization of Phylogenetic Trees and Metadata November 27, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com [email protected]
Filtering Email with Microsoft Outlook
Filtering Email with Microsoft Outlook Microsoft Outlook is an email client that can retrieve and send email from various types of mail servers. It includes some advanced functionality that allows you
UCO_SECURE Wireless Connection Guide: Windows 8
1 The UCO_SECURE wireless network uses 802.1x encryption to ensure that your data is secure when it is transmitted wirelessly. This security is not enabled by default on Windows computers. In order to
Data Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
Gestation Period as a function of Lifespan
This document will show a number of tricks that can be done in Minitab to make attractive graphs. We work first with the file X:\SOR\24\M\ANIMALS.MTP. This first picture was obtained through Graph Plot.
Smart Connection 9 Element Labels
08 Smart Connection 9 Element Labels This document is part of the documentation for Smart Connection 9 and is an extract from the former Smart Connection 9 User Guide for InDesign. For more information
USER MANUAL SlimComputer
USER MANUAL SlimComputer 1 Contents Contents...2 What is SlimComputer?...2 Introduction...3 The Rating System...3 Buttons on the Main Interface...5 Running the Main Scan...8 Restore...11 Optimizer...14
Internet Explorer 7. Getting Started The Internet Explorer Window. Tabs NEW! Working with the Tab Row. Microsoft QUICK Source
Microsoft QUICK Source Internet Explorer 7 Getting Started The Internet Explorer Window u v w x y { Using the Command Bar The Command Bar contains shortcut buttons for Internet Explorer tools. To expand
Supervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
DeCyder Extended Data Analysis module Version 1.0
GE Healthcare DeCyder Extended Data Analysis module Version 1.0 Module for DeCyder 2D version 6.5 User Manual Contents 1 Introduction 1.1 Introduction... 7 1.2 The DeCyder EDA User Manual... 9 1.3 Getting
Using Excel as a Management Reporting Tool with your Minotaur Data. Exercise 1 Customer Item Profitability Reporting Tool for Management
Using Excel as a Management Reporting Tool with your Minotaur Data with Judith Kirkness These instruction sheets will help you learn: 1. How to export reports from Minotaur to Excel (these instructions
Topographic Change Detection Using CloudCompare Version 1.0
Topographic Change Detection Using CloudCompare Version 1.0 Emily Kleber, Arizona State University Edwin Nissen, Colorado School of Mines J Ramón Arrowsmith, Arizona State University Introduction CloudCompare
Supervised DNA barcodes species classification: analysis, comparisons and results. Tutorial. Citations
Supervised DNA barcodes species classification: analysis, comparisons and results Emanuel Weitschek, Giulia Fiscon, and Giovanni Felici Citations If you use this procedure please cite: Weitschek E, Fiscon
The VB development environment
2 The VB development environment This chapter explains: l how to create a VB project; l how to manipulate controls and their properties at design-time; l how to run a program; l how to handle a button-click
Simple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
Supervised Classification workflow in ENVI 4.8 using WorldView-2 imagery
Supervised Classification workflow in ENVI 4.8 using WorldView-2 imagery WorldView-2 is the first commercial high-resolution satellite to provide eight spectral sensors in the visible to near-infrared
ATLAS.ti for Mac OS X Getting Started
ATLAS.ti for Mac OS X Getting Started 2 ATLAS.ti for Mac OS X Getting Started Copyright 2014 by ATLAS.ti Scientific Software Development GmbH, Berlin. All rights reserved. Manual Version: 5.20140918. Updated
