Waffles: A Machine Learning Toolkit



Journal of Machine Learning Research 12 (2011) 2383-2387. Submitted 6/10; Revised 3/11; Published 7/11.

Mike Gashler (MIKE@AXON.CS.BYU.EDU)
Department of Computer Science
Brigham Young University
Provo, UT 84602, USA

Editor: Soeren Sonnenburg

Abstract

We present a breadth-oriented collection of cross-platform command-line tools for researchers in machine learning called Waffles. The Waffles tools are designed to offer a broad spectrum of functionality in a manner that is friendly for scripted automation. All functionality is also available in a C++ class library. Waffles is available under the GNU Lesser General Public License.

Keywords: machine learning, toolkits, data mining, C++, open source

1. Introduction

Although several open source machine learning toolkits already exist (Sonnenburg et al., 2007), many of them implicitly impose requirements regarding how they can be used. For example, some toolkits require a certain platform, language, or virtual machine. Others are designed such that tools can only be connected together with a specific plug-in, filter, or signal/slot architecture. Unfortunately, these interface differences create difficulty for those who have become familiar with a different methodology, and for those who seek to use tools from multiple toolkits together. Toolkits that use a graphical interface may be convenient for performing common experiments, but they become cumbersome when the user wishes to use a tool in a manner that was not foreseen by the interface designer, or to automate common and repetitive tasks.

Waffles is a collection of tools that seeks to provide a wide diversity of useful operations in machine learning and related fields without imposing unnecessary process or interface restrictions on the user. This is done by providing simple command-line interface (CLI) tools that perform basic tasks. The CLI is ideal for this purpose because it is well established, it is available on most common operating systems, and it is accessible from most common programming languages. Since these tools perform operations at a fairly granular level, they can be used in ways not foreseen by the interface designer. As an example, consider an experiment involving the following seven steps:

1. Use cross-validation to evaluate the accuracy of a bagging ensemble of one hundred decision trees for classifying the lymph data set (available at http://mldata.org).
2. Separate this data set into a matrix of input features and a matrix of output labels.
3. Convert the input features to real-valued vectors by representing each nominal attribute as a categorical distribution over its possible values.
4. Use principal component analysis to reduce the dimensionality of the feature vectors.
5. Use cross-validation to evaluate the accuracy of the same model on the data with reduced features.
6. Train the model using all of the reduced-dimensional data.
7. Visualize the model space represented by the ensemble.

These seven operations can be performed with the Waffles tools using the following CLI commands:

1. waffles_learn crossvalidate lymph.arff bag 100 decisiontree end
2. waffles_transform dropcolumns lymph.arff 18 > features.arff
   waffles_transform dropcolumns lymph.arff 0-17 > labels.arff
3. waffles_transform nominaltocat features.arff > f_real.arff
4. waffles_dimred pca f_real.arff 2 > f_reduced.arff
5. waffles_transform mergehoriz f_reduced.arff labels.arff > all.arff
   waffles_learn crossvalidate all.arff bag 100 decisiontree end
6. waffles_learn train all.arff bag 100 decisiontree end > ensemble.model
7. waffles_plot model ensemble.model all.arff 0 1

The cross-validation performed in step 1 returns a predictive accuracy score of 0.781. Step 5 returns a predictive accuracy score of 0.705. The plot generated by step 7 is shown in Figure 1.

It is certainly conceivable that a graphical interface could be developed that would make it easy to perform an experiment like this one. Such an interface might even provide some mechanism to automatically perform the same experiment over an array of data sets and with an array of different models. If, however, the user needs to vary a parameter specific to the experiment, such as the number of principal components, or a model-specific parameter, such as the number of trees in the ensemble, the benefits of a graphical interface are quickly outweighed by the added complexity. By contrast, a simple script that calls CLI commands to perform machine learning operations can be directly modified to vary any of the parameters (see the sketch below). Additionally, the scripting method can incorporate tools from other toolkits, or even custom-developed tools. Because nearly all programming languages can invoke CLI applications, there are few barriers to adding custom operations. Graphical tools are unlikely to offer such flexibility.

Figure 1: The model-space visualization generated by the command in step 7.
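To make the scripting claim concrete, here is a minimal Bash sketch (assuming the intermediate files f_real.arff and labels.arff from steps 2 and 3 already exist) that repeats the reduced-feature evaluation of steps 4 and 5 while varying both the number of principal components and the number of trees in the ensemble:

    #!/bin/bash
    # A sketch: sweep two parameters of the experiment above.
    # Assumes f_real.arff and labels.arff were produced by steps 2 and 3.
    for dims in 2 4 8; do
        # Step 4, with a varying number of principal components.
        waffles_dimred pca f_real.arff $dims > f_reduced.arff
        waffles_transform mergehoriz f_reduced.arff labels.arff > all.arff
        for trees in 10 50 100; do
            # Step 5, with a varying ensemble size.
            echo "dims=$dims trees=$trees"
            waffles_learn crossvalidate all.arff bag $trees decisiontree end
        done
    done

Each iteration prints its parameter settings followed by the cross-validation result, so an entire sweep can be captured with a single output redirection.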

2. Wizard

One significant reason many people prefer tools unified within a graphical interface over scriptable CLI tools is that it can be cumbersome to remember which options are available and how to construct a syntactically correct command. We solve this problem by providing a Wizard tool that guides the user through a series of forms to construct a command that will perform the desired task. A screen shot of this tool (displayed in a web browser) is shown in Figure 2.

Figure 2: A partial screen shot of the Waffles Wizard tool displayed in a web browser.

Rather than execute the selected operation directly, as most GUI tools do, the Waffles Wizard tool merely displays the CLI command that will perform the operation. The user may paste it directly into a command shell to perform the operation immediately, or may choose to incorporate it into a script. This gives the user the benefits of a GUI without the undesirable tendency to lock the user into an interface that is inflexible for scripted automation.

3. Capabilities

To highlight the capabilities of Waffles, we compare its functionality with that found in Weka (Hall et al., 2009), which at the time of this writing is the most popular machine learning toolkit by a significant margin. Our intent is not to persuade the reader to choose Waffles instead of Weka, but rather to show that many useful capabilities can be gained by using Waffles in conjunction with Weka and other toolkits that offer a CLI (a sketch of such a combination appears below).

One notable strength of Waffles is in unsupervised algorithms, particularly dimensionality reduction techniques. The Waffles tools implement principal component analysis (PCA), isomap (Tenenbaum et al., 2000), locally linear embedding (Roweis and Saul, 2000), manifold sculpting (Gashler et al., 2011a), breadth-first unfolding, neuro-PCA, cycle-cut (Gashler et al., 2011b), unsupervised backpropagation, and temporal nonlinear dimensionality reduction (Gashler et al., 2011c). Of these, only PCA is found in Weka. Waffles contains clustering techniques including k-means, k-medoids, and agglomerative clustering, and related transduction algorithms including agglomerative transduction and max-flow/min-cut transduction (Blum and Chawla, 2001).
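To show what combining toolkits looks like in practice, the following sketch reduces the lymph features with Waffles (reusing the files from Section 1) and then trains Weka's J48 decision tree on the result. It assumes weka.jar is available in the working directory; the Waffles commands are those documented above, while the Weka invocation follows Weka's standard command-line usage:

    # A sketch: Waffles preprocessing feeding a Weka learner via ARFF files.
    # Assumes f_real.arff and labels.arff from Section 1, and weka.jar on hand.
    waffles_dimred pca f_real.arff 2 > f_reduced.arff
    waffles_transform mergehoriz f_reduced.arff labels.arff > all.arff
    java -cp weka.jar weka.classifiers.trees.J48 -t all.arff

Because both toolkits read and write ordinary ARFF files, no glue code is needed; the shell itself serves as the integration layer.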

Waffles provides some of the most common supervised learning techniques, such as decision trees, multi-layer neural networks, k-nearest neighbor, and naive Bayes, as well as some less common algorithms, such as mean-margin trees (Gashler et al., 2008). The Waffles collection of supervised algorithms is much smaller than that of Weka, which implements more than 50 classification algorithms. Waffles, however, provides an interface that offers several advantages in many situations. For example, Weka requires the user to set up filters that convert data to types that each algorithm can handle. Waffles automatically handles type conversion when an algorithm receives a type that it is not designed to handle, while still permitting advanced users to specify custom filters. The Waffles algorithms also implicitly handle multi-dimensional labels.

As some algorithm-specific examples, the Waffles implementation of the multi-layer perceptron provides the ability to use a diversity of activation functions, and also supplies methods for training recurrent networks. The k-nearest neighbor algorithm automatically supports acceleration structures and sparse training data, so it is suitable for problems that require high scalability, such as document classification. As demonstrated in the first example in this paper, Waffles features a particularly convenient mechanism for creating bagging ensembles. It also provides a diversity of collaborative filtering algorithms and optimization techniques that are not found in Weka, as well as tools for linear-algebraic operations and various data-mining tasks, including attribute selection and several methods for visualization.

4. Architecture

The Waffles tools are organized into several executable applications:

1. waffles_wizard, a graphical command-building assistant,
2. waffles_learn, a collection of supervised learning techniques and algorithms,
3. waffles_transform, a collection of unsupervised data transformations,
4. waffles_plot, tools related to visualization,
5. waffles_dimred, tools for dimensionality reduction and attribute selection,
6. waffles_cluster, tools for clustering data,
7. waffles_generate, tools for sampling distributions, manifolds, etc.,
8. waffles_recommend, tools related to collaborative filtering, and
9. waffles_sparse, tools for learning with sparse matrices.

Each tool in these applications is implemented as a thin wrapper around functionality in a C++ class library called GClasses. This library is included with Waffles, so any of the functionality available in the Waffles CLI tools can also be linked into C++ applications, or into applications developed in other languages that are capable of linking with C++ libraries.

The entire Waffles project is licensed under the GNU Lesser General Public License (LGPL) version 2.1, and also later versions of the LGPL (http://www.gnu.org/licenses/lgpl.html). Some components are additionally granted more permissive licenses. Waffles uses a minimal set of dependency libraries and is carefully designed for cross-platform compatibility. It builds on Linux (with g++), Windows (with Visual C++ Express Edition), OSX (with g++), and most other common platforms. A new version of Waffles has been released approximately every six months since it was first made public in 2005. The latest version can be downloaded from http://waffles.sourceforge.net. Full documentation for the CLI tools, including many examples, and documentation for developers seeking to link with the GClasses library can also be found at that site.
In order to augment the developer documentation, several demo applications are also included with Waffles, showing how to build machine learning tools that link with functionality in the GClasses library.
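Because each application is a thin wrapper with file-based inputs and outputs, whole experiments are also easy to package for reuse. The following sketch (the script name and file names are illustrative) wraps the Section 1 pipeline in a small script parameterized by data set, so the identical experiment can be repeated across many data sets:

    #!/bin/bash
    # experiment.sh: a sketch that runs the Section 1 pipeline on any data
    # set whose label occupies the last column.
    # Usage: ./experiment.sh lymph.arff 18
    data=$1      # path to the data set, e.g., lymph.arff
    lastcol=$2   # zero-based index of the label column, e.g., 18 for lymph
    waffles_transform dropcolumns $data $lastcol > features.arff
    waffles_transform dropcolumns $data 0-$((lastcol-1)) > labels.arff
    waffles_transform nominaltocat features.arff > f_real.arff
    waffles_dimred pca f_real.arff 2 > f_reduced.arff
    waffles_transform mergehoriz f_reduced.arff labels.arff > all.arff
    waffles_learn crossvalidate all.arff bag 100 decisiontree end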

References

A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 19-26. Morgan Kaufmann Publishers Inc., 2001. ISBN 1558607781.

M. Gashler, C. Giraud-Carrier, and T. Martinez. Decision tree ensemble: Small heterogeneous is better than large homogeneous. In Seventh International Conference on Machine Learning and Applications (ICMLA '08), pages 900-905, Dec. 2008. doi: 10.1109/ICMLA.2008.154.

M. Gashler, D. Ventura, and T. Martinez. Manifold learning by graduated optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, PP(99):1-13, 2011a. ISSN 1083-4419. doi: 10.1109/TSMCB.2011.2151187.

M. Gashler, D. Ventura, and T. Martinez. Tangent space guided intelligent neighbor finding. In The 2011 International Joint Conference on Neural Networks (IJCNN), July 2011b.

M. Gashler, D. Ventura, and T. Martinez. Temporal nonlinear dimensionality reduction. In The 2011 International Joint Conference on Neural Networks (IJCNN), July 2011c.

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10-18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278.

S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.

S. Sonnenburg, M. L. Braun, C. S. Ong, S. Bengio, L. Bottou, G. Holmes, Y. LeCun, K.-R. Müller, F. Pereira, C. E. Rasmussen, et al. The need for open source software in machine learning. Journal of Machine Learning Research, 8:2443-2466, 2007.

J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.