An overview of free software tools for general data mining
|
|
|
- Jeffrey Shelton
- 10 years ago
- Views:
Transcription
1 An overview of free software tools for general data A. Jović *, K. Brkić * and N. Bogunović * * Faculty of Electrical Engineering and Computing, University of Zagreb / Department of Electronics, Microelectronics, Computer and Intelligent Systems, Unska 3, Zagreb, Croatia {alan.jovic, karla.brkic, nikola.bogunovic}@fer.hr Abstract - This expert paper describes the characteristics of six most used free software tools for general data that are available today: RapidMiner, R, Weka, KNIME, Orange, and scikit-learn. The goal is to provide the interested researcher with all the important pros and cons regarding the use of a particular tool. A comparison of the implemented algorithms covering all areas of data (classification, regression,, associative rules, feature selection, evaluation criteria, visualization, etc.) is provided. In addition, the tools' support for the more advanced and specialized research topics (big data, data streams, text, etc.) is outlined, where applicable. The tools are also compared with respect to the community support, based on the available sources. This multidimensional overview in the form of expert paper on data tools emphasizes the quality of RapidMiner, R, Weka, and KNIME platforms, but also acknowledges the significant advancements made in the other tools. I. INTRODUCTION Data (DM) is the core step in knowledge discovery in datasets. It integrates all the analysis procedures that are required in order to reveal new and relevant information to an interested user. DM includes data preparation and data modeling. Datasets may be obtained from a variety of sources, including: traditional relational databases, data warehouses, web documents, or simple local textual files. It is important to prepare data in the most efficient way in order to extract as much information as possible [1]. After preparation, various models can be constructed, depending on the research goal. In order to properly interpret the models, standard evaluation and statistical procedures are pursued. The visualization of the results is also encouraged. Free and publicly available software tools for DM have been in development for the past 20 years. The goal of these tools is to facilitate the rather complicated data analysis process and to offer all interested researchers a free alternative to commercial data analysis platforms. They do so mainly by proposing integrated environments or specialized packages on top of standard programming languages, which are often open source. This paper focuses on the review of several such tools that have grown more efficient and useful over the years, some even comparable or better in certain aspects than their commercial counterparts. In particular, the main characteristics of RapidMiner [2], R [3], Weka [4], Orange [5], KNIME [6], and scikit-learn [7] will be outlined and compared. All of these free tools have implementations of general DM tasks. Other, more specialized tools such as Elki () or Anatella (big data), and tools with small DM community support (e.g. F#, GNU Octave) are not considered here. The tools are compared in terms of general properties (language, license, etc.), core DM tasks (supported input, data preparation, data analysis, output), and support for some recent more advanced DM topics (big data, data streams, text, deep ). II. FREE DATA MINING TOOLS OVERVIEW In a recent, 2013, poll published on the influential KDnuggets portal [8], regarding the use of DM tools in a real project, it is interesting to observe that in the top 5 tools there is only one commercial tool: Excel. The domination of free tools, mainly RapidMiner and R, probably stems from the maturity and availability of a large number of machine algorithm implementations. In Fig. 1, an adapted version of the poll is shown, with only free tools listed, for comparison. Most of the modern DM tools are effectively softwarebased dataflow architectures. Some of the tools (e.g. RapidMiner, KNIME) are graphical integrated environments that enable visual component placement, connection, and dragging. The most common paradigm for such components is: pull the data in, transform, and push it further down the pipeline. The component only pulls the data in once all of its prerequisites are met. The under-the-hood implementation of such components is irrelevant for the average user, but more advanced users usually have the possibility of tackling with the code in order to improve it. Other tools (e.g. R) are just plain extensions of the underlying language in the form of specialized packages and/or GUI add-ons. General characteristics of the six DM tools are listed in Table I. All of the tools have implementations for Windows, Linux, and Mac OS X operating systems. In the large Table II, the supported machine algorithms for each DM tool are summarized. The tools either implement an algorithm (+), use an external add-on (A) to support it, show some degree of support for the procedure (S), or do not implement it (-) at all. It must be noted here that the data in Table II should be considered temporary, because most of the tools are in constant state of upgrades. Nevertheless, it is important and useful to summarize their
2 TABLE I. GENERAL CHARACTERISTICS OF THE FREE DATA MINING TOOLS Characteristic RapidMiner R Weka Orange KNIME scikit-learn RapidMiner, worldwide Univ. of Waikato, Univ. of Ljubljana, KNIME.com multiple; support: Developer: Germany development New Zealand Slovenia AG,Switzerland INRIA, Google Programming C++, Python, Python+NumPy+ Java C, Fortran, R Java Java language: Qt framew. SciPy+matplotlib License: Current version: GUI / command line: Main purpose: Community support (est.): open s. (v.5 or lower); closed s., free Starter ed. (v.6) free software, GNU GPL 2+ open source, GNU GPL 3 open source, GNU GPL 3 open source, GNU GPL 3 FreeBSD GUI general data large (~ users) both; (GUI for DM = Rattle) sci. computation and statistics very large (~ 2 M users) both both GUI command line general data large general data moderate general data moderate (~ users) machine package add-on moderate Figure 1. The community use of free DM tools, adapted from [8] capabilities so that interested users can choose the appropriate environment for handling their problem. The algorithms shown in Table II were selected based on their significance and presence in most of the tools. Although the tools may contain some additional algorithms, not all of them could have been listed. Some of these were put in the appropriate rows with label. References to the work that describe the algorithms are not provided. Table III lists the support for some of the more specialized topics in DM such as big data, text, etc. III. TOOLS DESCRIPTION A. RapidMiner RapidMiner (previously: Rapid-I, YALE) is a mature, Java-based, general DM tool currently in development by the company RapidMiner, Germany. Previous versions (v. 5 or lower) were open source. The latest one (v. 6) is proprietary for now, with several license options (Starter, Personal, Professional, Enterprise). The Starter version is free with limitations only in respect to maximum allocated memory size (1 GB) and input files (.csv, Excel). The tool has become very popular in several recent years and has a large community support. RapidMiner offers an integrating environment with visually appealing and user-friendly GUI. Everything in RapidMiner is focused on processes that may contain subprocesses. Processes contain operators in the form of visual components. Operators are implementations of DM algorithms, data sources, and data sinks. The dataflow is constructed by drag-and-drop of operators and by connecting the inputs and outputs of corresponding operators. RapidMiner also offers the option of application wizards that construct the process automatically based on the required project goals (e.g. direct marketing, churn analysis, sentiment analysis). There are tutorials available for many specific tasks so the tool has a stable curve. Although RapidMiner is quite powerful with its basic set of operators, it is the extensions that make it even more useful. Popular extensions include sets of operators for text, web, time series analysis, etc. Most of the operators from Weka are also available through extension, which increases the number of implemented DM methods (Table II). The tool has very few shortcomings. The most important one is the transition to a novel model of business. It remains to be seen whether the transition to proprietary license will limit the number of its users, but it may not be helpful. The support for deep methods and some of the more advanced specific machine algorithms (e.g. extremely randomized trees, various inductive logic programming algorithms) is currently limited. However, big data analysis via Hadoop cluster (Radoop) is supported. B. Weka Weka is a Java-based, open-source DM platform developed at the University of Waikato, New Zealand. The software is free under GNU GPL 3 for noncommercial purposes. Weka has had mostly stable
3 TABLE II. DATA MINING ALGORITHMS AND PROCEDURES SUPPORTED BY THE TOOLS Category Name RapidMiner R Weka Orange KNIME textual files (.txt,.csv) specific input format files + (e.g..arff,.xrff) A (foreign) + (.arff,.libsvm) + + Data import Excel/ spreadsheet + A (xlsx) - - A database table + A (RODBC) + + (prototype) + data from an URL Feature selection filters + A (FSelector) wrappers + A (FSelector) discretization + A ( normalization + A ( Feature transformation PCA ICA + + (fastica) MDS SVD A - random projections ID Decision tree learner Classification rules Bayesian networks Instance based Function based C4.5 A ( CART A ( + + +, A (own*, dec. stump) +, A (own*, + (dec. stump) + (own*) + (own*) 1Rule + A ( + - PART A ( + - RIPPER + A ( (subgroup discovery) A ( + (Ridor) + (CN2, subgroup discovery) Naive Bayes full bayesian network AODE A (AnDE) knn + A (class, + (DMNB, WAODE) LWL A (stats) + - regression analysis ANN SVM e.g. LBR + (log., lin., polynomial) + (MLP) + (linear, evolut., PSO) + (Gaussian proc., RVM) A ( e.g. LBR + (lin., nonlin.) + (log, lin.) A (nnet, RSNNS) A (e1071, + (LBR) + (ML-kNN) + (log., lin., Lasso, PLS, trees, mean) +(log., lin., poly, trees) + (MLP) MLP MLP, PNN + (SMO), A (libsvm) + (libsvm, liblinear) + (integr. libsvm) logistic model trees A ( + - Hybrid methods function trees Bayesian trees (DTNB), A (CBA) - Ensemble bagging + A ( + +
4 Category Name RapidMiner R Weka Orange KNIME AdaBoost + A ( + + random forest + extremely randomized trees A (random Forest) (extra Trees) rotation forest S + - LogitBoost + A ( + - option tree Hierarchical stacking + A ( + - other + (Bayesian boosting) + (boosted regression) + (Multi Boost) - linkage-based AGNES - A (cluster) BIRCH - A (birch) k-means Centroid (partition) based Distribution based Density based ANN based Association rules (unsupervised) X-means + A ( + - Partition Around Medoids fuzzy c-means - A (cluster) - A A - A (e1071) - A + EM + A (mclust) + - DBSCAN + A (fpc) + - OPTICS Self-organising map + A(RSNNS) A + GSP Apriori A (arules, + + (sparse, attr.- value) FP-growth Eclat - A (arules) Tertius A ( A (arules Sequences) + (predictive Apriori) + (Apriori-SD) holdout Evaluation methods and metrics cross-validation + A (cvtools) classif. accur., mean abs.err., MSE, κ TP, FP, FN, TN conf. matrix, precis., recall ROC, Lift chart, Cost-benefit + A (ROCR) A (ROCR) A (proc) histograms Data visualization scatterplots other plots (e.g. boxwhisker, mean/error) 3D graphs (e.g. bivar. hist., surface plots) * Own implementation of decision tree, mostly similar to C4.5 and CART S A (lattice) - - S (scatterplot3d) popularity over the years, which is mainly due to its user friendliness and the availability of a large number of implemented DM algorithms. It is still not as popular as RapidMiner or R, both in business and academic circles, mostly because of some slow and more resource demanding implementations of DM algorithms. Although it is not a single tool of choice in DM, Weka is still quite powerful and versatile, and has a large community support. Weka offers four options for DM: command-line interface (CLI), Explorer, Experimenter, and Knowledge flow. The preferred option is the Explorer which allows
5 TABLE III. SUPPORT FOR SPECIALIZED AND ADVANCED DATA MINING TASKS Name RapidMiner R Weka Orange KNIME scikit-learn Big data S (not free: S (CLI, knowl. flow, A (ff, ffbase) Radoop) distributedwekahadoop) - A S Link, graph - A (igraph, sna) A - A - Spatial data analysis - A (ggmap) - - A S Time-series A +, A(forecast) S (several time series filters) - + S (timeseries analysis Semi-super-vised S A (upclass) S - S module has bugs) + (label propagation) Data streams + A (stream) A (massiveonlineanalysis) - + S Text Paralelization A S (enterprise ed.) Deep - A (tm, RTextTools, qdap) S A A + A (snow, multicore) S - + A (joblib) S (darch: incomplete) S (Restricted Boltzmann Mach.) the definition of data source, data preparation, machine algorithms, and visualization. The Experimenter is used mainly for comparison of the performance of different algorithms on the same dataset. Knowledge flow is akin to RapidMiner's operator paradigm in a sense that it allows one to specify the dataflow using appropriately connected visual components. Although not as visually appealing and extendable as RapidMiner, Weka's Knowledge flow still does the job. Weka supports many model evaluation procedures and metrics, but lacks many data survey and visualization methods. It is also more oriented towards classification and regression problems and less towards descriptive statistics and methods, although some improvements were made recently with respect to. The support for big data, text, and semi-supervised is currently limited, while deep methods are still not considered. C. R The open-source tool and programming language of choice for statisticians, R, is also a strong option for DM tasks. R has been in development for the last 15 years and is the successor of S, a statistical language originally developed by Bell Labs in 1970s. The source code of R is written in C++, Fortran, and in R itself. It is an interpreted language and is mostly optimized for matrix based calculations, comparable in performance to commercially available Matlab and freely available GNU Octave. The main language is extended by a myriad collection of packages for all sorts of computational tasks. Some of the packages are listed in Table II in parentheses. The tool offers only a simple GUI with command-line shell for input. It is certainly not a user-friendly environment because all commands need to be entered in the R language. The curve is steep, and although simple tasks such as drawing graphs and descriptive statistics can be learned rather easy, the language's full potential is difficult to master. Some advanced users sometimes offer helpful reference cards with the list of the most significant functions [9]. From DM user's perspective, R offers very fast implementations of many machine algorithms, comparable in number to RapidMiner and Weka (from which a large number of algorithms is borrowed), and also the full prospect of statistical data visualizations methods. It has specific data types for handling big data, supports parallelization, data streams, web, graph, spatial, and many other advanced tasks, including a limited support for deep methods. R's main problem is its language, which, although highly extendable, is also a difficult one to learn thoroughly enough to become productive in DM. Advancement for DM tasks in that direction is the Rattle package (author Dr. Graham Williams and other contributors) that offers a decent GUI for R. Rattle, which is in development from 2006, is similar to Weka's Explorer in the sense of user-friendliness. It loads separate packages from R upon request for a specific analysis. Rattle uses some of the R's standard implementations of DM methods and also additional packages. The only problem with Rattle is that it cannot use all of the R's algorithms or Weka's DM implementations. Nevertheless, Rattle is user-friendly and quite popular in DM community. D. Orange Orange is a Python-based tool for DM being developed at the Bioinformatics Laboratory of the Faculty of Computer and Information Science at the University of Ljubljana. It can be used either through Python scripting as a Python plug-in, or through visual programming. Its visual programming interface, Orange Canvas, offers a structured view of supported functionalities grouped into nine categories: data operations, visualization, classification, regression, evaluation, unsupervised, association, visualization using Qt, and prototype implementations. Functionalities are visually represented by different widgets (e.g. read file, discretize, train SVM classifier etc). A short description of each widget is available within the interface. Programming is performed by placing
6 widgets on the canvas and connecting their inputs and outputs. The interface is very polished and visually appealing, offering a pleasant user experience. One apparent downside of Orange is that the number of available widgets seems limited when compared to other tools such as RapidMiner or KNIME, especially because of the lack of integration with Weka. Still, the coverage of standard data techniques is quite good, as can be seen from Table II. Furthermore, there are a number of interesting widgets currently in development that can be found in the "Prototype" category, so it is reasonable to expect that the feature set will be expanded in the future. E. scikit-learn scikit-learn is a free package in Python that extends the functionality of NumPy and SciPy packages with numerous DM algorithms. It also uses the matplotlib package for plotting charts. The package keeps improving by accepting valuable contributions from many contributors and is supported by both INRIA and Google Summer of Code. One of its main strong points is a wellwritten online documentation for all of the implemented algorithms. Well-written documentation is a requirement for any contributor and is valued more than a lot of poorly documented algorithm implementations. The package supports most of the core DM algorithms. However, several significant DM algorithm groups have been omitted currently, including classification rules and association rules. On the other hand, the package is strong in function-based methods including many general linear models and various types of SVM implementations. It is also quite fast despite being written in an interpreted language. This is mainly because the contributors are asked to optimize the code in various aspects, such as calling array based NumPy number crunching algorithm or writing wrappers for existing C/C++ implementations in Cython. Despite its advantages, the use of scikit-learn still requires that one is a skilled programmer in Python because of its command-line interface. This will detract almost anyone not versed in this language because there are other tools that do not have this assumption. F. KNIME KNIME (Konstanz Information Miner) is a generalpurpose DM tool based on the Eclipse platform, developed and maintained by the Swiss company KNIME.com AG. Its development started in 2004 at the University of Konstanz, Germany, and the initial version was released in KNIME is open-source, though commercial licenses exist for companies requiring professional technical support. According to the official website, KNIME is used by over 3000 organizations in more than 60 countries, and there seems to be a considerable community support. The tool adheres to the visual programming paradigm present in most DM tools, where building blocks are placed on a canvas and connected to obtain a visual program. In KNIME, these building blocks are called nodes, and according to the official website, more than 1000 nodes are available through the core installation and various extensions. Nodes are organized in a hierarchy and can be searched by name within an intuitive interface. Each node is documented in detail, and the documentation is automatically shown within the interface once the node is selected. A large repository of example workflows is available to facilitate quicker of the tool. One of the greatest strengths of KNIME is the integration with Weka and R. Although extensions have to be installed to enable the integration, the installation itself is trivial. Weka integration enables using almost all the functionality available in Weka as KNIME nodes, while R integration enables running R code as a step in the workflow, opening R views and models within R. Several other interesting free extensions are also available, e.g. JFreeChart extension that enables advanced charting, OpenStreetMap extension that enables working with geographical data, etc. There are also commercial extensions for more specific functionalities. Overall, KNIME seems to be one of the best choices for a user interested in a purely visual programming paradigm with a need for a large variety of nodes. IV. CONCLUSION Several DM tools were presented in this work. Overall conclusion is that there is no single best tool. Each tool has its strong points and weaknesses. Nevertheless, RapidMiner, R, Weka, and KNIME have most of the desired characteristics for a fully-functional DM platform and therefore their use can be recommended for most of the DM tasks. REFERENCES [1] D. Pyle, Data Preparation for Data Mining, San Diego: Academic Press, [2] M. Hofmann and R. Klinkenberg, RapidMiner: Data Mining Use Cases and Business Analytics Applications, Boca Raton: CRC Press, [3] Y. Zhao, R and Data Mining: Examples and Case Studies, San Diego: Academic Press, [4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, The WEKA data software: an update, SIGKDD Explorations, vol. 11, no. 1, pp , [5] J. Demšar, T. Curk, and A. Erjavec, Orange: Data Mining Toolbox in Python, Journal of Machine Learning Research, vol. 14, pp , [6] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, et al., KNIME: The Konstanz Information Miner, in Data Analysis, Machine Learning and Applications (Studies in Classification, Data Analysis, and Knowledge Organization), Springer Berlin Heidelberg, pp , [7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp , [8] G. Piatetsky, KDnuggets Annual Software Poll: RapidMiner and R vie for first place, 2013, Available at [last accessed ]: [9] Y. Zhao, R Reference Card for Data Mining, Available at: [last accessed ]
Introduction Predictive Analytics Tools: Weka
Introduction Predictive Analytics Tools: Weka Predictive Analytics Center of Excellence San Diego Supercomputer Center University of California, San Diego Tools Landscape Considerations Scale User Interface
An Introduction to WEKA. As presented by PACE
An Introduction to WEKA As presented by PACE Download and Install WEKA Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html 2 Content Intro and background Exploring WEKA Data Preparation Creating Models/
RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo 178627 Database And Data Mining Research Group
RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE Luigi Grimaudo 178627 Database And Data Mining Research Group Summary RapidMiner project Strengths How to use RapidMiner Operator
Data Mining & Data Stream Mining Open Source Tools
Data Mining & Data Stream Mining Open Source Tools Darshana Parikh, Priyanka Tirkha Student M.Tech, Dept. of CSE, Sri Balaji College Of Engg. & Tech, Jaipur, Rajasthan, India Assistant Professor, Dept.
Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.
International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant
DATA MINING ALPHA MINER
DATA MINING ALPHA MINER AlphaMiner is developed by the E-Business Technology Institute (ETI) of the University of Hong Kong under the support from the Innovation and Technology Fund (ITF) of the Government
KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: [email protected]
KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: [email protected] Outline Introduction on KNIME KNIME components Exercise: Market Basket Analysis Exercise: Customer Segmentation Exercise:
PARAMETRIC COMPARISON OF DATA MINING TOOLS
PARAMETRIC COMPARISON OF DATA MINING TOOLS Neha Chauhan 1, Nisha Gautam 2 1 Student of Master of Technology, 2 Assistant Professor, Department of Computer Science and Engineering, AP Goyal Shimla University,
Didacticiel Études de cas
1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see
Orange: Data Mining Toolbox in Python
Journal of Machine Learning Research 14 (2013) 2349-2353 Submitted 3/13; Published 8/13 Orange: Data Mining Toolbox in Python Janez Demšar Tomaž Curk Aleš Erjavec Črt Gorup Tomaž Hočevar Mitar Milutinovič
How To Understand How Weka Works
More Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz More Data Mining with Weka a practical course
The Prophecy-Prototype of Prediction modeling tool
The Prophecy-Prototype of Prediction modeling tool Ms. Ashwini Dalvi 1, Ms. Dhvni K.Shah 2, Ms. Rujul B.Desai 3, Ms. Shraddha M.Vora 4, Mr. Vaibhav G.Tailor 5 Department of Information Technology, Mumbai
Ensembles and PMML in KNIME
Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany [email protected]
Université de Montpellier 2 Hugo Alatrista-Salas : [email protected]
Université de Montpellier 2 Hugo Alatrista-Salas : [email protected] WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
Open-Source Machine Learning: R Meets Weka
Open-Source Machine Learning: R Meets Weka Kurt Hornik Christian Buchta Achim Zeileis Weka? Weka is not only a flightless endemic bird of New Zealand (Gallirallus australis, picture from Wekapedia) but
Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs
1.1 Introduction Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs For brevity, the Lavastorm Analytics Library (LAL) Predictive and Statistical Analytics Node Pack will be
WEKA KnowledgeFlow Tutorial for Version 3-5-8
WEKA KnowledgeFlow Tutorial for Version 3-5-8 Mark Hall Peter Reutemann July 14, 2008 c 2008 University of Waikato Contents 1 Introduction 2 2 Features 3 3 Components 4 3.1 DataSources..............................
DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7
DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7 UNDER THE GUIDANCE Dr. N.P. DHAVALE, DGM, INFINET Department SUBMITTED TO INSTITUTE FOR DEVELOPMENT AND RESEARCH IN BANKING TECHNOLOGY
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati
Open Source Software Tools for Anomaly Detection Analysis
Open Source Software Tools for Anomaly Detection Analysis by Robert F. Erbacher and Robinson Pino ARL-MR-0869 April 2014 Approved for public release; distribution unlimited. NOTICES Disclaimers The findings
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
Radoop: Analyzing Big Data with RapidMiner and Hadoop
Radoop: Analyzing Big Data with RapidMiner and Hadoop Zoltán Prekopcsák, Gábor Makrai, Tamás Henk, Csaba Gáspár-Papanek Budapest University of Technology and Economics, Hungary Abstract Working with large
Data Mining. SPSS Clementine 12.0. 1. Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine
Data Mining SPSS 12.0 1. Overview Spring 2010 Instructor: Dr. Masoud Yaghini Introduction Types of Models Interface Projects References Outline Introduction Introduction Three of the common data mining
Unsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
Analysis Tools and Libraries for BigData
+ Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
Waffles: A Machine Learning Toolkit
Journal of Machine Learning Research 12 (2011) 2383-2387 Submitted 6/10; Revised 3/11; Published 7/11 Waffles: A Machine Learning Toolkit Mike Gashler Department of Computer Science Brigham Young University
Data Analysis in E-Learning System of Gunadarma University by Using Knime
Data Analysis in E-Learning System of Gunadarma University by Using Knime Dian Kusuma Ningtyas tyaz tyaz [email protected] Prasetiyo [email protected] Farah Virnawati virtha
Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence 2014 http://www.iet.unipi.it/p.ducange/esercitazionibi/
Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence 2014 http://www.iet.unipi.it/p.ducange/esercitazionibi/ Email: [email protected] Office: Dipartimento di Ingegneria
WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Getting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand [email protected] ABSTRACT Ensemble Selection uses forward stepwise
Didacticiel Études de cas. Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka.
1 Subject Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka. This document extends a previous tutorial dedicated to the comparison of various implementations
KNIME Enterprise server usage and global deployment at NIBR
KNIME Enterprise server usage and global deployment at NIBR Gregory Landrum, Ph.D. NIBR Informatics Novartis Institutes for BioMedical Research, Basel 8 th KNIME Users Group Meeting Berlin, 26 February
Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.
Bonus Chapter Ten Major Predictive Analytics Vendors In This Chapter Angoss FICO IBM RapidMiner Revolution Analytics Salford Systems SAP SAS StatSoft, Inc. TIBCO This chapter highlights ten of the major
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Machine Learning in Python with scikit-learn. O Reilly Webcast Aug. 2014
Machine Learning in Python with scikit-learn O Reilly Webcast Aug. 2014 Outline Machine Learning refresher scikit-learn How the project is structured Some improvements released in 0.15 Ongoing work for
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Predicting Students Final GPA Using Decision Trees: A Case Study
Predicting Students Final GPA Using Decision Trees: A Case Study Mashael A. Al-Barrak and Muna Al-Razgan Abstract Educational data mining is the process of applying data mining tools and techniques to
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
COURSE RECOMMENDER SYSTEM IN E-LEARNING
International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
IT services for analyses of various data samples
IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical
BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL
The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University
A Comparative study of Techniques in Data Mining
A Comparative study of Techniques in Data Mining Manika Verma 1, Dr. Devarshi Mehta 2 1 Asst. Professor, Department of Computer Science, Kadi Sarva Vishwavidyalaya, Gandhinagar, India 2 Associate Professor
KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES
HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES Translating data into business value requires the right data mining and modeling techniques which uncover important patterns within
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
ViviSight: A Sophisticated, Data-driven Business Intelligence Tool for Churn and Loan Default Prediction
ViviSight: A Sophisticated, Data-driven Business Intelligence Tool for Churn and Loan Default Prediction Barun Paudel 1, T.H. Gopaluwewa 1, M.R.De. Waas Gunawardena 1, W.C.H. Wijerathna 1, Rohan Samarasinghe
Machine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19
PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations
DBTech Pro Workshop. Knowledge Discovery from Databases (KDD) Including Data Warehousing and Data Mining. Georgios Evangelidis
DBTechNet DBTech Pro Workshop Knowledge Discovery from Databases (KDD) Including Data Warehousing and Data Mining Dimitris A. Dervos [email protected] http://aetos.it.teithe.gr/~dad Georgios Evangelidis
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
How To Predict Web Site Visits
Web Site Visit Forecasting Using Data Mining Techniques Chandana Napagoda Abstract: Data mining is a technique which is used for identifying relationships between various large amounts of data in many
Knowledge Discovery in Data with FIT-Miner
Knowledge Discovery in Data with FIT-Miner Michal Šebek, Martin Hlosta and Jaroslav Zendulka Faculty of Information Technology, Brno University of Technology, Božetěchova 2, Brno {isebek,ihlosta,zendulka}@fit.vutbr.cz
Sunnie Chung. Cleveland State University
Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
Journée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
Interactive Data Mining and Visualization
Interactive Data Mining and Visualization Zhitao Qiu Abstract: Interactive analysis introduces dynamic changes in Visualization. On another hand, advanced visualization can provide different perspectives
Big-data Analytics: Challenges and Opportunities
Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at 台 灣 資 料 科 學 愛 好 者 年 會, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.)
Advanced analytics at your hands
2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously
KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics
ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of
Introduction Predictive Analytics Tools: Weka, R!
Introduction Predictive Analytics Tools: Weka, R! Predictive Analytics Center of Excellence San Diego Supercomputer Center University of California, San Diego! Available Data Mining Tools! COTs:! n IBM
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Improving spam mail filtering using classification algorithms with discretization Filter
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational
HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet (@abifet)
HUAWEI Advanced Data Science with Spark Streaming Albert Bifet (@abifet) Huawei Noah s Ark Lab Focus Intelligent Mobile Devices Data Mining & Artificial Intelligence Intelligent Telecommunication Networks
How To Make A Credit Risk Model For A Bank Account
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
Data Mining with Weka
Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz Data Mining with Weka a practical course on how to
Advanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
Machine Learning. Hands-On for Developers and Technical Professionals
Brochure More information from http://www.researchandmarkets.com/reports/2785739/ Machine Learning. Hands-On for Developers and Technical Professionals Description: Dig deep into the data with a hands-on
THE COMPARISON OF DATA MINING TOOLS
T.C. İSTANBUL KÜLTÜR UNIVERSITY THE COMPARISON OF DATA MINING TOOLS Data Warehouses and Data Mining Yrd.Doç.Dr. Ayça ÇAKMAK PEHLİVANLI Department of Computer Engineering İstanbul Kültür University submitted
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
2015 Workshops for Professors
SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
Subject Description Form
Subject Description Form Subject Code Subject Title COMP417 Data Warehousing and Data Mining Techniques in Business and Commerce Credit Value 3 Level 4 Pre-requisite / Co-requisite/ Exclusion Objectives
An Approach to Detect Spam Emails by Using Majority Voting
An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,
Fuzzy Logic in KNIME Modules for Approximate Reasoning
International Journal of Computational Intelligence Systems, Vol. 6, Supplement 1 (2013), 34-45 Fuzzy Logic in KNIME Modules for Approximate Reasoning Michael R. Berthold 1, Bernd Wiswedel 2, and Thomas
Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA
Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@ mines.edu Home Page: http://kais.mines.edu/~xwu/
Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features
Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features Charlie Berger, MS Eng, MBA Sr. Director Product Management, Data Mining and Advanced Analytics [email protected] www.twitter.com/charliedatamine
Data Mining in Education: Data Classification and Decision Tree Approach
Data Mining in Education: Data Classification and Decision Tree Approach Sonali Agarwal, G. N. Pandey, and M. D. Tiwari Abstract Educational organizations are one of the important parts of our society
Comparative Analysis of Classification Algorithms on Different Datasets using WEKA
Volume 54 No13, September 2012 Comparative Analysis of Classification Algorithms on Different Datasets using WEKA Rohit Arora MTech CSE Deptt Hindu College of Engineering Sonepat, Haryana, India Suman
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.
Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,
Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition
Brochure More information from http://www.researchandmarkets.com/reports/2171322/ Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition Description: This book reviews state-of-the-art methodologies
CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA
CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia http://anss.org.au/nsclab
R Tools Evaluation. A review by Analytics @ Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015
R Tools Evaluation A review by Analytics @ Global BI / Local & Regional Capabilities Telefónica CCDO May 2015 R Features What is? Most widely used data analysis software Used by 2M+ data scientists, statisticians
