Ensembles and PMML in KNIME
Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2

1 Department of Computer and Information Science, Universität Konstanz, Konstanz, Germany
[email protected]
2 KNIME.com AG, Technoparkstrasse 1, Zurich, Switzerland
[email protected]

ABSTRACT
In this paper we describe how ensembles can be trained, modified and applied in the open source data analysis platform KNIME. We focus on recent extensions that also allow ensembles represented in PMML to be processed. This way, ensembles generated in KNIME can be deployed to PMML scoring engines. In addition, ensembles created by other tools and represented as PMML can be applied or further processed (modified or filtered) using intuitive KNIME workflows.

Categories and Subject Descriptors
D.2.6 [Software Engineering]: Programming Environments - Graphical Environments, Interactive Environments; D.2.12 [Software Engineering]: Interoperability; H.5.2 [Information Interfaces and Presentation]: User Interfaces - Graphical User Interfaces; I.5.5 [Pattern Recognition]: Implementation - Interactive Systems

General Terms
Ensemble, PMML, KNIME

Keywords
PMML, Ensemble Models, KNIME

1. INTRODUCTION
KNIME is a user-friendly and comprehensive open source data integration, processing, analysis and exploration platform [1]. The tool provides a graphical user interface for modeling and documenting complex knowledge discovery processes, from data extraction to the application of predictive models. KNIME already uses the PMML (Predictive Model Markup Language) standard in many of its nodes. A previous publication [9] introduced KNIME's ability to include preprocessing steps in the PMML model.
Various data mining models, including for example decision trees, association analysis, clustering, neural network and regression models, can also be generated as PMML in KNIME. PMML is an XML-based exchange format for predictive data mining models [7] and is often used to ensure compatibility among different data mining systems. A data mining task defined in one tool can easily be carried out with another if the PMML description of the task is shared; because it is based on XML, it can also easily be edited outside of these tools. The release of PMML 4.0 brought support for multiple models in a single PMML document. PMML 4.1, released in 2011, further simplifies the representation of such models. A MiningModel can contain multiple other models. These models can be of different types and have an optional weight and a predicate, an expression indicating whether the model should be used or not. The most prominent application of such a model collection is ensemble learning and prediction. Ensembles are built using models from multiple, often weak, learners. The individual weak learner might not be very powerful, however combining many of them frequently increases the quality of the final prediction substantially. Two well-known ensemble learning methods are bagging [2] and boosting [5]. We show how these can be realized using a number of KNIME nodes. In bagging, models are trained in parallel on different, randomly chosen samples of the data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PMML '13, Aug 10, 2013, Chicago, Illinois, USA. Copyright 2013 ACM.
During the prediction phase all models are used for prediction and the individual predictions are aggregated into a single result. The final prediction is in most cases even more accurate than that of a single, strong predictor. In boosting, models are trained iteratively: the result of the previously built weak learner is taken into account in the next iteration. The KNIME nodes realize the AdaBoost [6] algorithm. Records misclassified in a previous iteration receive a higher weight and are more likely to be chosen in the next iteration. Subsequent weak learners will hence put more focus on these records. Starting with version 2.8, KNIME allows ensemble models to be generated and modified in the PMML format. Several new nodes enable models to be collected from multiple learners and inserted into a single ensemble model that can be used by a variety of existing data processing nodes. There is now also a predictor node for PMML ensemble models in KNIME. In the remainder of this work, the PMML support in KNIME is outlined first. This is followed by an explanation of the ensemble generation process in KNIME as well as details of the new integration of PMML mining models. In the last section we illustrate how externally created mining models can be used and edited.
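The AdaBoost reweighting scheme just described can be sketched in plain Python. This is an illustrative sketch of the generic AdaBoost recipe, not the actual implementation inside the KNIME boosting nodes; the decision stump below is a hypothetical stand-in for any weak learner.

```python
import math

def adaboost(train, predict, data, labels, rounds=10):
    """AdaBoost-style training loop: records misclassified in one round
    receive a higher weight and thus more attention in the next round.
    Labels are assumed to be -1 or +1."""
    n = len(data)
    weights = [1.0 / n] * n
    ensemble = []  # (model, alpha) pairs
    for _ in range(rounds):
        model = train(data, labels, weights)
        preds = [predict(model, x) for x in data]
        # weighted error of this round's weak learner
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((model, alpha))
        # raise the weight of misclassified records, lower the rest
        weights = [w * math.exp(alpha if p != y else -alpha)
                   for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, predict, x):
    """Weighted vote of all weak learners."""
    score = sum(alpha * predict(m, x) for m, alpha in ensemble)
    return 1 if score >= 0 else -1

def train_stump(data, labels, weights):
    """Hypothetical weak learner: the 1-D threshold stump with the
    lowest weighted error."""
    best = None
    for t in data:
        for sign in (1, -1):
            err = sum(w for w, x, y in zip(weights, data, labels)
                      if (sign if x >= t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best[1], best[2]

def predict_stump(model, x):
    t, sign = model
    return sign if x >= t else -sign
```

A weak learner that is only slightly better than random still yields a strong combined classifier after enough rounds, which is the point of the weight update: each round's errors determine the next round's focus.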
[Figure 1 workflow: a File Reader feeds a Normalizer, Numeric Binner and Number To String node, combining the PMML preprocessing steps into a single PMML document; a PMML Reader imports a model, which is combined with the recorded steps and exported by a PMML Writer]
Figure 1: An example of how KNIME includes the preprocessing steps in an externally created PMML document.

2. PMML IN KNIME
In KNIME, data manipulation steps are modeled as connected nodes. The data flows through a workflow following the connections between nodes. Each node can have one or more input and output ports. A port can represent a data table, flow variables or a model. Flow variables and models are used to pass meta information between nodes. The KNIME platform natively supports PMML documents, as it uses them to document preprocessing steps and to pass predictive models between nodes. The PMML Writer and PMML Reader nodes in KNIME not only allow models to be stored for later use but also enable models to be exported for further processing with other data mining products such as the ZEMENTIS ADAPA Decision Engine [8]. Starting with KNIME 2.6 the data preprocessing nodes provide an additional optional input and output port, denoted by blue rectangles. Through this PMML port a PMML document can be passed to the node. The passed PMML document is then enriched with the preprocessing steps generated by the node, so that all preprocessing that takes place in the node is documented in the passed document. If no document is passed, the node generates an empty document and includes only its own information. Figure 1 shows an example preprocessing workflow. Here the data set is read and then preprocessed: normalization, binning and a transformation from a numerical to a categorical value are applied. The output PMML of the Number To String node contains the information of all three steps.
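The enrichment behaviour of the optional PMML port (take a passed document, or create an empty one, and append the node's own preprocessing step) can be sketched with Python's standard XML tooling. The element names follow the PMML transformation dictionary; the min-max normalization parameters and the function name are illustrative, not KNIME internals.

```python
import xml.etree.ElementTree as ET

def enrich_with_normalization(doc, field, lo, hi):
    """Append a min-max normalization step for `field` to a PMML document,
    creating an empty document first if none was passed, as the KNIME
    preprocessing nodes do with their optional PMML port."""
    if doc is None:
        doc = ET.Element("PMML", version="4.1")
    td = doc.find("TransformationDictionary")
    if td is None:
        td = ET.SubElement(doc, "TransformationDictionary")
    derived = ET.SubElement(td, "DerivedField", name=field + "_norm",
                            optype="continuous", dataType="double")
    norm = ET.SubElement(derived, "NormContinuous", field=field)
    # map [lo, hi] linearly onto [0, 1]
    ET.SubElement(norm, "LinearNorm", orig=str(lo), norm="0")
    ET.SubElement(norm, "LinearNorm", orig=str(hi), norm="1")
    return doc
```

Calling the function twice mirrors two preprocessing nodes in a row: the second call finds the existing TransformationDictionary and simply appends its own DerivedField.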
Next an externally generated model is read using the PMML Reader and combined with the previously generated preprocessing PMML. In the last step the PMML Writer is used to export the new PMML document.

3. GENERATION OF ENSEMBLES
In this section we show how ensembles can be generated in KNIME. First we outline ensemble generation without the new PMML support, covering bagging, boosting and delegating. Afterwards the PMML support is explained using bagging as an exemplary ensemble learning scenario.

3.1 Bagging
Bagging is an ensemble learning technique that creates multiple weak predictors and combines them into a single aggregated predictor that may give better predictions than each of the original ones. Each predictor is created by training a model on a different, representative part of the data. Even though each of the predictors might make bad predictions on its own, as a collective they are often able to model the data's underlying concept better. To create a strong predictor, the weak predictions are aggregated using one of many different methods, such as taking the mean of the numeric predictions or, for a classification problem, using the class predicted by the majority of predictors. Ensemble generation was supported in KNIME versions earlier than 2.8 as well. It involves splitting the data into chunks, collecting the models in a table, iterating over the table in a second loop and applying each model to the input data. Afterwards the prediction results are collected and a voting loop calculates the final prediction. Figure 2 shows a workflow for learning and prediction that creates an ensemble this way. The data are shuffled and then split into chunks, which are processed in a loop. For each chunk a learner trains a model, which is then passed to a Model node. This node collects the models from all iterations and outputs a table with all the models once the last iteration has finished.
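A minimal Python sketch of this bagging scheme (train one model per chunk, then take a majority vote) may clarify the two loops. The mean-split learner is a hypothetical stand-in for the decision tree learner used in the workflow.

```python
import random
from collections import Counter

def bagging_train(data, labels, learn, n_chunks=5, seed=0):
    """Shuffle the data, split it into chunks and train one model per
    chunk, mirroring the Shuffle -> Chunk Loop Start -> Learner loop."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    chunks = [idx[i::n_chunks] for i in range(n_chunks)]
    return [learn([data[i] for i in c], [labels[i] for i in c])
            for c in chunks]

def bagging_predict(models, predict, x):
    """Majority vote over the individual predictions (the Voting step)."""
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]

def learn_mean_split(xs, ys):
    """Hypothetical weak learner for 1-D data with labels 0/1: threshold
    halfway between the two class means, constant if a class is absent."""
    ones = [x for x, y in zip(xs, ys) if y == 1]
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    if not ones:
        return ("const", 0)
    if not zeros:
        return ("const", 1)
    return ("thresh", (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2)

def predict_mean_split(model, x):
    kind, v = model
    return v if kind == "const" else (1 if x >= v else 0)
```

Each individual model only sees a fifth of the data, yet the vote over all five is robust even when some chunks happen to contain a single class.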
In the second loop, another specialized node, the Model Loop Start, iterates over the table of models and passes each model into a predictor node, which also has access to the data to be predicted. A Voting node finally produces the output: it collects the predictions from the predictor node and, for each record, selects the prediction made by the majority of the models.

3.2 Boosting
Another ensemble learning strategy that can be realized with KNIME nodes is boosting, in the form of the AdaBoost [6] algorithm. As briefly outlined in the introduction, AdaBoost assigns a higher weight to records that were misclassified in the previous iteration. In the subsequent iteration the sample is chosen based on this weight, so records with a higher weight are more likely to be chosen. Consequently the next base learner focuses on patterns incorrectly classified by the previous one. For boosting, KNIME offers preconfigured meta nodes, which turn setting up a boosting workflow into a fairly straightforward process. Two special meta nodes, the Boosting Learner and the Boosting Predictor, support such workflows. The content of a Boosting Learner for multilayer perceptrons is shown in Figure 3. The input to the Boosting Learner Loop Start is a data table containing the training data. In the loop, the RProp MLP Learner first trains a model on the data and then passes it to the PMML Predictor, which forms a prediction on the training data using the model. In the Boosting Learner loop end the performance of the model is evaluated and another iteration is triggered until the model is considered good enough. Then a data table with multiple models is sent to the output port. Figure 4 shows the content of the associated prediction meta node. Here the models in the table generated by the learner are used to make a prediction for the incoming data. The Boosting Predictor then selects the best prediction.
The Scorer node at the end can be used for debugging purposes, for example with a confusion matrix.
[Figure 2 workflow: Shuffle → Chunk Loop Start (divide into chunks) → Decision Tree Learner → Model node (collect models in a table); second loop: Model Loop Start (loop over models) → Decision Tree Predictor (predict based on each model) → Voting (aggregate predictions)]
Figure 2: Bagging in KNIME. As described in Section 3.1 the basic bagging methodology is shown here. First multiple decision trees are learned on subsets of the data and collected in one table using the Model node. This table is used for predicting the test data, as read from the File Reader. The prediction and final aggregation of the multiple predicted values is performed in the second loop.

[Figure 3 workflow: Boosting Learner Loop Start → RProp MLP Learner → PMML Predictor → Boosting Learner]
Figure 3: The boosting learner meta node in KNIME. Here a multilayer perceptron is trained in each loop iteration. The final algorithm is performed with a loop construct.

[Figure 4 workflow: Boosting Predictor Loop Start → PMML Predictor → Boosting Predictor, followed by a Scorer]
Figure 4: The boosting predictor meta node in KNIME. The general PMML Predictor used here means that no additional configuration is necessary.

3.3 Delegating
The last ensemble technique implemented in KNIME is a simple version of delegating [4]. In delegating, patterns wrongly predicted by the preceding base learner are forwarded to the next one. Delegating is also implemented as a loop. The Delegating Loop Start outputs all the patterns that were either incorrectly classified or classified with a low probability in the previous iteration. Inside the loop a new model is learned and passed to the first port of the Delegating Loop End. The second port receives the insufficiently classified data points from this run and sends them back to the Delegating Loop Start node. The loop finishes either when no more data points are classified incorrectly or after a fixed number of iterations. Figure 5 shows a delegating regression in KNIME.
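The delegating loop can be sketched as follows; this is an illustrative reconstruction of the loop logic, not the KNIME nodes' implementation, and the threshold stump again stands in for an arbitrary base learner.

```python
def delegating_fit(data, labels, learn, predict, max_iters=10):
    """Delegating loop: each iteration trains a model on only the records
    the previous model got wrong and 'delegates' those records onward.
    Stops when nothing is misclassified or after max_iters iterations."""
    stages = []
    remaining = list(zip(data, labels))
    for _ in range(max_iters):
        if not remaining:
            break
        xs = [x for x, _ in remaining]
        ys = [y for _, y in remaining]
        model = learn(xs, ys)
        stages.append(model)
        # keep only the records the current model still gets wrong
        remaining = [(x, y) for x, y in remaining if predict(model, x) != y]
    return stages

def learn_stump(xs, ys):
    """Hypothetical base learner: pick the threshold with the fewest
    errors, predicting 1 when x >= t."""
    return min((sum((1 if x >= t else 0) != y for x, y in zip(xs, ys)), t)
               for t in xs)[1]

def predict_stump(t, x):
    return 1 if x >= t else 0
```

On data no single stump can fit, the loop produces a chain of stages whose combination covers every record: each later stage specializes on exactly the points its predecessors delegated.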
For each table of models the predictor produces a prediction based on the results of the base learners.

3.4 PMML MiningModels
The PMML standard has a special model type for ensemble models, called a MiningModel. It serves as a container for multiple data mining models that are part of an ensemble and also defines the aggregation method to be used when making a prediction. The PMML MiningModel consists of multiple segments. Each segment contains a model and a predicate, which tells the consuming node when to use the model and when to ignore it; each segment can optionally be weighted. The model types that can be included in such an ensemble are TreeModel, NeuralNetwork, ClusteringModel, RegressionModel, GeneralRegressionModel and SupportVectorMachineModel. While it is possible to have models of different types in one MiningModel, these models must all fit the MiningModel's mining function, which can be regression, classification or clustering. Another important attribute of a MiningModel is the aggregation method that is used to create a single prediction from the results of the individual models. The PMML standard currently defines ten different aggregation methods, nine of which are supported by KNIME. These methods include majority vote, sum, average and weighted average. Model chaining is not supported, since it is currently not possible to generate nested MiningModels from flat tables.
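The segment structure just described can be illustrated by building a skeletal MiningModel with Python's standard XML library. The element and attribute names follow the PMML specification, but the fragment is deliberately incomplete: a schema-valid document would also need a Header, DataDictionary and a MiningSchema per model, all omitted here for brevity.

```python
import xml.etree.ElementTree as ET

def wrap_in_mining_model(models, weights, method="majorityVote"):
    """Wrap already-built PMML model elements into a MiningModel with one
    weighted Segment per model, following the PMML 4.1 layout: each
    Segment carries a predicate (here the always-true <True/>), an
    optional weight and the embedded model."""
    root = ET.Element("PMML", version="4.1")
    mining = ET.SubElement(root, "MiningModel",
                           functionName="classification")
    segmentation = ET.SubElement(mining, "Segmentation",
                                 multipleModelMethod=method)
    for i, (model, weight) in enumerate(zip(models, weights)):
        segment = ET.SubElement(segmentation, "Segment",
                                id=str(i + 1), weight=str(weight))
        ET.SubElement(segment, "True")  # predicate: always apply this model
        segment.append(model)
    return root
```

Because the aggregation method is an attribute of the Segmentation element, switching an ensemble from majority vote to, say, weighted average is a one-attribute edit, which is what makes MiningModels convenient to post-process with generic XML tooling.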
[Figure 5 workflow: Delegating Loop Start → Polynomial Regression (Learner) → PMML To Cell → Delegating Loop End; in parallel a PMML Predictor → Math Formula (calculate prediction error) → Row Filter (error < epsilon) → Column Filter (remove prediction)]
Figure 5: A delegating regression. Well-predicted data points are filtered out by calculating the error against the original value: data points with an error below the threshold are removed by the Row Filter and the rest is fed back to the start of the loop.

3.5 Creating PMML MiningModels
Where possible (that is, where supported in PMML), KNIME's mining nodes use PMML as the format for their models. Using special ports, learner nodes output their models as PMML, and a corresponding predictor consumes the PMML model together with the data set to be predicted. As of version 2.7, KNIME also contains nodes that natively support consuming or producing PMML MiningModels. The newly introduced nodes allow several models from one or many learners to be integrated into a single ensemble model, a so-called MiningModel as defined by [3]. There are two different approaches in KNIME to create a PMML MiningModel. An ensemble model can be created from a data table using the new Table to PMML Ensemble node. To generate the model, the table must contain one column with PMML cells and can optionally include a second, numerical column used to assign a weight to each model in the PMML column. For example, the Boosting Learner in Figure 3 and the Model node in Figure 2 produce such a data table. The second approach feeds the models directly into the new PMML Ensemble Loop End node, where they are collected and output as a PMML MiningModel in the final iteration of the loop. Model weights can be assigned to this node by using a flow variable. Flow variables are passed from node to node together with the main data, but also have an extra input port, denoted by a red circle.
3.6 Bagging realization with PMML
For bagging with PMML ensemble models, the data first has to be split into chunks using the Chunk Loop Start node. The number of chunks the data is divided into equals the number of models the ensemble will contain. Shuffling the data beforehand is recommended in order to give each individual learner a representative chunk of the data. In the loop initiated by the Chunk Loop Start, the learner is applied to each chunk of data and passes the trained model to the PMML Ensemble Loop End node, which collects the models and outputs the ensemble once the loop terminates. The aggregation method for the ensemble can be set in the configuration of this node. Additionally it is possible to assign each model a weight by passing a flow variable; this is only available with the new MiningModel-based realization. Although the Model node collects models as well, it does not take weights into account. When a flow variable for the weight is given, the loop end checks the variable's value in each iteration and assigns it as the weight of the currently trained model. While a minimal bagging workflow previously needed two loops and seven nodes, it can now be constructed with only one loop and five nodes. Figure 6 shows a bagging workflow that uses the new node. In Figure 7 the weight is read from a table and assigned via the optional flow variable port. The configuration dialog offers settings for the weight and the aggregation method.

4. WORKING WITH PMML ENSEMBLES
In this section we show how ensembles built using the PMML Ensemble Loop End or the Table to PMML Ensemble nodes can be modified and how ensemble prediction works in KNIME.

4.1 Modifying PMML MiningModels
Apart from creating models, KNIME also supports the modification of ensembles and the retrieval of a subset of models from an ensemble. This makes it possible, for example, to load an externally generated mining model, add or remove models, or modify a model's weight.
To do so, an ensemble can be transformed into a data table with a PMML column holding the models and a double column containing their weights, whereupon all of the data manipulation nodes available in KNIME can be used. The new node available for this task is the PMML Ensemble to Table node. KNIME provides several nodes to modify the resulting table, such as the Row Filter, Math Expression or Cell to PMML node. One could even imagine using the XML nodes to directly modify the PMML of each model. Figure 8 shows a workflow that loads a PMML MiningModel from a file. It then splits the ensemble into a table structure containing the individual PMML models and their weights. The table is sorted by model weight and only the top five models are retained. Finally a new ensemble mining model using only those five models is created and stored. Because PMML is an XML dialect, KNIME's XML processing capabilities can be drawn upon when modifying various parts of PMML MiningModels, with the exception of adding and removing models. The XML nodes can be used to modify data dictionaries, predicates or model parameters.

[Figure 6 workflow: Shuffle → Chunk Loop Start (divide input into chunks) → Decision Tree Learner → loop end collecting the models → PMML Ensemble Predictor]
Figure 6: Bagging using the PMML MiningModels. The prediction and aggregation of the predicted values is performed by the PMML Ensemble Predictor.

4.2 Prediction with PMML MiningModels
The generated ensemble models can, of course, also be used for prediction in KNIME. To make a prediction, each of the models in the ensemble is applied to the data to produce an individual prediction, and these are then combined into the final output. This can be done within KNIME by splitting the ensemble and applying individual predictors. However, in addition to the new PMML ensemble creation nodes, there is also a new KNIME node that can use these models to make predictions directly. The PMML Ensemble Predictor node (as used in Figures 6 and 7) has two inputs: a PMML MiningModel input and a table input for the data to be predicted. For the prediction the node internally creates a predictor for each model in the ensemble and executes it on the input data. The result is a set of predictions from multiple models, which are aggregated into a single value. Depending on the mining function of the ensemble, different aggregations can be applied, such as majority vote or weighted average. For classification, only majority vote, weighted majority vote, select first and select all are applicable, since the results of the predictions are not numeric. Results from applying regression models can only be aggregated using average, weighted average, median, max, sum, select first and select all. Clustering models can be used with (weighted) majority vote, select first and select all.
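A few of these aggregation methods can be sketched in Python to make the per-segment combination concrete. The method names follow the values of the PMML multipleModelMethod attribute; selectAll (which returns all results) and modelChain (unsupported in KNIME, as noted above) are left out, and this dispatcher is an illustration, not the predictor node's implementation.

```python
from collections import Counter
from statistics import median

def aggregate(predictions, weights, method):
    """Combine per-segment results into one prediction. `predictions`
    holds one result per segment, `weights` the per-segment weights."""
    if method == "majorityVote":
        return Counter(predictions).most_common(1)[0][0]
    if method == "weightedMajorityVote":
        tally = Counter()
        for p, w in zip(predictions, weights):
            tally[p] += w
        return tally.most_common(1)[0][0]
    if method == "average":
        return sum(predictions) / len(predictions)
    if method == "weightedAverage":
        return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)
    if method == "median":
        return median(predictions)
    if method == "sum":
        return sum(predictions)
    if method == "max":
        return max(predictions)
    if method == "selectFirst":
        return predictions[0]
    raise ValueError("unsupported aggregation: " + method)
```

The split between the two groups of methods is visible directly in the code: the voting branches work on arbitrary labels, while average, median, max and sum require numeric predictions, which is why only the former are valid for classification and clustering ensembles.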
The mining function of an ensemble is determined by an XML attribute in the PMML document, and the predictor always checks whether the selected aggregation is valid, displaying an error otherwise.

5. CONCLUSIONS
The new nodes in KNIME provide native support for PMML MiningModels, which are either generated internally by KNIME learners or loaded into KNIME from other sources using the already existing means. Compared to the prior style of storing multiple models in data tables, this new approach is easier to use and allows ensemble-based mining models to be shared with other tools. Ensemble models created by the Table to PMML Ensemble or the PMML Ensemble Loop End nodes are fully compatible with existing KNIME nodes that process PMML. They can be written to and read from the file system and stored in table cells.

6. REFERENCES
[1] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, and B. Wiswedel. KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007). Springer, 2007.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[3] DMG. PMML multiple models: Model composition, ensembles, and segmentation. Visited on May 6th.
[4] C. Ferri, P. Flach, and J. Hernández-Orallo. Delegating classifiers. In Proceedings of the Twenty-First International Conference on Machine Learning, page 37. ACM, 2004.
[5] Y. Freund, R. Schapire, and N. Abe. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
[6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[7] A. Guazzelli, W.-C. Lin, and T. Jena. PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics. CreateSpace, 2010.
[8] A. Guazzelli, K. Stathatos, and M. Zeller. Efficient deployment of predictive analytics through open standards and cloud computing. SIGKDD Explorations Newsletter, 11(1):32-38, Nov. 2009.
[9] D. Morent, K. Stathatos, W.-C. Lin, and M. R. Berthold. Comprehensive PMML preprocessing in KNIME. In Proceedings of the PMML Workshop, KDD 2011, 2011.
[Figure 7 workflow: Table Creator (table with weights) → Row Filter (select weight for iteration) → TableRow To Variable (make flow variable); Shuffle → Chunk Loop Start (divide input into chunks) → Decision Tree Learner → loop end collecting the models → PMML Ensemble Predictor]
Figure 7: Bagging using flow variables to assign weights to individual models in the ensemble. If no flow variable is selected, the weight is set to 1 for all models.

[Figure 8 workflow: PMML Reader (read ensemble) → PMML Ensemble to Table (make table) → Sorter (sort by weight) → Row Filter (take only top 5) → Table to PMML Ensemble (create new ensemble) → PMML Writer (write new ensemble)]
Figure 8: Loading an ensemble model from a file, keeping only the top 5 models and writing it back to disk.
Comparing Methods to Identify Defect Reports in a Change Management Database Elaine J. Weyuker, Thomas J. Ostrand AT&T Labs - Research 180 Park Avenue Florham Park, NJ 07932 (weyuker,ostrand)@research.att.com
Easy Execution of Data Mining Models through PMML
Easy Execution of Data Mining Models through PMML Zementis, Inc. UseR! 2009 Zementis Development, Deployment, and Execution of Predictive Models Development R allows for reliable data manipulation and
WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Anomaly Detection and Predictive Maintenance
Anomaly Detection and Predictive Maintenance Rosaria Silipo Iris Adae Christian Dietz Phil Winters [email protected] [email protected] [email protected] [email protected]
Data Mining & Data Stream Mining Open Source Tools
Data Mining & Data Stream Mining Open Source Tools Darshana Parikh, Priyanka Tirkha Student M.Tech, Dept. of CSE, Sri Balaji College Of Engg. & Tech, Jaipur, Rajasthan, India Assistant Professor, Dept.
not possible or was possible at a high cost for collecting the data.
Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
Creating Usable Customer Intelligence from Social Media Data:
Creating Usable Customer Intelligence from Social Media Data: Network Analytics meets Text Mining Killian Thiel Tobias Kötter Dr. Michael Berthold Dr. Rosaria Silipo Phil Winters [email protected]
Boosting. [email protected]
. Machine Learning Boosting Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg [email protected]
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
KNIME UGM 2014 Partner Session
KNIME UGM 2014 Partner Session DYMATRIX Stefan Weingaertner DYMATRIX CONSULTING GROUP 1 Agenda 1 Company Introduction 2 DYMATRIX Customer Intelligence Offering 3 PMML2SQL / PMML2SAS Converter 4 Uplift
Big Data, Smart Energy, and Predictive Analytics Time Series Prediction of Smart Energy Data
Big Data, Smart Energy, and Predictive Analytics Time Series Prediction of Smart Energy Data Rosaria Silipo Phil Winters [email protected] [email protected] Copyright 2013 by KNIME.com AG all
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College
2 Decision tree + Cross-validation with R (package rpart)
1 Subject Using cross-validation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of cross-validation for assessing
New Ensemble Combination Scheme
New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,
Geo-Localization of KNIME Downloads
Geo-Localization of KNIME Downloads as a static report and as a movie Thorsten Meinl Peter Ohl Christian Dietz Martin Horn Bernd Wiswedel Rosaria Silipo [email protected] [email protected] [email protected]
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
Interactive Data Mining and Visualization
Interactive Data Mining and Visualization Zhitao Qiu Abstract: Interactive analysis introduces dynamic changes in Visualization. On another hand, advanced visualization can provide different perspectives
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
CUSTOMER Presentation of SAP Predictive Analytics
SAP Predictive Analytics 2.0 2015-02-09 CUSTOMER Presentation of SAP Predictive Analytics Content 1 SAP Predictive Analytics Overview....3 2 Deployment Configurations....4 3 SAP Predictive Analytics Desktop
6.2.8 Neural networks for data mining
6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation
Maschinelles Lernen mit MATLAB
Maschinelles Lernen mit MATLAB Jérémy Huard Applikationsingenieur The MathWorks GmbH 2015 The MathWorks, Inc. 1 Machine Learning is Everywhere Image Recognition Speech Recognition Stock Prediction Medical
Back Propagation Neural Networks User Manual
Back Propagation Neural Networks User Manual Author: Lukáš Civín Library: BP_network.dll Runnable class: NeuralNetStart Document: Back Propagation Neural Networks Page 1/28 Content: 1 INTRODUCTION TO BACK-PROPAGATION
Oracle Data Miner (Extension of SQL Developer 4.0)
An Oracle White Paper September 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Integrate Oracle R Enterprise Mining Algorithms into a workflow using the SQL Query node Denny Wong Oracle Data Mining
How To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
Introduction Predictive Analytics Tools: Weka
Introduction Predictive Analytics Tools: Weka Predictive Analytics Center of Excellence San Diego Supercomputer Center University of California, San Diego Tools Landscape Considerations Scale User Interface
IBM SPSS Modeler 14.2 In-Database Mining Guide
IBM SPSS Modeler 14.2 In-Database Mining Guide Note: Before using this information and the product it supports, read the general information under Notices on p. 197. This edition applies to IBM SPSS Modeler
A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML
www.bsc.es A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML Josep Ll. Berral, Nicolas Poggi, David Carrera Workshop on Big Data Benchmarks Toronto, Canada 2015 1 Context ALOJA: framework
Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008
Ensemble Learning Better Predictions Through Diversity Todd Holloway ETech 2008 Outline Building a classifier (a tutorial example) Neighbor method Major ideas and challenges in classification Ensembles
CS570 Data Mining Classification: Ensemble Methods
CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:
Example 3: Predictive Data Mining and Deployment for a Continuous Output Variable
Página 1 de 6 Example 3: Predictive Data Mining and Deployment for a Continuous Output Variable STATISTICA Data Miner includes a complete deployment engine with various options for deploying solutions
Universal PMML Plug-in for EMC Greenplum Database
Universal PMML Plug-in for EMC Greenplum Database Delivering Massively Parallel Predictions Zementis, Inc. [email protected] USA: 6125 Cornerstone Court East, Suite #250, San Diego, CA 92121 T +1(619)
II. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
2015 Workshops for Professors
SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market
KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES
HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES Translating data into business value requires the right data mining and modeling techniques which uncover important patterns within
Make Better Decisions Through Predictive Intelligence
IBM SPSS Modeler Professional Make Better Decisions Through Predictive Intelligence Highlights Easily access, prepare and model structured data with this intuitive, visual data mining workbench Rapidly
Visualization methods for patent data
Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Impelling Heart Attack Prediction System using Data Mining and Artificial Neural Network
General Article International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347-5161 2014 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Impelling
How To Make A Credit Risk Model For A Bank Account
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
Operations Research and Knowledge Modeling in Data Mining
Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 [email protected]
Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
RapidMiner Radoop Documentation
RapidMiner Radoop Documentation Release 2.3.0 RapidMiner April 30, 2015 CONTENTS 1 Introduction 1 1.1 Preface.................................................. 1 1.2 Basic Architecture............................................
KNIME opens the Doors to Big Data. A Practical example of Integrating any Big Data Platform into KNIME
KNIME opens the Doors to Big Data A Practical example of Integrating any Big Data Platform into KNIME Tobias Koetter Rosaria Silipo [email protected] [email protected] 1 Table of Contents
The Prophecy-Prototype of Prediction modeling tool
The Prophecy-Prototype of Prediction modeling tool Ms. Ashwini Dalvi 1, Ms. Dhvni K.Shah 2, Ms. Rujul B.Desai 3, Ms. Shraddha M.Vora 4, Mr. Vaibhav G.Tailor 5 Department of Information Technology, Mumbai
DBMS / Business Intelligence, SQL Server
DBMS / Business Intelligence, SQL Server Orsys, with 30 years of experience, is providing high quality, independant State of the Art seminars and hands-on courses corresponding to the needs of IT professionals.
Beating the MLB Moneyline
Beating the MLB Moneyline Leland Chen [email protected] Andrew He [email protected] 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms
A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms Ian Davidson 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl country code 1st
TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:
Table of contents: Access Data for Analysis Data file types Format assumptions Data from Excel Information links Add multiple data tables Create & Interpret Visualizations Table Pie Chart Cross Table Treemap
IBM SPSS Neural Networks 22
IBM SPSS Neural Networks 22 Note Before using this information and the product it supports, read the information in Notices on page 21. Product Information This edition applies to version 22, release 0,
Getting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand [email protected] ABSTRACT Ensemble Selection uses forward stepwise
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
IT services for analyses of various data samples
IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical
Didacticiel Études de cas. Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka.
1 Subject Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka. This document extends a previous tutorial dedicated to the comparison of various implementations
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
