4 4 as.table.confusionmatrix as.table.confusionmatrix Save Confusion Table Results Description Usage Conversion functions for class confusionmatrix ## S3 method for class confusionmatrix as.matrix(x, what = "xtabs",...) ## S3 method for class confusionmatrix as.table(x,...) Arguments x what Details Value... not currently used an object of class confusionmatrix data to conver to matrix. Either "xtabs", "overall" or "classes" For as.table, the cross-tabulations are saved. For as.matrix, the three object types are saved in matrix format. A matrix or table Author(s) See Also Max Kuhn confusionmatrix Examples ################### ## 2 class example lvs <- c("normal", "abnormal") truth <- factor(rep(lvs, times = c(86, 258)), levels = rev(lvs)) pred <- factor(
5 avnnet.default 5 xtab <- table(pred, truth) c( rep(lvs, times = c(54, 32)), rep(lvs, times = c(27, 231))), levels = rev(lvs)) results <- confusionmatrix(xtab) as.table(results) as.matrix(results) as.matrix(results, what = "overall") as.matrix(results, what = "classes") ################### ## 3 class example xtab <- confusionmatrix(iris$species, sample(iris$species)) as.matrix(xtab) avnnet.default Neural Networks Using Model Averaging Description Usage Aggregate several neural network models ## Default S3 method: avnnet(x, y, repeats = 5, bag = FALSE, allowparallel = TRUE,...) ## S3 method for class formula avnnet(formula, data, weights,..., repeats = 5, bag = FALSE, allowparallel = TRUE, subset, na.action, contrasts = NULL) ## S3 method for class avnnet predict(object, newdata, type = c("raw", "class", "prob"),...) Arguments formula A formula of the form class ~ x1 + x x y matrix or data frame of x values for examples. matrix or data frame of target values for examples. weights (case) weights for each example if missing defaults to 1. repeats bag allowparallel the number of neural networks with different random number seeds a logical for bagging for each repeat if a parallel backend is loaded and available, should the function use it?
6 6 avnnet.default data subset na.action contrasts object newdata type Data frame from which variables specified in formula are preferentially to be taken. An index vector specifying the cases to be used in the training sample. (NOTE: If given, this argument must be named.) A function to specify the action to be taken if NAs are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.) a list of contrasts to be used for some or all of the factors appearing as variables in the model formula. an object of class avnnet as returned by avnnet. matrix or data frame of test examples. A vector is considered to be a row vector comprising a single case. Type of output, either: raw for the raw outputs, code for the predicted class or prob for the class probabilities.... arguments passed to nnet Details Following Ripley (1996), the same neural network model is fit using different random number seeds. All the resulting models are used for prediction. For regression, the output from each network are averaged. For classification, the model scores are first averaged, then translated to predicted classes. Bagging can also be used to create the models. If a parallel backend is registered, the foreach package is used to train the networks in parallel. Value For avnnet, an object of "avnnet" or "avnnet.formula". Items of interest in the output are: model repeats names a list of the models generated from nnet an echo of the model input if any predictors had only one distinct value, this is a character string of the remaining columns. Otherwise a value of NULL Author(s) These are heavily based on the nnet code from Brian Ripley. References Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge. See Also nnet, preprocess
7 bag.default 7 Examples data(bloodbrain) ## Not run: modelfit <- avnnet(bbbdescr, logbbb, size = 5, linout = TRUE, trace = FALSE) modelfit predict(modelfit, bbbdescr) ## End(Not run) bag.default A General Framework For Bagging Description Usage bag provides a framework for bagging classification or regression models. The user can provide their own functions for model building, prediction and aggregation of predictions (see Details below). bag(x,...) ## Default S3 method: bag(x, y, B = 10, vars = ncol(x), bagcontrol = NULL,...) bagcontrol(fit = NULL, predict = NULL, aggregate = NULL, downsample = FALSE, oob = TRUE, allowparallel = TRUE) ldabag plsbag nbbag ctreebag svmbag nnetbag ## S3 method for class bag predict(object, newdata = NULL,...) Arguments x y a matrix or data frame of predictors a vector of outcomes
8 8 bag.default B bagcontrol the number of bootstrap samples to train over. a list of options.... arguments to pass to the model function fit predict aggregate downsample oob allowparallel vars Details Value object newdata a function that has arguments x, y and... and produces a model object that can later be used for prediction. Example functions are found in ldabag, plsbag, nbbag, svmbag and nnetbag. a function that generates predictions for each sub-model. The function should have arguments object and x. The output of the function can be any type of object (see the example below where posterior probabilities are generated. Example functions are found in ldabag, plsbag, nbbag, svmbag and nnetbag.) a function with arguments x and type. The function that takes the output of the predict function and reduces the bagged predictions to a single prediction per sample. the type argument can be used to switch between predicting classes or class probabilities for classification models. Example functions are found in ldabag, plsbag, nbbag, svmbag and nnetbag. a logical: for classification, should the data set be randomly sampled so that each class has the same number of samples as the smallest class? a logical: should out-of-bag statistics be computed and the predictions retained? if a parallel backend is loaded and available, should the function use it? an integer. If this argument is not NULL, a random sample of size vars is taken of the predictors in each bagging iteration. If NULL, all predictors are used. an object of class bag. a matrix or data frame of samples for prediction. Note that this argument must have a non-null value The function is basically a framework where users can plug in any model in to assess the effect of bagging. Examples functions can be found in ldabag, plsbag, nbbag, svmbag and nnetbag. Each has elements fit, pred and aggregate. One note: when vars is not NULL, the sub-setting occurs prior to the fit and predict functions are called. In this way, the user probably does not need to account for the change in predictors in their functions. When using bag with train, classification models should use type = "prob" inside of the predict function so that predict.train(object, newdata, type = "prob") will work. If a parallel backend is registered, the foreach package is used to train the models in parallel. bag produces an object of class bag with elements fits control a list with two sub-objects: the fit object has the actual model fit for that bagged samples and the vars object is either NULL or a vector of integers corresponding to which predictors were sampled for that model a mirror of the arguments passed into bagcontrol
9 bagearth 9 call B dims the call the number of bagging iterations the dimensions of the training set Author(s) Max Kuhn Examples ## A simple example of bagging conditional inference regression trees: data(bloodbrain) ## treebag <- bag(bbbdescr, logbbb, B = 10, ## bagcontrol = bagcontrol(fit = ctreebag$fit, ## predict = ctreebag$pred, ## aggregate = ctreebag$aggregate)) ## An example of pooling posterior probabilities to generate class predictions data(mdrr) ## remove some zero variance predictors and linear dependencies mdrrdescr <- mdrrdescr[, -nearzerovar(mdrrdescr)] mdrrdescr <- mdrrdescr[, -findcorrelation(cor(mdrrdescr),.95)] ## basiclda <- train(mdrrdescr, mdrrclass, "lda") ## baglda2 <- train(mdrrdescr, mdrrclass, ## "bag", ## B = 10, ## bagcontrol = bagcontrol(fit = ldabag$fit, ## predict = ldabag$pred, ## aggregate = ldabag$aggregate), ## tunegrid = data.frame(vars = c((1:10)*10, ncol(mdrrdescr)))) bagearth Bagged Earth Description A bagging wrapper for multivariate adaptive regression splines (MARS) via the earth function
10 10 bagearth Usage ## S3 method for class formula bagearth(formula, data = NULL, B = 50, summary = mean, keepx = TRUE,..., subset, weights, na.action = na.omit) ## Default S3 method: bagearth(x, y, weights = NULL, B = 50, summary = mean, keepx = TRUE,...) Arguments formula A formula of the form y ~ x1 + x x y matrix or data frame of x values for examples. matrix or data frame of numeric values outcomes. weights (case) weights for each example - if missing defaults to 1. data subset na.action B Details summary keepx Data frame from which variables specified in formula are preferentially to be taken. An index vector specifying the cases to be used in the training sample. (NOTE: If given, this argument must be named.) A function to specify the action to be taken if NA s are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.) the number of bootstrap samples a function with a single argument specifying how the bagged predictions should be summarized a logical: should the original training data be kept?... arguments passed to the earth function The function computes a Earth model for each bootstap sample. Value A list with elements fit B call x oob a list of B Earth fits the number of bootstrap samples the function call either NULL or the value of x, depending on the value of keepx a matrix of performance estimates for each bootstrap sample Author(s) Max Kuhn (bagearth.formula is based on Ripley s nnet.formula)
11 bagfda 11 References J. Friedman, Multivariate Adaptive Regression Splines (with discussion) (1991). Annals of Statistics, 19/1, See Also earth, predict.bagearth Examples ## Not run: library(mda) library(earth) data(trees) fit1 <- earth(trees[,-3], trees[,3]) fit2 <- bagearth(trees[,-3], trees[,3], B = 10) ## End(Not run) bagfda Bagged FDA Description Usage A bagging wrapper for flexible discriminant analysis (FDA) using multivariate adaptive regression splines (MARS) basis functions bagfda(x,...) ## S3 method for class formula bagfda(formula, data = NULL, B = 50, keepx = TRUE,..., subset, weights, na.action = na.omit) ## Default S3 method: bagfda(x, y, weights = NULL, B = 50, keepx = TRUE,...) Arguments formula A formula of the form y ~ x1 + x x y matrix or data frame of x values for examples. matrix or data frame of numeric values outcomes. weights (case) weights for each example - if missing defaults to 1. data subset Data frame from which variables specified in formula are preferentially to be taken. An index vector specifying the cases to be used in the training sample. (NOTE: If given, this argument must be named.)
12 12 bagfda na.action B Details Value keepx A function to specify the action to be taken if NA s are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.) the number of bootstrap samples a logical: should the original training data be kept?... arguments passed to the mars function The function computes a FDA model for each bootstap sample. A list with elements fit B call x oob a list of B FDA fits the number of bootstrap samples the function call either NULL or the value of x, depending on the value of keepx a matrix of performance estimates for each bootstrap sample Author(s) Max Kuhn (bagfda.formula is based on Ripley s nnet.formula) References J. Friedman, Multivariate Adaptive Regression Splines (with discussion) (1991). Annals of Statistics, 19/1, See Also fda, predict.bagfda Examples library(mlbench) library(earth) data(glass) set.seed(36) intrain <- sample(1:dim(glass), 150) traindata <- Glass[ intrain, ] testdata <- Glass[-inTrain, ] baggedfit <- bagfda(type ~., traindata)
13 BloodBrain 13 confusionmatrix(predict(baggedfit, testdata[, -10]), testdata[, 10]) BloodBrain Blood Brain Barrier Data Description Mente and Lombardo (2005) develop models to predict the log of the ratio of the concentration of a compound in the brain and the concentration in blood. For each compound, they computed three sets of molecular descriptors: MOE 2D, rule-of-five and Charge Polar Surface Area (CPSA). In all, 134 descriptors were calculated. Included in this package are 208 non-proprietary literature compounds. The vector logbbb contains the concentration ratio and the data fame bbbdescr contains the descriptor values. Usage data(bloodbrain) Value bbbdescr logbbb data frame of chemical descriptors vector of assay results Source Mente, S.R. and Lombardo, F. (2005). A recursive-partitioning model for blood-brain barrier permeation, Journal of Computer-Aided Molecular Design, Vol. 19, pg BoxCoxTrans.default Box-Cox and Exponential Transformations Description These classes can be used to estimate transformations and apply them to existing and future data
14 14 BoxCoxTrans.default Usage BoxCoxTrans(y,...) expotrans(y,...) ## Default S3 method: BoxCoxTrans(y, x = rep(1, length(y)), fudge = 0.2, numunique = 3, na.rm = FALSE,...) ## Default S3 method: expotrans(y, na.rm = TRUE, init = 0, lim = c(-4, 4), method = "Brent", numunique = 3,...) ## S3 method for class BoxCoxTrans predict(object, newdata,...) ## S3 method for class expotrans predict(object, newdata,...) Arguments y x fudge numunique na.rm a numeric vector of data to be transformed. For BoxCoxTrans, the data must be strictly positive. an optional dependent variable to be used in a linear model. a tolerance value: lambda values within +/-fudge will be coerced to 0 and within 1+/-fudge will be coerced to 1. how many unique values should y have to estimate the transformation? a logical value indicating whether NA values should be stripped from y and x before the computation proceeds. init, lim, method initial values, limits and optimization method for optim.... for BoxCoxTrans: options to pass to boxcox. plotit should not be passed through. For predict.boxcoxtrans, additional arguments are ignored. object newdata an object of class BoxCoxTrans or expotrans. a numeric vector of values to transform. Details BoxCoxTrans function is basically a wrapper for the boxcox function in the MASS library. It can be used to estimate the transformation and apply it to new data. expotrans estimates the exponential transformation of Manly (1976) but assumes a common mean for the data. The transformation parameter is estimated by directly maximizing the likelihood. If any(y <= 0) or if length(unique(y)) < numunique, lambda is not estimated and no transformation is applied.
15 BoxCoxTrans.default 15 Value Both functions returns a list of class of either BoxCoxTrans or expotrans with elements lambda fudge n summary ratio skewness estimated transformation value value of fudge number of data points used to estimate lambda the results of summary(y) max(y)/min(y) sample skewness statistic BoxCoxTrans also returns: fudge value of fudge The predict functions returns numeric vectors of transformed values Author(s) Max Kuhn References Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with discussion). Journal of the Royal Statistical Society B, 26, See Also Manly, B. L. (1976) Exponential data transformations. The Statistician, 25, boxcox, preprocess, optim Examples data(bloodbrain) ratio <- exp(logbbb) bc <- BoxCoxTrans(ratio) bc predict(bc, ratio[1:5]) ratio <- NA bc2 <- BoxCoxTrans(ratio, bbbdescr$tpsa, na.rm = TRUE) bc2 manly <- expotrans(ratio) manly
16 16 calibration calibration Probability Calibration Plot Description Usage For classification models, this function creates a calibration plot that describes how consistent model probabilities are with observed event rates. calibration(x,...) ## S3 method for class formula calibration(x, data = NULL, class = NULL, cuts = 11, subset = TRUE, lattice.options = NULL,...) ## S3 method for class calibration xyplot(x, data,...) panel.calibration(...) Arguments x data class cuts a lattice formula (see xyplot for syntax) where the left-hand side of the formula is a factor class variable of the observed outcome and the right-hand side specifies one or model columns corresponding to a numeric ranking variable for a model (e.g. class probabilities). The classification variable should have two levels. For calibration.formula, a data frame (or more precisely, anything that is a valid envir argument in eval, e.g., a list or an environment) containing values for any variables in the formula, as well as groups and subset if applicable. If not found in data, or if data is unspecified, the variables are looked for in the environment of the formula. This argument is not used for xyplot.calibration. a character string for the class of interest If a single number this indicates the number of splits of the data are used to create the plot. By default, it uses as many cuts as there are rows in data. If a vector, these are the actual cuts that will be used. subset An expression that evaluates to a logical or integer indexing vector. It is evaluated in data. Only the resulting rows of data are used for the plot. lattice.options A list that could be supplied to lattice.options... options to pass through to xyplot or the panel function (not used in calibration.formula).
17 calibration 17 Details Value calibration.formula is used to process the data and xyplot.calibration is used to create the plot. To construct the calibration plot, the following steps are used for each model: 1. The data are split into cuts - 1 roughly equal groups by their class probabilities 2. the number of samples with true results equal to class are determined 3. the event rate is determined for each bin xyplot.calibration produces a plot of the observed event rate by the mid-point of the bins. This implementation uses the lattice function xyplot, so plot elements can be changed via panel functions, trellis.par.set or other means. calibration uses the panel function panel.calibration by default, but it can be changed by passing that argument into xyplot.calibration. The following elements are set by default in the plot but can be changed by passing new values into xyplot.calibration: xlab = "Bin Midpoint", ylab = "Observed Event Percentage", type = "o", ylim = extendrange(c(0, 100)),xlim = extendrange(c(0, 100)) and panel = panel.calibration calibration.formula returns a list with elements: data cuts class probnames the data used for plotting the number of cuts the event class the names of the model probabilities xyplot.calibration returns a lattice object Author(s) See Also Max Kuhn, some lattice code and documentation by Deepayan Sarkar xyplot, trellis.par.set Examples ## Not run: data(mdrr) mdrrdescr <- mdrrdescr[, -nearzerovar(mdrrdescr)] mdrrdescr <- mdrrdescr[, -findcorrelation(cor(mdrrdescr),.5)] intrain <- createdatapartition(mdrrclass) trainx <- mdrrdescr[intrain[], ] trainy <- mdrrclass[intrain[]]
18 18 caretfuncs testx <- mdrrdescr[-intrain[], ] testy <- mdrrclass[-intrain[]] library(mass) ldafit <- lda(trainx, trainy) qdafit <- qda(trainx, trainy) testprobs <- data.frame(obs = testy, lda = predict(ldafit, testx)$posterior[,1], qda = predict(qdafit, testx)$posterior[,1]) calibration(obs ~ lda + qda, data = testprobs) calplotdata <- calibration(obs ~ lda + qda, data = testprobs) calplotdata xyplot(calplotdata, auto.key = list(columns = 2)) ## End(Not run) caretfuncs Backwards Feature Selection Helper Functions Description Usage Ancillary functions for backwards selection picksizetolerance(x, metric, tol = 1.5, maximize) picksizebest(x, metric, maximize) pickvars(y, size) caretfuncs lmfuncs rffuncs treebagfuncs ldafuncs nbfuncs gamfuncs lrfuncs Arguments x metric a matrix or data frame with the performance metric of interest a character string with the name of the performance metric that should be used to choose the appropriate number of variables
19 caretfuncs 19 maximize tol y size a logical; should the metric be maximized? a scalar to denote the acceptable difference in optimal performance (see Details below) a list of data frames with variables Overall and var an integer for the number of variables to retain Details This page describes the functions that are used in backwards selection (aka recursive feature elimination). The functions described here are passed to the algorithm via the functions argument of rfecontrol. See rfecontrol for details on how these functions should be defined. The pick functions are used to find the appropriate subset size for different situations. pickbest will find the position associated with the numerically best value (see the maximize argument to help define this). picksizetolerance picks the lowest position (i.e. the smallest subset size) that has no more of an X percent loss in performances. When maximizing, it calculates (O-X)/O*100, where X is the set of performance values and O is max(x). This is the percent loss. When X is to be minimized, it uses (X-O)/O*100 (so that values greater than X have a positive "loss"). The function finds the smallest subset size that has a percent loss less than tol. Both of the pick functions assume that the data are sorted from smallest subset size to largest. Author(s) See Also Max Kuhn rfecontrol, rfe Examples ## For picking subset sizes: ## Minimize the RMSE example <- data.frame(rmse = c(1.2, 1.1, 1.05, 1.01, 1.01, 1.03, 1.00), Variables = 1:7) ## Percent Loss in performance (positive) example$pctloss <- (example$rmse - min(example$rmse))/min(example$rmse)*100 xyplot(rmse ~ Variables, data= example) xyplot(pctloss ~ Variables, data= example) absolutebest <- picksizebest(example, metric = "RMSE", maximize = FALSE) within5pct <- picksizetolerance(example, metric = "RMSE", maximize = FALSE) cat("numerically optimal:", example$rmse[absolutebest], "RMSE in position", absolutebest, "\n")
20 20 caretsbf cat("accepting a 1.5 pct loss:", example$rmse[within5pct], "RMSE in position", within5pct, "\n") ## Example where we would like to maximize example2 <- data.frame(rsquared = c(0.4, 0.6, 0.94, 0.95, 0.95, 0.95, 0.95), Variables = 1:7) ## Percent Loss in performance (positive) example2$pctloss <- (max(example2$rsquared) - example2$rsquared)/max(example2$rsquared)*100 xyplot(rsquared ~ Variables, data= example2) xyplot(pctloss ~ Variables, data= example2) absolutebest2 <- picksizebest(example2, metric = "Rsquared", maximize = TRUE) within5pct2 <- picksizetolerance(example2, metric = "Rsquared", maximize = TRUE) cat("numerically optimal:", example2$rsquared[absolutebest2], "R^2 in position", absolutebest2, "\n") cat("accepting a 1.5 pct loss:", example2$rsquared[within5pct2], "R^2 in position", within5pct2, "\n") caretsbf Selection By Filtering (SBF) Helper Functions Description Usage Ancillary functions for univariate feature selection anovascores(x, y) gamscores(x, y) caretsbf lmsbf rfsbf treebagsbf ldasbf nbsbf Arguments x y a matrix or data frame of numeric predictors a numeric or factor vector of outcomes
21 cars 21 Details More details on these functions can be found at html#filter. This page documents the functions that are used in selection by filtering (SBF). The functions described here are passed to the algorithm via the functions argument of sbfcontrol. See sbfcontrol for details on how these functions should be defined. anovascores and gamscores are two examples of univariate filtering functions. anovascores fits a simple linear model between a single feature and the outcome, then the p-value for the whole model F-test is returned. gamscores fits a generalized additive model between a single predictor and the outcome using a smoothing spline basis function. A p-value is generated using the whole model test from summary.gam and is returned. If a particular model fails for lm or gam, a p-value of 1 is returned. Author(s) See Also Max Kuhn sbfcontrol, sbf, summary.gam cars Kelly Blue Book resale data for 2005 model year GM cars Description Kuiper (2008) collected data on Kelly Blue Book resale data for 804 GM cars (2005 model year). Usage data(cars) Value cars data frame of the suggested retail price (column Price) and various characteristics of each car (columns Mileage, Cylinder, Doors, Cruise, Sound, Leather, Buick, Cadillac, Chevy, Pontiac, Saab, Saturn, convertible, coupe, hatchback, sedan and wagon) Source Kuiper, S. (2008). Introduction to Multiple Regression: How Much Is Your Car Worth?, Journal of Statistics Education, Vol. 16, html
A Short Introduction to the caret Package Max Kuhn firstname.lastname@example.org August 6, 2015 The caret package (short for classification and regression training) contains functions to streamline the model training
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
JSS Journal of Statistical Software November 2008, Volume 28, Issue 5. http://www.jstatsoft.org/ Building Predictive Models in R Using the caret Package Max Kuhn Pfizer Global R&D Abstract The caret package,
Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model
Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath email@example.com National Institute of Industrial Engineering (NITIE) Vihar
Chapter 2 A Short Tour of the Predictive Modeling Process Before diving in to the formal components of model building, we present a simple example that illustrates the broad concepts of model building.
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
Package trimtrees February 20, 2015 Type Package Title Trimmed opinion pools of trees in a random forest Version 1.2 Date 2014-08-1 Depends R (>= 2.5.0),stats,randomForest,mlbench Author Yael Grushka-Cockayne,
Chapter 2 Getting started with qplot 2.1 Introduction In this chapter, you will learn to make a wide variety of plots with your first ggplot2 function, qplot(), short for quick plot. qplot makes it easy
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
Medical Information Management & Mining You Chen Jan,15, 2013 You.firstname.lastname@example.org 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
Package survpresmooth February 20, 2015 Type Package Title Presmoothed Estimation in Survival Analysis Version 1.1-8 Date 2013-08-30 Author Ignacio Lopez de Ullibarri and Maria Amalia Jacome Maintainer
1 Topic Linear Discriminant Analysis Data Mining Tools Comparison (Tanagra, R, SAS and SPSS). Linear discriminant analysis is a popular method in domains of statistics, machine learning and pattern recognition.
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
Predictive Modeling with R and the caret Package user! 23 Max Kuhn, Ph.D Pfizer Global R&D Groton, CT email@example.com Outline Conventions in R Data Splitting and Estimating Performance Data Pre-Processing
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller firstname.lastname@example.org Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
Encoding UTF-8 Type Package Depends R (>= 2.12.0),survival,splines Package smoothhr November 9, 2015 Title Smooth Hazard Ratio Curves Taking a Reference Value Version 1.0.2 Date 2015-10-29 Author Artur
Package waterfall Type Package Title Waterfall Charts Version 1.0.2 Date 2016-04-02 August 29, 2016 URL https://jameshoward.us/software/waterfall/, https://github.com/howardjp/waterfall BugReports https://github.com/howardjp/waterfall/issues
Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready
Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation
Version 1.3-8 Package polynom June 24, 2015 Title A Collection of Functions to Implement a Class for Univariate Polynomial Manipulations A collection of functions to implement a class for univariate polynomial
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
8. Excel Charts and Analysis ToolPak Charts, also known as graphs, have been an integral part of spreadsheets since the early days of Lotus 1-2-3. Charting features have improved significantly over the
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.
Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select
Nathalie Villa-Vialaneix Année 2014/2015 M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest This worksheet s aim is to learn how
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
STATISTICAL CONTRIBUTION TO THE VIRTUAL MULTICRITERIA OPTIMISATION OF COMBINATORIAL MOLECULES LIBRARIES AND TO THE VALIDATION AND APPLICATION OF QSAR MODELS CÉLINE LE BAILLY DE TILLEGHEM Institut de statistique
business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series
Type Package Package hybridensemble May 30, 2015 Title Build, Deploy and Evaluate Hybrid Ensembles Version 1.0.0 Date 2015-05-26 Imports randomforest, kernelfactory, ada, rpart, ROCR, nnet, e1071, NMOF,
Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING
Detecting Email Spam MGS 8040, Data Mining Audrey Gies Matt Labbe Tatiana Restrepo 5 December 2011 INTRODUCTION This report describes a model that may be used to improve likelihood of recognizing undesirable
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
IBM SPSS Neural Networks 22 Note Before using this information and the product it supports, read the information in Notices on page 21. Product Information This edition applies to version 22, release 0,
1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author
Fortgeschrittene Computerintensive Methoden Einheit 3: mlr - Machine Learning in R Bernd Bischl Matthias Schmid, Manuel Eugster, Bettina Grün, Friedrich Leisch Institut für Statistik LMU München SoSe 2014
DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant
Computational Assignment 4: Discriminant Analysis -Written by James Wilson -Edited by Andrew Nobel In this assignment, we will investigate running Fisher s Discriminant analysis in R. This is a powerful
Wrocław University of Technology Internet Engineering Henryk Maciejewski APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING PRACTICAL GUIDE Wrocław (2011) 1 Copyright by Wrocław University of Technology
How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Topics Exploratory Data Analysis Summary Statistics Visualization What is data exploration?
Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to
Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.
IBM SPSS Missing Values 22 Note Before using this information and the product it supports, read the information in Notices on page 23. Product Information This edition applies to version 22, release 0,