R software tutorial: Random Forest Clustering Applied to Renal Cell Carcinoma Steve Horvath and Tao Shi
|
|
|
- Frederick Murphy
- 9 years ago
- Views:
Transcription
1 R software tutorial: Random Forest Clustering Applied to Renal Cell Carcinoma Steve orvath and Tao Shi Correspondence: Department of uman Genetics and Biostatistics University of California, os Angeles, CA , USA. In this R software tutorial we describe some of the results underlying the following article. Shi T, Seligson D, Belldegrun AS, Palotie A, orvath S. (2005) Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma. Mod Pathol Apr;18(4): Additional References General intro to random forest Breiman. Random forests. Machine earning 2001;45(1): Breiman and Adele Cutler s random forests: The following article describes theoretical studies of RF clustering. Tao Shi and Steve orvath (2006) Unsupervised earning with Random Forest Predictors. Journal of Computational and Graphical Statistics. Volume 15, Number 1, March 2006, pp (21) The following reference describes the R implementation of random forests iaw A. and Wiener M. Classification and Regression by randomforest. R News, 2(3):18-22, December The tutorial and data can be found at the following webpage. The following webpage contains additional, theoretical material 1
2 ## The following tutorial shows how to carry out RF clustering ## using the freely available software R ## ( Before running it, you need to install ## the randomforest library, which is a contributed package in R. ## RF clustering takes 3 parameters: ## 1) number of features sampled at each split ## 2) number of forests ## 3) number of trees in each forest. ## They are implemented in the function RFdist as the options: mtry1, no.rep, ## and no.tree, respectively. ## The file FunctionsRFclustering.txt also contains other relevant functions ## such as "Rand" for the Rand index. ###################R Software and Inputting the Data#################### ## 1) To install the R software, go to ## 2) After installing R, you need to install two additional R packages: randomforest and misc ## Open R and go to menu "Packages\Install package(s) from CRAN", then ## choose randomforest. R will automatically ## install the package. When asked "Delete downloaded files (y/n)? ", answer "y". ## 3) Download the zip file containing: ## a) R function file: "FunctionsRFclustering.txt", which contains several ## R functions needed for RF clustering and results assessment ## b) A test data file: "testdata.csv" ## c) MDS coordinate file: "cmd1.csv" ## d) The tutorial file: "RFclusteringTutorial.txt" ## 4) Unzip all the files into the same directory. ## 5) Open the R software by double clicking its icon. ## 6) Open the tutorial file "RFclusteringTutorialRenalCancer.doc" in ## Microsoft Word or an alternative text editor. ## 7) Copy and paste the R commands from the tutorial into the R session. ## Comments are preceded by "#" and are automatically ignored by R. # set the working directory (where the data are located) setwd("c:/documents and Settings/shorvath/My Documents/ADAG/TaoShi/RFclustering/TutorialRCC366 ) ## load the library and ignore the warning message source("functionsrfclustering.txt") 2
3 # ere we read in the data from all 366 patients dat1 = read.table("testdata_366.csv", sep=",", header=t, row.names=1) ## This is the input file for RF clustering algorithm ## the first 8 columns contain the tumor marker data. Rows correspond to the patients ##i.e. the objects that will be clustered. datrf = dat1[,1:8] names(datrf) [1] "Marker1" "Marker2" "Marker3" "Marker4" "Marker5" "Marker6" "Marker7" "Marker8" ## Calculating RF distance between samples based on the 8 marker measurements ## This will take quite long time depends how ## many tree and repetitions you choose ## We suggest to use relatively large number of forests ## with large number of trees distrf = RFdist(datRF, mtry1=3, 4000, 100, addcl1=t,addcl2=f,imp=t, oob.prox1=t) ## Classical multidimensional scaling based on RF distance # we use 2 scaling dimensions. cmd1 = cmdscale(as.dist(distrf$cl1),2) ## Due to the randomness of the RF clustering algorithm, you may get slightly different results ## To save time, we provide the coordinates of cmd1 in the file "cmd1_366.csv", ## so we just read it in directly cmd1 = as.matrix(read.csv( cmd1_366.csv, header=t, row.names=1)) # The above line should be removed when you run the code on a new data set. 3
4 plot(cmd1,xlab= Scaling Dimension 1,ylab= Scaling Dimension 2 ) We also make use of an alternative partitioning around medoid function "pamnew" pamnew corrects the clustering membership assignment by taking account of the sillhouette strength which is standard output of pam. Specifically, the clustering membership of an observation with a negative sillhouette strength is reassigned to its neighboring cluster. #In the following, we will use PAM clustering on the 2 scaling coordinates ## PAM clustering based on the scaling coordinates RFclusterabel = pamnew(cmd1, 2) Comment for statisticians: This last step requires an explanation since it is rather unusual. A standard RF clustering analysis would use pam clustering in conjunction with the original RF dissimilarity without using scaling coordinates. Using the scaling dimensions to process the RF dissimilarity amounts to using a reduced form of the RF dissimilarity. We refer to the resulting dissimilarity as processed RF dissimilarity. In other contexts, several authors have advocated the use principal components in clustering analysis in order to amplify the signal. owever, this is a controversial step. Many statisticians would argue that the unprocessed, original RF dissimilarity contains a more meaningful signal. As discussed below, we use the unprocessed RF dissimilarity in our latest work. ere we report the old analysis to enhance reproducibility. Actually, the main reason why we used the processed RF dissimilarity is that we wanted to visualize the cluster assignment in the corresponding 2 dimensional scaling plot. # See the following plot. 4
5 ## Figure 1.A in Shi, Seligson et al (2005) # Classical MDS plot # Clear cell patients are labeled by C and non-clear cell patients are labelled by N # Patients are colored by their RF clustering memberships plot(cmd1, type="n",xlab= Scaling Dimension 1, ylab= Scaling Dimension 2 ) text(cmd1, label=ifelse(dat1$clearcell==1, "C", "N"), col= RFclusterabel) 5
6 ## The following corresponds to Figure 1.B ## Now we check survival difference between the two RF clusters ## Now we use Kaplan Meier plots to compare survival distributions. ## time denotes the survival or follow up time ## event denotes the death indicator variable ## We also report the og-rank test p value fit.diff = survdiff(surv(time, event) ~ factor(rfclusterabel), data=dat1) chisq2 = signif(1-pchisq(fit.diff$chisq,length(levels(factor(rfclusterabel)))-1), 3) fit1 = survfit(surv(time, event)~rfclusterabel, data=dat1, conf.type="log-log") plot(fit1, conf.int=f,col=c(1:3), xlab="time to death (years)", ylab="survival", main=c("k-m curves"), legend.text=c(paste("1 og Rank p value=", chisq2), 2)); K-M curves Survival og Rank p value= Time to death (years) Comment: the p-value is less significant than that reported in our article. Main reason: in the article, we used a slightly different clustering procedure (different from pamnew). In this tutorial, we strive to be consistent with other analyses and hence use pamnew. If you want the details of the original clustering analysis, please contact [email protected] 6
7 # For each patient, we define its clear cell status (cancer type) ClearCellStatus= ifelse(dat1$clearcell==1, "Clear", "Non-Clear") ## cluster 1 is enriched with clear cell patients and cluster 2 is enriched with ##non-clear cell patients (Table Supp1) chisq.test(print(table(rfclusterabel, ClearCellStatus))) ClearCellStatus RFclusterabel Clear Non-Clear Pearson's Chi-squared test with Yates' continuity correction data: print(table(rfclusterabel, ClearCellStatus)) X-squared = , df = 1, p-value < 2.2e-16 # Comment: note that the RF cluster label disagrees with the clear cell status on 23+7=30 patients. #What if we used original pam clustering for the RF dissimilarity (instead of pamnew)? RFclusterabelOriginalPAM = pam(cmd1, 2)$clustering table(rfclusterabeloriginalpam, ClearCellStatus ) ClearCellStatus RFclusterabelOriginalPAM Clear Non-Clear # This leads to 28 misclassifications. 7
8 ## Next, we compare RF clustering results with clustering results ## based on a processed version of the Euclidean distance. ## Specifically, to arrive at an unbiased comparison with ## the RF clustering method, we also use the cmdscale processing ## Euclidean distance Euclidclusterabel = pamnew(cmdscale(dist(datrf),2),2) table(euclidclusterabel, ClearCellStatus ) ClearCellStatus Euclidclusterabel Clear Non-Clear # Comment: note that the Euclidean cluster label disagrees with the clear cell status on 38+5=43 patients. ere the RF dissimilarity seems to do better. #Note that in the above analysis, we used processed RF dissimilarity measures (i.e. we used multi- #dimensional scaling ). In the following we briefly report the finding of using the original, #unprocessed Euclidean distance. UnprocessedEuclidclusterabel= pamnew(dist(datrf),2) table(unprocessedeuclidclusterabel, ClearCellStatus ) UnprocessedEuclidclusterabel Clear Non-Clear #Now there are 40 misassignments. Thus, processing does not make a big difference for the Euclidean distance. If anything, it makes the clustering result worse. # What would happen if we used the original PAM clustering procedure (instead of pamnew)? UnprocessedEuclidclusterabelOrignalPAM= pam(dist(datrf),2)$clustering table(unprocessedeuclidclusterabelorignalpam, ClearCellStatus ) ClearCellStatus UnprocessedEuclidclusterabelOrignalPAM Clear Non-Clear #Now there are 44+4=48 misclassifications. 8
9 ##=============================================================## ## ## ## Now we restrict the analysis to the 307 Clear Cell patients ## ## ## ##=============================================================## ## read in the data set. 307 clear cell patients dat1 = read.table("testdata_307.csv", sep=",", header=t, row.names=1) ## This is the input file for RF clustering algorithm datrf = dat1[,1:8] ## Calculating RF distance between samples based on the 8 marker measurements ## This will take quite long time depends how many tree and repetitions you choose ## We suggest to use relatively large number of forests with large number of trees distrf = RFdist(datRF, mtry1=3, 4000, 100, addcl1=t,addcl2=f,imp=t, oob.prox1=t) ## Multidimensional scaling based on RF distance cmd1 = cmdscale(as.dist(distrf$cl1),2) ## To save time, I've provided the coordinates of cmd1 in the file "cmd1.csv", #so we just read it in directly cmd1 = as.matrix(read.csv("cmd1_307.csv", header=t, row.names=1)) ## PAM clustering based on the scaling coordinates RFclusterabel = pamnew(cmd1, 3) ## check survival difference ## variables "time" and "event" in dat1 are survival time and cencering indicator, respectively ## og-rank p value = 3.63e-08! fit.diff = survdiff(surv(time, event) ~ factor(rfclusterabel), data=dat1) chisq2 = signif(1-pchisq(fit.diff$chisq,length(levels(factor(rfclusterabel)))-1), 3) fit1 = survfit(surv(time, event)~rfclusterabel, data=dat1, conf.type="log-log") plot(fit1, conf.int=f,col=c(1:3), xlab="time to death (years)", ylab="survival", main=c("k-m curves"), legend.text=c(paste("1 og Rank p value=", chisq2), 2,3)); 9
10 ## Note that clusters 1(black) and 2(red) have similar surivival distributions. ## Therefore, we combine the two clusters and create a new cluster label label2 = 1+(RFclusterabel>2) 10
11 ## Check survival difference between the resulting RF cluster labels. ## og-rank p value = 4.82e-09! (Fig 2.B) fit.diff = survdiff(surv(time, event) ~ factor(label2), data=dat1) chisq2 = signif(1-pchisq(fit.diff$chisq,length(levels(factor(label2)))-1), 3) fit1 = survfit(surv(time, event)~label2, data=dat1, conf.type="log-log") plot(fit1, conf.int=f,col=c(1:3), xlab="time to death (years)", ylab="survival", main=c("k-m curves"), legend.text=c(paste("1 og Rank p value=", chisq2), 2)); K-M curves Survival og Rank p value= 4.82e Time to death (years) 11
12 12 ## cluster 1 is enriched with low-grade patients (grade<3) and cluster 2 is enriched #with high-grade patients chisq.test(print(table(label2, dat1$grade>2))) label2 FASE TRUE Pearson's Chi-squared test with Yates' continuity correction data: print(table(label2, dat1$grade > 2)) X-squared = , df = 1, p-value = 1.605e-05 ## MDS plot, The patients are labeled by '' for high-grade patients or '' for low-grade patients ## and colored by their RF clustering memberships plot(cmd1,xlab= Scaling Dimension 1,ylab= Scaling Dimension 2, type="n") text(cmd1, label=ifelse(dat1$grade>2, "", ""), col=label2) Scaling Dimension 1 Scaling Dimension 2
13 ## Comparing RF clustering results with clustering results based on Euclidean distance Euclidclusterabel = pamnew(cmdscale(dist(datrf),2),2) ## Euclidean distance ## Notice RF clustering gives lowest number of missclassified pateints table(label2, ifelse(dat1$grade>2, "igh", "ow")) label2 igh ow table(euclidclusterabel, ifelse(dat1$grade>2, "igh", "ow")) Euclidclusterabel igh ow ## check survival difference using Euclidean distance based clusters ## here we use 3 clusters label4 = pamnew(datrf, 3) ## check survival difference fit.diff = survdiff(surv(time, event) ~ factor(label4), data=dat1) chisq2 = signif(1-pchisq(fit.diff$chisq,length(levels(factor(label4)))-1), 3) fit1 = survfit(surv(time, event)~label4, data=dat1, conf.type="log-log") plot(fit1, conf.int=f,col=c(1:3), xlab="time to death (years)", ylab="survival", main=c("k-m curves"), legend.text=c(paste("1 og Rank p value=", chisq2), 2,3)); K-M curves Survival og Rank p value= Time to death (years) ## Thus when using the Euclidean distance on the original data, the resulting KM curves are not significantly different. 13
14 ## Now we use standard PAM clustering in conjunction with the Euclidean distance label5 = pam(dist(datrf), 3)$clustering ## check survival difference fit.diff = survdiff(surv(time, event) ~ factor(label5), data=dat1) chisq2 = signif(1-pchisq(fit.diff$chisq,length(levels(factor(label5)))-1), 3) fit1 = survfit(surv(time, event)~label5, data=dat1, conf.type="log-log") plot(fit1, conf.int=f,col=c(1:3), xlab="time to death (years)", ylab="survival", main=c("k-m curves"), legend.text=c(paste("1 og Rank p value=", chisq2), 2,3)); K-M curves Survival og Rank p value= Time to death (years) ## The resulting KM curves based on the Euclidean distance are not significantly different. 14
15 Discussion Cluster analysis will always remain somewhat of an art form. Myriads of clustering procedures have been developed. Different from the case of supervised learning methods, it is difficult to gather empirical evidence that one procedure is superior over another. While our analysis provides evidence that the RF dissimilarity is superior to the Euclidean distance in this application, it is reassuring that the 2 dissimilarities provide similar results on many other (unreported) data sets. The difference between the RF dissimilarity and the Euclidean distance are highlighted in Shi and orvath (2006). In this theory paper, we present a slightly different and more thorough RF analysis of the 307 clear cell data. Main difference: we report the RF cluster analysis using the unprocessed, original RF dissimilarity. A corresponding tutorial can be found here We are always glad to hear from successes and failures of the RF dissimilarity. Please edits and suggestions to [email protected] 15
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
Classification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Linda Staub & Alexandros Gekenidis
Seminar in Statistics: Survival Analysis Chapter 2 Kaplan-Meier Survival Curves and the Log- Rank Test Linda Staub & Alexandros Gekenidis March 7th, 2011 1 Review Outcome variable of interest: time until
M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest
Nathalie Villa-Vialaneix Année 2014/2015 M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest This worksheet s aim is to learn how
Didacticiel Études de cas
1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see
Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller
Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller Most of you want to use R to analyze data. However, while R does have a data editor, other programs such as excel are often better
Chapter 19 The Chi-Square Test
Tutorial for the integration of the software R with introductory statistics Copyright c Grethe Hystad Chapter 19 The Chi-Square Test In this chapter, we will discuss the following topics: We will plot
Prediction Analysis of Microarrays in Excel
New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
Regression Clustering
Chapter 449 Introduction This algorithm provides for clustering in the multiple regression setting in which you have a dependent variable Y and one or more independent variables, the X s. The algorithm
Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
K-Means Clustering Tutorial
K-Means Clustering Tutorial By Kardi Teknomo,PhD Preferable reference for this tutorial is Teknomo, Kardi. K-Means Clustering Tutorials. http:\\people.revoledu.com\kardi\ tutorial\kmean\ Last Update: July
Hierarchical Clustering Analysis
Hierarchical Clustering Analysis What is Hierarchical Clustering? Hierarchical clustering is used to group similar objects into clusters. In the beginning, each row and/or column is considered a cluster.
Structural Health Monitoring Tools (SHMTools)
Structural Health Monitoring Tools (SHMTools) Getting Started LANL/UCSD Engineering Institute LA-CC-14-046 c Copyright 2014, Los Alamos National Security, LLC All rights reserved. May 30, 2014 Contents
With this document you will be able to download and install all of the components needed for a full installation of PCRASTER for windows.
PCRaster Software downloading step by step: With this document you will be able to download and install all of the components needed for a full installation of PCRASTER for windows. First you must download
IBM SPSS Statistics 20 Part 1: Descriptive Statistics
CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 1: Descriptive Statistics Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the
Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study
Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study The data for this study is taken from experiment GSE848 from the Gene Expression
Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:
Doing Multiple Regression with SPSS Multiple Regression for Data Already in Data Editor Next we want to specify a multiple regression analysis for these data. The menu bar for SPSS offers several options:
Package trimtrees. February 20, 2015
Package trimtrees February 20, 2015 Type Package Title Trimmed opinion pools of trees in a random forest Version 1.2 Date 2014-08-1 Depends R (>= 2.5.0),stats,randomForest,mlbench Author Yael Grushka-Cockayne,
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com [email protected]
IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA
CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the
Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups
Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)
Tutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
Model Selection. Introduction. Model Selection
Model Selection Introduction This user guide provides information about the Partek Model Selection tool. Topics covered include using a Down syndrome data set to demonstrate the usage of the Partek Model
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012
PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping Version 1.0, Oct 2012 This document describes PaRFR, a Java package that implements a parallel random
Introduction Course in SPSS - Evening 1
ETH Zürich Seminar für Statistik Introduction Course in SPSS - Evening 1 Seminar für Statistik, ETH Zürich All data used during the course can be downloaded from the following ftp server: ftp://stat.ethz.ch/u/sfs/spsskurs/
Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms
Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
Exploratory data analysis for microarray data
Eploratory data analysis for microarray data Anja von Heydebreck Ma Planck Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany [email protected] Visualization
Tutorial 2 Online and offline Ship Visualization tool Table of Contents
Tutorial 2 Online and offline Ship Visualization tool Table of Contents 1.Tutorial objective...2 1.1.Standard that will be used over this document...2 2. The online tool...2 2.1.View all records...3 2.2.Search
Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
InfiniteInsight 6.5 sp4
End User Documentation Document Version: 1.0 2013-11-19 CUSTOMER InfiniteInsight 6.5 sp4 Toolkit User Guide Table of Contents Table of Contents About this Document 3 Common Steps 4 Selecting a Data Set...
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
Tutorial Segmentation and Classification
MARKETING ENGINEERING FOR EXCEL TUTORIAL VERSION 1.0.8 Tutorial Segmentation and Classification Marketing Engineering for Excel is a Microsoft Excel add-in. The software runs from within Microsoft Excel
Supervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
Introduction to R Statistical Software
Introduction to R Statistical Software Anthony (Tony) R. Olsen USEPA ORD NHEERL Western Ecology Division Corvallis, OR 97333 (541) 754-4790 [email protected] What is R? A language and environment for
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
# load in the files containing the methyaltion data and the source # code containing the SSRPMM functions
################ EXAMPLE ANALYSES TO ILLUSTRATE SS-RPMM ######################## # load in the files containing the methyaltion data and the source # code containing the SSRPMM functions # Note, the SSRPMM
R Commander Tutorial
R Commander Tutorial Introduction R is a powerful, freely available software package that allows analyzing and graphing data. However, for somebody who does not frequently use statistical software packages,
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
FIRST STEPS WITH SCILAB
powered by FIRST STEPS WITH SCILAB The purpose of this tutorial is to get started using Scilab, by discovering the environment, the main features and some useful commands. Level This work is licensed under
SAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
CE 490 Transportation Engineering Network Analysis
CE 490 Transportation Engineering Network Analysis The purpose of this assignment is to conduct a network analysis and gain some experience in emerging microcomputer analysis techniques. 1) Select a network
Package bigrf. February 19, 2015
Version 0.1-11 Date 2014-05-16 Package bigrf February 19, 2015 Title Big Random Forests: Classification and Regression Forests for Large Data Sets Maintainer Aloysius Lim OS_type
Visualization of Phylogenetic Trees and Metadata
Visualization of Phylogenetic Trees and Metadata November 27, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com [email protected]
Simple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
Basic -- Manage Your Bank Account and Your Budget
Basic -- Manage Your Bank Account and Your Budget This tutorial is intended as a quick overview to show you how to set up a budget file for basic account management as well as budget management. (See the
Website Development Komodo Editor and HTML Intro
Website Development Komodo Editor and HTML Intro Introduction In this Assignment we will cover: o Use of the editor that will be used for the Website Development and Javascript Programming sections of
COS 116 The Computational Universe Laboratory 9: Virus and Worm Propagation in Networks
COS 116 The Computational Universe Laboratory 9: Virus and Worm Propagation in Networks You learned in lecture about computer viruses and worms. In this lab you will study virus propagation at the quantitative
COC131 Data Mining - Clustering
COC131 Data Mining - Clustering Martin D. Sykora [email protected] Tutorial 05, Friday 20th March 2009 1. Fire up Weka (Waikako Environment for Knowledge Analysis) software, launch the explorer window
Supervised DNA barcodes species classification: analysis, comparisons and results. Tutorial. Citations
Supervised DNA barcodes species classification: analysis, comparisons and results Emanuel Weitschek, Giulia Fiscon, and Giovanni Felici Citations If you use this procedure please cite: Weitschek E, Fiscon
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 [email protected] What is Learning? "Learning denotes changes in a system that enable
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: [email protected]
Chapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
This means that any user from the testing domain can now logon to Cognos 8 (and therefore Controller 8 etc.).
ChaseReferrals and multidomaintrees Graphical explanation of the difference Imagine your Active Directory network looked as follows: Then imagine that you have installed your Controller report server inside
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize
A General Framework for Weighted Gene Co-expression Network Analysis
Please cite: Statistical Applications in Genetics and Molecular Biology (2005). A General Framework for Weighted Gene Co-expression Network Analysis Bin Zhang and Steve Horvath Departments of Human Genetics
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
Beginner s Matlab Tutorial
Christopher Lum [email protected] Introduction Beginner s Matlab Tutorial This document is designed to act as a tutorial for an individual who has had no prior experience with Matlab. For any questions
Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs
1.1 Introduction Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs For brevity, the Lavastorm Analytics Library (LAL) Predictive and Statistical Analytics Node Pack will be
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Minitab Tutorials for Design and Analysis of Experiments. Table of Contents
Table of Contents Introduction to Minitab...2 Example 1 One-Way ANOVA...3 Determining Sample Size in One-way ANOVA...8 Example 2 Two-factor Factorial Design...9 Example 3: Randomized Complete Block Design...14
Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING
Wrocław University of Technology Internet Engineering Henryk Maciejewski APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING PRACTICAL GUIDE Wrocław (2011) 1 Copyright by Wrocław University of Technology
TakeMySelfie ios App Documentation
TakeMySelfie ios App Documentation What is TakeMySelfie ios App? TakeMySelfie App allows a user to take his own picture from front camera. User can apply various photo effects to the front camera. Programmers
Classification Techniques (1)
10 10 Overview Classification Techniques (1) Today Classification Problem Classification based on Regression Distance-based Classification (KNN) Net Lecture Decision Trees Classification using Rules Quality
Predictive Data modeling for health care: Comparative performance study of different prediction models
Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath [email protected] National Institute of Industrial Engineering (NITIE) Vihar
Guide for Data Visualization and Analysis using ACSN
Guide for Data Visualization and Analysis using ACSN ACSN contains the NaviCell tool box, the intuitive and user- friendly environment for data visualization and analysis. The tool is accessible from the
Time series clustering and the analysis of film style
Time series clustering and the analysis of film style Nick Redfern Introduction Time series clustering provides a simple solution to the problem of searching a database containing time series data such
Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501
CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups
Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering
IEICE Transactions on Information and Systems, vol.e96-d, no.3, pp.742-745, 2013. 1 Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons
Personalized Predictive Medicine and Genomic Clinical Trials
Personalized Predictive Medicine and Genomic Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://brb.nci.nih.gov brb.nci.nih.gov Powerpoint presentations
IBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
Learn how to create web enabled (browser) forms in InfoPath 2013 and publish them in SharePoint 2013. InfoPath 2013 Web Enabled (Browser) forms
Learn how to create web enabled (browser) forms in InfoPath 2013 and publish them in SharePoint 2013. InfoPath 2013 Web Enabled (Browser) forms InfoPath 2013 Web Enabled (Browser) forms Creating Web Enabled
Specific Usage of Visual Data Analysis Techniques
Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia
There are six different windows that can be opened when using SPSS. The following will give a description of each of them.
SPSS Basics Tutorial 1: SPSS Windows There are six different windows that can be opened when using SPSS. The following will give a description of each of them. The Data Editor The Data Editor is a spreadsheet
Oracle Data Miner (Extension of SQL Developer 4.0)
An Oracle White Paper September 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Integrate Oracle R Enterprise Mining Algorithms into a workflow using the SQL Query node Denny Wong Oracle Data Mining
Statgraphics Getting started
Statgraphics Getting started The aim of this exercise is to introduce you to some of the basic features of the Statgraphics software. Starting Statgraphics 1. Log in to your PC, using the usual procedure
STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc.
STATGRAPHICS Online Statistical Analysis and Data Visualization System Revised 6/21/2012 Copyright 2012 by StatPoint Technologies, Inc. All rights reserved. Table of Contents Introduction... 1 Chapter
In this tutorial, we try to build a roc curve from a logistic regression.
Subject In this tutorial, we try to build a roc curve from a logistic regression. Regardless the software we used, even for commercial software, we have to prepare the following steps when we want build
STATISTICAL DATA ANALYSIS
STATISTICAL DATA ANALYSIS INTRODUCTION Fethullah Karabiber YTU, Fall of 2012 The role of statistical analysis in science This course discusses some statistical methods, which involve applying statistical
Statistical Applications in Genetics and Molecular Biology
Statistical Applications in Genetics and Molecular Biology Volume 4, Issue 1 2005 Article 17 A General Framework for Weighted Gene Co-Expression Network Analysis Bin Zhang Steve Horvath Departments of
SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis
SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis Goal: This tutorial introduces several websites and tools useful for determining linkage disequilibrium
Tutorial for the WGCNA package for R I. Network analysis of liver expression data in female mice
Tutorial for the WGCNA package for R I. Network analysis of liver expression data in female mice 2.c Dealing with large data sets: block-wise network construction and module detection Peter Langfelder
