Package dsmodellingclient




August 20, 2015

Maintainer <datashield@obiba.org>
Author <datashield@obiba.org>
Version 4.1.0
License GPL-3
Title DataSHIELD client site functions for statistical modelling
Description DataSHIELD client site functions for statistical modelling
Depends opal, dsbaseclient

R topics documented:

ds.gee
ds.glm
ds.lexis
geelogindata
geelogin_remoteserver
glmlogindata
glmlogin_remoteserver
survivallogindata


ds.gee    Fits a Generalized Estimating Equation (GEE) model

Description

A function that fits generalized estimating equations to deal with correlation structures arising from repeated measures on individuals, or from clustering as in family data.

Usage

ds.gee(formula = NULL, family = NULL, data = NULL, corStructure = "ar1",
  clusterID = NULL, startCoeff = NULL, userMatrix = NULL, maxit = 20,
  checks = TRUE, display = FALSE, datasources = NULL)

Arguments

formula        a character string, the formula which describes the model to be fitted.
family         a character, the description of the error distribution: binomial, gaussian, Gamma or poisson.
data           the name of the data frame that holds the variables in the regression formula.
corStructure   a character, the correlation structure: ar1, exchangeable, independence, fixed or unstructure.
clusterID      a character, the name of the column that holds the cluster IDs.
startCoeff     a numeric vector, the starting values for the beta coefficients.
userMatrix     a list of user-defined matrices (one for each study). These matrices are required if the correlation structure is set to fixed.
maxit          an integer, the maximum number of iterations to use for convergence.
checks         a boolean, if TRUE (default) checks that take 1-3 min are carried out to verify that the variables in the model are defined (exist) on the server site and that they have the correct characteristics required to fit a GEE. If FALSE (not recommended unless you are an experienced user) only some very basic checks are carried out, and any resulting error messages might not give clear indications about the cause(s) of the error.
display        a boolean, whether or not to display the intermediate results. Default is FALSE.
datasources    a list of opal object(s) obtained after login to opal servers; these objects also hold the data assigned to R, as a data frame, from opal datasources.

Details

It enables a parallelized analysis of individual-level data sitting on distinct servers by sending commands to each data computer to fit a GEE model. The estimates returned are then combined, and the updated coefficient estimates are sent back for a new fit. This iterative process goes on until convergence is achieved. The input data should not contain missing values. The data must be in a data.frame object and the variables must be referred to through the data.frame.

Value

a list which contains the final coefficient estimates (beta values), the pooled alpha value and the pooled phi value.

Author(s)

Gaye, A.; Jones EM.
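The named correlation structures can be pictured with a few lines of base R. The sketch below is illustrative only: the helper working_corr and the value alpha = 0.5 are hypothetical and not part of the package. It builds the usual textbook working correlation matrices for a cluster of n repeated measures, which may differ in detail from what the server-side code constructs.

```r
# Hypothetical helper (not part of the package): builds a textbook working
# correlation matrix for a cluster with n measurements.
working_corr <- function(n, structure = c("ar1", "exchangeable", "independence"),
                         alpha = 0.5) {
  structure <- match.arg(structure)
  lag <- abs(outer(seq_len(n), seq_len(n), "-"))   # |i - j| for every pair
  switch(structure,
    ar1          = alpha^lag,                      # correlation decays with the lag
    exchangeable = ifelse(lag == 0, 1, alpha),     # same correlation for all pairs
    independence = diag(n))                        # no within-cluster correlation
}

R_ar1 <- working_corr(4, "ar1")
R_exc <- working_corr(4, "exchangeable")
print(round(R_ar1, 3))
print(R_exc)
```

Under ar1 the correlation between two measurements shrinks as they get further apart, whereas exchangeable assumes one common correlation, which is often the more natural choice for family clusters.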

References

Jones EM, Sheehan NA, Gaye A, Laflamme P, Burton P. Combined analysis of correlated data when data cannot be pooled. Stat 2013; 2: 72-85.

See Also

ds.glm for generalized linear models
ds.lexis for survival analysis using piecewise exponential regression

Examples

# load the login data file for the correlated data
data(geelogindata)

# login and assign all the stored variables to R
opals <- datashield.login(logins=geelogindata, assign=TRUE)

# set some parameters for the function (the rest are set to default values)
myformula <- 'response~1+sex+age.60'
myfamily <- 'binomial'
startbetas <- c(-1,1,0)
clusters <- 'id'
mycorr <- 'ar1'

# run a GEE analysis with the above specified parameters
ds.gee(data='D',formula=myformula,family=myfamily,corStructure=mycorr,clusterID=clusters,startCoeff=startbeta…

# clear the Datashield R sessions and logout
datashield.logout(opals)


ds.glm    Runs a combined GLM analysis of non-pooled data

Description

A function that fits generalized linear models.

Usage

ds.glm(formula = NULL, data = NULL, family = NULL, offset = NULL,
  weights = NULL, checks = FALSE, maxit = 15, CI = 0.95, viewiter = FALSE,
  datasources = NULL)

Arguments

formula        a character, a formula which describes the model to be fitted.
data           a character, the name of an optional data frame containing the variables in the formula. The process stops if a non-existent data frame is indicated.
family         a description of the error distribution function to use in the model.
offset         a character, NULL or a numeric vector that can be used to specify an a priori known component to be included in the linear predictor during fitting.
weights        a character, the name of an optional vector of prior weights to be used in the fitting process. Should be NULL or a numeric vector.
checks         a boolean, if TRUE checks that take 1-3 min are carried out to verify that the variables in the model are defined (exist) on the server site and that they have the correct characteristics required to fit a GLM. The default value is FALSE because checks lengthen the runtime and are mainly meant to help look for the causes of errors.
maxit          the number of iterations of IWLS used.
CI             a numeric, the confidence level used for the confidence intervals (default 0.95).
viewiter       a boolean, tells whether the results of the intermediate iterations should be printed on screen or not. Default is FALSE (i.e. only final results are shown).
datasources    a list of opal object(s) obtained after login to opal servers; these objects also hold the data assigned to R, as a data frame, from opal datasources.
startbetas     starting values for the parameters in the linear predictor.

Details

It enables a parallelized analysis of individual-level data sitting on distinct servers by sending instructions to each computer requesting non-disclosing summary statistics. The summaries are then combined to estimate the parameters of the model; these parameters are the same as those obtained if the data were physically pooled.

Value

coefficients        a named vector of coefficients.
residuals           the working residuals, that is the residuals in the final iteration of the IWLS fit.
fitted.values       the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.
rank                the numeric rank of the fitted linear model.
family              the family object used.
linear.predictors   the linear fit on the link scale.

Author(s)

Burton, P.; Gaye, A.; Laflamme, P.
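The claim that combined summaries reproduce the pooled estimates can be checked on simulated data. The sketch below is not the package's code: it assumes that the per-study summaries are the weighted cross-products X'WX and X'Wz of an IWLS step for a logistic model, and site_summary, the two simulated "studies" and the convergence threshold are all hypothetical.

```r
set.seed(1)
# Simulated stand-in for two studies; in a real analysis the rows stay on their servers.
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))
study <- rep(1:2, each = n / 2)
X <- cbind(1, x)

# Hypothetical "server-side" step: returns only aggregate summaries for the current beta.
site_summary <- function(Xs, ys, beta) {
  eta <- drop(Xs %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)               # IWLS weights for the logit link
  z   <- eta + (ys - mu) / w         # working response
  list(xwx = crossprod(Xs, w * Xs), xwz = crossprod(Xs, w * z))
}

# "Client-side" loop: add up the summaries, update beta, send it back, repeat.
beta <- c(0, 0)
for (iter in 1:15) {
  parts <- lapply(1:2, function(s)
    site_summary(X[study == s, , drop = FALSE], y[study == s], beta))
  xwx <- Reduce(`+`, lapply(parts, `[[`, "xwx"))
  xwz <- Reduce(`+`, lapply(parts, `[[`, "xwz"))
  beta_new  <- drop(solve(xwx, xwz))
  converged <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (converged) break
}

# Reference fit on the physically pooled data.
pooled <- unname(coef(glm(y ~ x, family = binomial)))
print(rbind(combined = unname(beta), pooled = pooled))
```

Because the weighted cross-products are sums over rows, adding the per-study matrices gives exactly the matrices a pooled fit would use, which is why the two rows printed at the end agree.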

See Also

ds.lexis for survival analysis using piecewise exponential regression
ds.gee for generalized estimating equation models

Examples

{
# load the file that contains the login details
data(glmlogindata)

# login and assign all the variables to R
opals <- datashield.login(logins=glmlogindata, assign=TRUE)

# Example 1: run a GLM without interaction (e.g. diabetes prediction using BMI and HDL levels and GENDER)
mod <- ds.glm(formula='D$DIS_DIAB~D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='binomial')
mod

# Example 2: run the above GLM model without an intercept
# (produces separate baseline estimates for Male and Female)
mod <- ds.glm(formula='D$DIS_DIAB~0+D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='binomial')
mod

# Example 3: run the above GLM with interaction between GENDER and PM_BMI_CONTINUOUS
mod <- ds.glm(formula='D$DIS_DIAB~D$GENDER*D$PM_BMI_CONTINUOUS+D$LAB_HDL', family='binomial')
mod

# Example 4: fit a standard Gaussian linear model with an interaction
mod <- ds.glm(formula='D$PM_BMI_CONTINUOUS~D$DIS_DIAB*D$GENDER+D$LAB_HDL', family='gaussian')
mod

# Example 5: now run a GLM where the error follows a poisson distribution
# P.S: A poisson model requires a numeric vector as outcome so in this example we first convert
# the categorical BMI, which is of type factor, into a numeric vector
ds.asnumeric('D$PM_BMI_CATEGORICAL', 'BMI.123')
mod <- ds.glm(formula='BMI.123~D$PM_BMI_CONTINUOUS+D$LAB_HDL+D$GENDER', family='poisson')
mod

# clear the Datashield R sessions and logout
datashield.logout(opals)
}


ds.lexis    Generates an expanded version of a dataset that contains survival data

Description

This function is meant to be used as part of a piecewise regression analysis.

Usage

ds.lexis(data = NULL, intervalwidth = NULL, idcol = NULL, entrycol = NULL,
  exitcol = NULL, statuscol = NULL, variables = NULL, newobj = NULL,
  datasources = NULL)

Arguments

data            a character, the name of the table that holds the original data; this is the data to be expanded.
intervalwidth   a numeric vector which gives the chosen width of the intervals ("pieces"). This can be one value (in which case all the intervals have the same width) or several different values. If no value is provided a single default value is used: one tenth of the mean of the exit time values across all the studies.
idcol           a character, the name of the column that holds the individual IDs of the subjects.
entrycol        a character, the name of the column that holds the entry times (i.e. start of follow up). If no name is provided the default is to set all the entry times to 0 in a column named "STARTTIME". A message is then printed to alert the user, as this has serious consequences if the actual entry times are not 0 for all the subjects.
exitcol         a character, the name of the column that holds the exit times (i.e. end of follow up).
statuscol       a character, the name of the column that holds the failure status of each subject, which tells whether or not a subject has been censored.
variables       a character vector, the column names of the variables (covariates) to include in the final expanded table. The input table might have a large number of covariates, and if only some of those variables are relevant for the intended analysis it makes sense to include only those. By default (i.e. if no variables are indicated) all the covariates in the input table are included, and this will lengthen the run time of the function.
newobj          the name of the output expanded table. By default the name is the name of the input table with the suffix "_expanded".
datasources     a list of opal object(s) obtained after login to opal servers; these objects also hold the data assigned to R, as a data frame, from opal datasources.

Details

It splits the survival time interval of each subject into sub-intervals and reports the failure status of the subject at each sub-interval. Each of those sub-intervals is given an id: e.g. if the overall interval of a subject is split into 4 sub-intervals, those sub-intervals have ids 1, 2, 3 and 4, so this is basically the count of periods for each subject. The interval ids are held in a column named "TIMEID". The entry and exit times in the input table are used to compute the total survival time. By default all the covariates in the input table are included in the expanded output table, but it is preferable to indicate the names of the covariates to be included via the argument variables.

Value

a data frame, an expanded version of the input table.

Author(s)

Gaye, A.
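The expansion described in the Details can be illustrated for a single subject in base R. This is a sketch of the general idea, not the ds.lexis implementation: expand_subject is a hypothetical helper, and two choices are assumptions made here, namely that follow-up beyond the last given width is collected in one final piece, and that every piece before the last is marked as not failed.

```r
# Hypothetical helper (not the ds.lexis implementation): splits one subject's
# follow-up into consecutive pieces of the given widths.
expand_subject <- function(id, entry, exit, status, widths) {
  stopifnot(exit > entry)
  total  <- exit - entry                                 # total survival time
  breaks <- c(0, cumsum(widths))
  if (total > max(breaks)) breaks <- c(breaks, total)    # assumed: remainder is one last piece
  starts <- breaks[breaks < total]                       # pieces the subject actually enters
  ends   <- pmin(c(starts[-1], total), total)
  k      <- length(starts)
  data.frame(ID = id,
             TIMEID = seq_len(k),                        # count of periods for the subject
             SURVIVALTIME = ends - starts,               # time spent in each piece
             CENS = c(rep(0, k - 1), status))            # status applies to the last piece only
}

expanded <- expand_subject(id = 1, entry = 0, exit = 4.5, status = 1, widths = c(2, 1, 3))
print(expanded)
```

A subject followed for 4.5 time units with widths 2, 1 and 3 yields three rows with times 2, 1 and 1.5, so the pieces add back up to the total survival time.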

See Also

ds.glm for generalized linear models
ds.gee for generalized estimating equation models

Examples

{
# load the file that contains the login details
data(survivallogindata)

# login and assign all the variables to R
opals <- datashield.login(logins=survivallogindata, assign=TRUE)

# this example shows how to run survival analysis in H-DataSHIELD using the piecewise exponential regression m…

# let us display the names of the variables in the original table (the table we assigned above and which by defau…
ds.colnames('D')

# specify some baseline hazard profile (i.e. the width of the intervals to be used)
bh <- c(2,1,3,0.5,1.5,2)

# expand the original table (e.g. the survival time of each individual is split into pieces equal to the interval…
# we use the function ds.lexis which expands the original table and saves the expanded table on the server site
# we set the parameter variables to NULL (default) which means include all the covariates in the expanded table…
# to indicate the variables to include if you have many variables and want to use only a subset of those.
ds.lexis(data='D', intervalwidth=bh, idcol="id", entrycol="starttime", exitcol="endtime", statuscol="cens")

# let us display the names of variables in the expanded table (by default it is the name of the original table fo…
ds.colnames('D_expanded')

# Now fit a GLM with a poisson model
# there is a direct relationship between the poisson model with a log-time offset and the exponential model so we
# use glm to fit a poisson model and include a factor for the time intervals (TIMEID) to have different rates.
# The vector SURVIVALTIME (the time elapsed between start of follow up and failure/censoring) and the vector TIME…
# which allows for different rates are generated when the initial table got expanded via the function ds.lexis.
# In the below model the log of the survival time is used as an offset (some known information to be included in t…

# generate a vector of log survival time values
ds.assign(toassign='log(D_expanded$SURVIVALTIME)', newobj='logsurvival')

# Fit the GLM - the outcome is failure status
ds.glm(formula='CENS~1+TIMEID+AGE.60+GENDER+NOISE.56+PM10.16', data='D_expanded', family='poisson', offset='lo…

# clear the Datashield R sessions and logout
datashield.logout(opals)
}
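The relationship between the Poisson model with a log-time offset and the exponential model, mentioned in the example comments, can be verified locally with base R. The sketch below uses simulated data and an intercept-only model, both of which are assumptions made for illustration; it does not call any DataSHIELD function.

```r
set.seed(42)
n      <- 200
time   <- rexp(n, rate = 0.3)      # simulated follow-up times
status <- rbinom(n, 1, 0.7)        # 1 = failure, 0 = censored (arbitrary, for illustration)

# Closed-form maximum-likelihood rate of an exponential model with censoring:
# number of failures divided by total time at risk.
rate_exp <- sum(status) / sum(time)

# Poisson GLM on the failure indicator with log(time) as an offset.
fit <- glm(status ~ 1, family = poisson, offset = log(time))
rate_pois <- unname(exp(coef(fit)))

print(c(exponential = rate_exp, poisson = rate_pois))
```

Both approaches give the same rate because the two likelihoods differ only by a term that does not involve the rate. Adding a factor for the time intervals to the Poisson model then lets the rate change from one piece to the next, which is the piecewise exponential model.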

geelogindata    Information required to login to opal servers for the GEE test data

Description

A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

data(geelogindata)

Format

A data frame where the number of servers corresponds to the number of rows

server     a character, the formal name of the study
url        URL of the opal server
user       a character, a formal username or a path to a valid ssl certificate, if required
password   a character, a formal password or a path to a valid ssl key, if required
table      a character, the path to the opal datasource that holds the data to analyse

Examples

data(geelogindata)


geelogin_remoteserver    Information required to login to opal servers for the GEE test data

Description

A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

data(geelogin_remoteserver)

Format

A data frame where the number of servers corresponds to the number of rows

server     a character, the formal name of the study
url        URL of the opal server
user       a character, a formal username or a path to a valid ssl certificate, if required
password   a character, a formal password or a path to a valid ssl key, if required
table      a character, the path to the opal datasource that holds the data to analyse

Examples

data(geelogin_remoteserver)


glmlogindata    Information required to login to opal servers for the GLM test data

Description

A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

data(glmlogindata)

Format

A data frame where the number of servers corresponds to the number of rows

server     a character, the formal name of the study
url        URL of the opal server
user       a character, a formal username or a path to a valid ssl certificate, if required
password   a character, a formal password or a path to a valid ssl key, if required
table      a character, the path to the opal datasource that holds the data to analyse

Examples

data(glmlogindata)


glmlogin_remoteserver    Information required to login to opal servers for the GLM test data

Description

A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

data(glmlogin_remoteserver)

Format

A data frame where the number of servers corresponds to the number of rows

server     a character, the formal name of the study
url        URL of the opal server
user       a character, a formal username or a path to a valid ssl certificate, if required
password   a character, a formal password or a path to a valid ssl key, if required
table      a character, the path to the opal datasource that holds the data to analyse

Examples

data(glmlogin_remoteserver)


survivallogindata    Information required to login to opal servers for the GLM test data

Description

A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

data(survivallogindata)

Format

A data frame where the number of servers corresponds to the number of rows

server     a character, the formal name of the study
url        URL of the opal server
user       a character, a formal username or a path to a valid ssl certificate, if required
password   a character, a formal password or a path to a valid ssl key, if required
table      a character, the path to the opal datasource that holds the data to analyse

Examples

data(survivallogindata)
