Package dsmodellingclient


August 20, 2015

Title        DataSHIELD client site functions for statistical modelling
Description  DataSHIELD client site functions for statistical modelling
Maintainer
Author
Version
License      GPL-3
Depends      opal, dsbaseclient

R topics documented:

    ds.gee
    ds.glm
    ds.lexis
    geelogindata
    geelogin_remoteserver
    glmlogindata
    glmlogin_remoteserver
    survivallogindata


ds.gee    Fits a Generalized Estimating Equation (GEE) model

Description

    A function that fits generalized estimating equations to deal with correlation
    structures arising from repeated measures on individuals, or from clustering as in
    family data.

Usage

    ds.gee(formula = NULL, family = NULL, data = NULL, corStructure = "ar1",
           clusterID = NULL, startCoeff = NULL, userMatrix = NULL, maxit = 20,
           checks = TRUE, display = FALSE, datasources = NULL)

Arguments

    formula         a character string, the formula which describes the model to be fitted.
    family          a character, the description of the error distribution: binomial,
                    gaussian, Gamma or poisson.
    data            the name of the data frame that holds the variables in the regression
                    formula.
    corStructure    a character, the correlation structure: ar1, exchangeable,
                    independence, fixed or unstructured.
    clusterID       a character, the name of the column that holds the cluster IDs.
    startCoeff      a numeric vector, the starting values for the beta coefficients.
    userMatrix      a list of user-defined matrices (one for each study). These matrices
                    are required if the correlation structure is set to fixed.
    maxit           an integer, the maximum number of iterations to use for convergence.
    checks          a boolean; if TRUE (default), checks that take 1-3 minutes are carried
                    out to verify that the variables in the model are defined (exist) on
                    the server site and that they have the characteristics required to fit
                    a GEE. If FALSE (not recommended unless you are an experienced user),
                    only some very basic checks are carried out and any error messages
                    might not give clear indications about the cause(s) of the error.
    display         a boolean, whether or not to display the intermediate results.
                    Default is FALSE.
    datasources     a list of opal object(s) obtained after login to opal servers; these
                    objects also hold the data assigned to R, as a data frame, from opal
                    datasources.

Details

    It enables a parallelized analysis of individual-level data sitting on distinct
    servers by sending commands to each data computer to fit a GEE model. The estimates
    returned are then combined, and the updated coefficient estimates are sent back for a
    new fit. This iterative process goes on until convergence is achieved. The input data
    should not contain missing values. The data must be in a data.frame object and the
    variables must be referred to through the data.frame.

Value

    a list which contains the final coefficient estimates (beta values), the pooled alpha
    value and the pooled phi value.

Author(s)

    Gaye, A.; Jones EM.
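To make the ar1, exchangeable and independence correlation structures more concrete, the sketch below builds the corresponding working correlation matrix for a single cluster in base R. It is only an illustration, not the package's implementation: the helper name `working_corr` and the value chosen for the correlation parameter `alpha` are assumptions.

```r
# Hypothetical helper (not part of dsmodellingclient): builds the working
# correlation matrix for one cluster of size n under three common structures.
working_corr <- function(n, structure = c("ar1", "exchangeable", "independence"),
                         alpha = 0) {
  structure <- match.arg(structure)
  switch(structure,
    # ar1: correlation decays as alpha^|i - j| with the distance between measures
    ar1 = alpha^abs(outer(seq_len(n), seq_len(n), "-")),
    # exchangeable: the same correlation alpha between any two measures
    exchangeable = {
      m <- matrix(alpha, n, n)
      diag(m) <- 1
      m
    },
    # independence: no within-cluster correlation
    independence = diag(n)
  )
}

R_ar1 <- working_corr(4, "ar1", alpha = 0.5)
R_exc <- working_corr(4, "exchangeable", alpha = 0.5)
R_ind <- working_corr(4, "independence")
print(R_ar1)
```

In a GEE fit the parameter alpha is estimated from the data rather than fixed in advance; the pooled alpha value returned by the function plays that role.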

References

    Jones EM, Sheehan NA, Gaye A, Laflamme P, Burton P. Combined analysis of correlated
    data when data cannot be pooled. Stat 2013; 2:

See Also

    ds.glm for generalized linear models
    ds.lexis for survival analysis using piecewise exponential regression

Examples

    # load the login data file for the correlated data
    data(geelogindata)

    # login and assign all the stored variables to R
    opals <- datashield.login(logins=geelogindata, assign=TRUE)

    # set some parameters for the function (the rest are set to default values)
    myformula <- 'response~1+sex+age.60'
    myfamily <- 'binomial'
    startbetas <- c(-1,1,0)
    clusters <- 'id'
    mycorr <- 'ar1'

    # run a GEE analysis with the above specified parameters
    ds.gee(data='D', formula=myformula, family=myfamily, corStructure=mycorr,
           clusterID=clusters, startCoeff=startbetas)

    # clear the Datashield R sessions and logout
    datashield.logout(opals)


ds.glm    Runs a combined GLM analysis of non-pooled data

Description

    A function that fits generalized linear models.

Usage

    ds.glm(formula = NULL, data = NULL, family = NULL, offset = NULL, weights = NULL,
           checks = FALSE, maxit = 15, CI = 0.95, viewiter = FALSE, datasources = NULL)

Arguments

    formula         a character, a formula which describes the model to be fitted.
    data            a character, the name of an optional data frame containing the
                    variables in the formula. The process stops if a non-existing data
                    frame is indicated.
    family          a description of the error distribution function to use in the model.
    offset          a character, NULL or a numeric vector that can be used to specify an
                    a priori known component to be included in the linear predictor
                    during fitting.
    weights         a character, the name of an optional vector of prior weights to be
                    used in the fitting process. Should be NULL or a numeric vector.
    checks          a boolean; if TRUE, checks that take 1-3 minutes are carried out to
                    verify that the variables in the model are defined (exist) on the
                    server site and that they have the characteristics required to fit a
                    GLM. The default value is FALSE because checks lengthen the runtime
                    and are mainly meant to help look for the causes of eventual errors.
    maxit           the number of iterations of IWLS used.
    CI              a numeric, the confidence level of the confidence interval.
    viewiter        a boolean, tells whether the results of the intermediate iterations
                    should be printed on screen or not. Default is FALSE (i.e. only final
                    results are shown).
    datasources     a list of opal object(s) obtained after login to opal servers; these
                    objects also hold the data assigned to R, as a data frame, from opal
                    datasources.
    startbetas      starting values for the parameters in the linear predictor.

Details

    It enables a parallelized analysis of individual-level data sitting on distinct
    servers by sending instructions to each computer requesting non-disclosing summary
    statistics. The summaries are then combined to estimate the parameters of the model;
    these parameters are the same as those obtained if the data were physically pooled.

Value

    coefficients        a named vector of coefficients.
    residuals           the working residuals, that is the residuals in the final
                        iteration of the IWLS fit.
    fitted.values       the fitted mean values, obtained by transforming the linear
                        predictors by the inverse of the link function.
    rank                the numeric rank of the fitted linear model.
    family              the family object used.
    linear.predictors   the linear fit on the link scale.

Author(s)

    Burton, P.; Gaye, A.; Laflamme, P.
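The claim that combining non-disclosing summaries yields the same estimates as pooling the data can be checked locally, without any opal server. The sketch below is an assumption-laden illustration in base R, not the package's code: it simulates two "studies", lets each one return only the aggregate IWLS matrices X'WX and X'Wz for a logistic model, sums them, and compares the result with an ordinary `glm()` fit on the pooled data. The function name `site_summary` and the simulated variables are hypothetical.

```r
set.seed(1)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))
study <- rep(c("A", "B"), each = n / 2)
sites <- split(data.frame(y = y, x = x), study)

# What one "server" would return for the current beta: aggregate matrices only,
# never individual-level rows.
site_summary <- function(d, beta) {
  X   <- cbind(1, d$x)
  eta <- drop(X %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)               # IWLS weights for the logit link
  z   <- eta + (d$y - mu) / w        # working response
  list(XtWX = crossprod(X, w * X), XtWz = crossprod(X, w * z))
}

# Client side: sum the per-study summaries and update beta until it stabilises.
beta <- c(0, 0)
for (iter in 1:25) {
  parts <- lapply(sites, site_summary, beta = beta)
  XtWX  <- Reduce(`+`, lapply(parts, `[[`, "XtWX"))
  XtWz  <- Reduce(`+`, lapply(parts, `[[`, "XtWz"))
  new_beta  <- drop(solve(XtWX, XtWz))
  converged <- max(abs(new_beta - beta)) < 1e-10
  beta <- new_beta
  if (converged) break
}

# Reference fit on the physically pooled data
pooled <- unname(coef(glm(y ~ x, family = binomial)))
print(rbind(combined = beta, pooled = pooled))
```

Because the weighted cross-products are additive across studies, the summed matrices are identical to those computed on the pooled data, which is why the two rows printed above agree.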

See Also

    ds.lexis for survival analysis using piecewise exponential regression
    ds.gee for generalized estimating equation models

Examples

    # load the file that contains the login details
    data(glmlogindata)

    # login and assign all the variables to R
    opals <- datashield.login(logins=glmlogindata, assign=TRUE)

    # Example 1: run a GLM without interaction
    # (e.g. diabetes prediction using BMI and HDL levels and GENDER)
    mod <- ds.glm(formula='D$DIS_DIAB~D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL',
                  family='binomial')
    mod

    # Example 2: run the above GLM model without an intercept
    # (produces separate baseline estimates for Male and Female)
    mod <- ds.glm(formula='D$DIS_DIAB~0+D$GENDER+D$PM_BMI_CONTINUOUS+D$LAB_HDL',
                  family='binomial')
    mod

    # Example 3: run the above GLM with interaction between GENDER and PM_BMI_CONTINUOUS
    mod <- ds.glm(formula='D$DIS_DIAB~D$GENDER*D$PM_BMI_CONTINUOUS+D$LAB_HDL',
                  family='binomial')
    mod

    # Example 4: fit a standard Gaussian linear model with an interaction
    mod <- ds.glm(formula='D$PM_BMI_CONTINUOUS~D$DIS_DIAB*D$GENDER+D$LAB_HDL',
                  family='gaussian')
    mod

    # Example 5: now run a GLM where the error follows a poisson distribution
    # P.S: A poisson model requires a numeric vector as outcome so in this example we
    # first convert the categorical BMI, which is of type factor, into a numeric vector
    ds.asnumeric('D$PM_BMI_CATEGORICAL', 'BMI.123')
    mod <- ds.glm(formula='BMI.123~D$PM_BMI_CONTINUOUS+D$LAB_HDL+D$GENDER',
                  family='poisson')
    mod

    # clear the Datashield R sessions and logout
    datashield.logout(opals)


ds.lexis    Generates an expanded version of a dataset that contains survival data

Description

    This function is meant to be used as part of a piecewise regression analysis.

Usage

    ds.lexis(data = NULL, intervalwidth = NULL, idcol = NULL, entrycol = NULL,
             exitcol = NULL, statuscol = NULL, variables = NULL, newobj = NULL,
             datasources = NULL)

Arguments

    data            a character, the name of the table that holds the original data; this
                    is the data to be expanded.
    intervalwidth   a numeric vector which gives the chosen width of the intervals
                    ("pieces"). This can be one value (in which case all the intervals
                    have the same width) or several different values. If no values are
                    provided, a single default value is used. That default value is set
                    to 1/10th of the mean of the exit time values across all the studies.
    idcol           a character, the name of the column that holds the individual IDs of
                    the subjects.
    entrycol        a character, the name of the column that holds the entry times (i.e.
                    start of follow up). If no name is provided, the default is to set
                    all the entry times to 0 in a column named "STARTTIME". A message is
                    then printed to alert the user, as this has serious consequences if
                    the actual entry times are not 0 for all the subjects.
    exitcol         a character, the name of the column that holds the exit times (i.e.
                    end of follow up).
    statuscol       a character, the name of the column that holds the failure status of
                    each subject; it tells whether or not a subject has been censored.
    variables       a character vector, the column names of the variables (covariates) to
                    include in the final expanded table. The input table might have a
                    large number of covariates, and if only some of those variables are
                    relevant for the analysis it makes sense to include only those. By
                    default (i.e. if no variables are indicated) all the covariates in
                    the input table are included, and this will lengthen the run time of
                    the function.
    newobj          the name of the output expanded table. By default the name is the
                    name of the input table with the suffix "_expanded".
    datasources     a list of opal object(s) obtained after login to opal servers; these
                    objects also hold the data assigned to R, as a data frame, from opal
                    datasources.

Details

    It splits the survival interval time of subjects into sub-intervals and reports the
    failure status of the subjects at each sub-interval. Each of those sub-intervals is
    given an id, e.g. if the overall interval of a subject is split into 4 sub-intervals,
    those sub-intervals have ids 1, 2, 3 and 4; so this is basically the count of periods
    for each subject. The interval ids are held in a column named "TIMEID". The entry and
    exit times in the input table are used to compute the total survival time. By default
    all the covariates in the input table are included in the expanded output table, but
    it is preferable to indicate the names of the covariates to be included via the
    argument variables.

Value

    a data frame, an expanded version of the input table.

Author(s)

    Gaye, A.
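The splitting of one subject's follow-up into numbered sub-intervals can be sketched locally in base R. This is a simplified illustration under stated assumptions, not the package's implementation: the helper `expand_subject`, the single fixed width, and the output column names ID and STATUS are hypothetical, while TIMEID and SURVIVALTIME follow the names used in this manual. It also assumes the exit time is strictly greater than the entry time.

```r
# Hypothetical helper (not the package's code): split one subject's follow-up
# into pieces of a fixed width and carry the failure status on the last piece.
# Assumes exit > entry.
expand_subject <- function(id, entry, exit, status, width) {
  starts <- seq(entry, exit, by = width)
  starts <- starts[starts < exit]            # drop a zero-length final piece
  ends   <- pmin(starts + width, exit)
  data.frame(
    ID           = id,
    TIMEID       = seq_along(starts),        # count of periods for the subject
    SURVIVALTIME = ends - starts,            # time at risk within the piece
    STATUS       = c(rep(0, length(starts) - 1), status)
  )
}

expanded <- expand_subject(id = 1, entry = 0, exit = 7, status = 1, width = 2)
print(expanded)
```

Here a subject followed from time 0 to time 7 with width 2 yields four rows: three full pieces of length 2 with status 0 and a final piece of length 1 that carries the subject's failure status.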

See Also

    ds.glm for generalized linear models
    ds.gee for generalized estimating equation models

Examples

    # load the file that contains the login details
    data(survivallogindata)

    # login and assign all the variables to R
    opals <- datashield.login(logins=survivallogindata, assign=TRUE)

    # this example shows how to run survival analysis in H-DataSHIELD using the piecewise exponential regression m

    # let us display the names of the variables in the original table (the table we assigned above and which by defau
    ds.colnames('D')

    # specify some baseline hazard profile (i.e. the width of the intervals to be used)
    bh <- c(2,1,3,0.5,1.5,2)

    # expand the original table (e.g. the survival time of each individual is split into pieces equal to the interval
    # we use the function ds.lexis which expands the original table and saves the expanded table on the server site
    # we set the parameter variables to NULL (default) which means include all the covariates in the expanded table
    # to indicate the variables to include if you have many variables and want to use only a subset of those.
    ds.lexis(data='D', intervalwidth=bh, idcol="id", entrycol="starttime",
             exitcol="endtime", statuscol="cens")

    # let us display the names of variables in the expanded table (by default it is the name of the original table fo
    ds.colnames('D_expanded')

    # Now fit a GLM with a poisson model
    # there is a direct relationship between the poisson model with a log-time offset and the exponential model so we
    # use glm to fit a poisson model and include a factor for the time intervals (TIMEID) to have different rates.
    # The vector SURVIVALTIME (the time elapsed between start of follow up and failure/censoring) and the vector TIME
    # which allows for different rates are generated when the initial table got expanded via the function ds.lexis.
    # In the below model the log of the survival time is used as an offset (some known information to be included in t

    # generate a vector of log survival time values
    ds.assign(toassign='log(D_expanded$SURVIVALTIME)', newobj='logsurvival')

    # Fit the GLM - the outcome is failure status
    ds.glm(formula='CENS~1+TIMEID+AGE.60+GENDER+NOISE.56+PM10.16', data='D_expanded',
           family='poisson', offset='logsurvival')

    # clear the Datashield R sessions and logout
    datashield.logout(opals)
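The relationship between the Poisson model with a log-time offset and the exponential model can be checked on simulated data with base R alone, no DataSHIELD session required. The sketch below is illustrative only and all variable names are assumptions: in the simplest case (a single interval and no covariates) the Poisson fit on the event indicator reproduces the closed-form exponential estimate of the hazard, the number of events divided by the total follow-up time.

```r
set.seed(42)
n      <- 500
t_true <- rexp(n, rate = 0.3)        # latent failure times
cens_t <- runif(n, 0, 6)             # independent censoring times
time   <- pmin(t_true, cens_t)       # observed follow-up
event  <- as.numeric(t_true <= cens_t)

# Poisson GLM on the event indicator with log follow-up time as offset
fit <- glm(event ~ 1, family = poisson, offset = log(time))
rate_poisson <- unname(exp(coef(fit)))

# Closed-form exponential MLE of the hazard: events / person-time
rate_exponential <- sum(event) / sum(time)
print(c(poisson = rate_poisson, exponential = rate_exponential))
```

Adding a factor for the time interval to such a model gives each piece its own rate, which is the piecewise exponential model fitted in the example above.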

geelogindata    Information required to login to opal servers for the GEE test data

Description

    A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

    data(geelogindata)

Format

    A data frame where the number of servers corresponds to the number of rows

    server      a character, the formal name of the study
    url         URL of the opal server
    user        a character, a formal username or a path to a valid ssl certificate, if
                required
    password    a character, a formal password or a path to a valid ssl key, if required
    table       a character, the path to the opal datasource that holds the data to
                analyse

Examples

    data(geelogindata)


geelogin_remoteserver    Information required to login to opal servers for the GEE test data

Description

    A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

    data(geelogin_remoteserver)

Format

    A data frame where the number of servers corresponds to the number of rows

    server      a character, the formal name of the study
    url         URL of the opal server
    user        a character, a formal username or a path to a valid ssl certificate, if
                required
    password    a character, a formal password or a path to a valid ssl key, if required
    table       a character, the path to the opal datasource that holds the data to
                analyse

Examples

    data(geelogin_remoteserver)


glmlogindata    Information required to login to opal servers for the GLM test data

Description

    A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

    data(glmlogindata)

Format

    A data frame where the number of servers corresponds to the number of rows

    server      a character, the formal name of the study
    url         URL of the opal server
    user        a character, a formal username or a path to a valid ssl certificate, if
                required
    password    a character, a formal password or a path to a valid ssl key, if required
    table       a character, the path to the opal datasource that holds the data to
                analyse

Examples

    data(glmlogindata)


glmlogin_remoteserver    Information required to login to opal servers for the GLM test data

Description

    A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

    data(glmlogin_remoteserver)

Format

    A data frame where the number of servers corresponds to the number of rows

    server      a character, the formal name of the study
    url         URL of the opal server
    user        a character, a formal username or a path to a valid ssl certificate, if
                required
    password    a character, a formal password or a path to a valid ssl key, if required
    table       a character, the path to the opal datasource that holds the data to
                analyse

Examples

    data(glmlogin_remoteserver)


survivallogindata    Information required to login to opal servers for the GLM test data

Description

    A table with 5 columns: study name, URL, username, password and opal datasource.

Usage

    data(survivallogindata)

Format

    A data frame where the number of servers corresponds to the number of rows

    server      a character, the formal name of the study
    url         URL of the opal server
    user        a character, a formal username or a path to a valid ssl certificate, if
                required
    password    a character, a formal password or a path to a valid ssl key, if required
    table       a character, the path to the opal datasource that holds the data to
                analyse

Examples

    data(survivallogindata)
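All of the login tables documented above share the same five-column layout, so a table for your own servers can be built by hand as an ordinary data frame. The sketch below is a minimal illustration; every value in it (study names, URLs, credentials and the form of the datasource path) is a placeholder assumption and not a real server or account.

```r
# All values below are placeholders, not real servers or credentials.
logindata <- data.frame(
  server   = c("study1", "study2"),
  url      = c("https://opal.example.org:8443", "https://opal.example.net:8443"),
  user     = c("username1", "username2"),
  password = c("password1", "password2"),
  table    = c("project.table1", "project.table2"),
  stringsAsFactors = FALSE
)
str(logindata)
```

Each row describes one server, matching the Format sections above; keeping the columns as character vectors rather than factors mirrors the "a character" description given for each field.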
