| Title: | Prediction Model Pooling, Selection and Performance Evaluation Across Multiply Imputed Datasets |
|---|---|
| Description: | Pooling, backward and forward selection of linear, logistic and Cox regression models in multiply imputed datasets. Backward and forward selection can be done from the pooled model using Rubin's Rules (RR), the D1, D2, D3, D4 and the median p-values method. This is also possible for Mixed models. The models can contain continuous, dichotomous, categorical and restricted cubic spline predictors and interaction terms between all these types of predictors. The stability of the models can be evaluated using (cluster) bootstrapping. The package further contains functions to pool model performance measures such as ROC/AUC, Reclassification, R-squared, scaled Brier score, H&L test and calibration plots for logistic regression models. Internal validation can be done across multiply imputed datasets with cross-validation or bootstrapping. The adjusted intercept after shrinkage of pooled regression coefficients can be obtained. Backward and forward selection as part of internal validation is possible. A function to externally validate logistic prediction models in multiply imputed datasets is available, as is a function to compare models. For Cox models a strata variable can be included. Eekhout (2017) <doi:10.1186/s12874-017-0404-7>. Wiel (2009) <doi:10.1093/biostatistics/kxp011>. Marshall (2009) <doi:10.1186/1471-2288-9-57>. |
| Authors: | Martijn Heymans [cre, aut] (ORCID: <https://orcid.org/0000-0002-3889-0921>), Iris Eekhout [ctb] |
| Maintainer: | Martijn Heymans <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.4.0 |
| Built: | 2026-06-04 08:04:06 UTC |
| Source: | https://github.com/mwheymans/psfmi |
Data from a placebo-controlled RCT with leukemia patients
data(anderson)
A data frame with 348 observations on the following 5 variables.
remission: continuous, remission in weeks
status: dichotomous
treatment: dichotomous, 0=placebo, 1=verum
sex: dichotomous, 0=female, 1=male
log_wbc: continuous, Log (number of white blood cells)
data(anderson) ## maybe str(anderson)
Original dataset of patients with an aortic dissection
data(aortadis)
A data frame with 226 observations on the following 10 variables.
Gender: dichotomous, 1=yes, 0=no
Age: continuous
Age_C: categorical, 0 = < 50 years, 1 = 50-59 years, 2 = 60-69 years, 3 = 70-79 years, 4 = 80 years and older
Aortadis: dichotomous, 1=yes, 0=no
Acute: dichotomous, 1=yes, 0=no
Acute3: categorical, 0 = No, 1 = Little, 2 = Much
Stomach_Ache: dichotomous, 1=yes, 0=no
Hyper: dichotomous, Hypertension, 1=yes, 0=no
Smoking: dichotomous, 1=yes, 0=no
Radiation: dichotomous, 1=yes, 0=no
data(aortadis) ## maybe str(aortadis)
Data of a non-experimental study in more than 300 elderly women
data(bmd)
A data frame with 348 observations on the following 5 variables.
bmd: continuous
age: continuous, years
menopaus: continuous, age of menopause
weight: continuous, weight in kg
walkscor: dichotomous, score on a walking test, 0=normal, 1=impaired
data(bmd) ## maybe str(bmd)
bw_single Backward selection of linear and logistic regression
models, using the likelihood-ratio chi-square value as the selection method.
bw_single( data, formula = NULL, Outcome = NULL, predictors = NULL, p.crit = 1, cat.predictors = NULL, spline.predictors = NULL, int.predictors = NULL, keep.predictors = NULL, nknots = NULL, model_type = "binomial" )
| Argument | Description |
|---|---|
| data | A data frame. |
| formula | A formula object to specify the model as normally used by glm. See "Details" and "Examples" for how it can be specified. |
| Outcome | Character vector containing the name of the outcome variable. |
| predictors | Character vector with the names of the predictor variables. At least one predictor variable has to be defined. Give predictors unique names and do not use predictor names that combine a name with numbers, such as age2, gender10, etc. |
| p.crit | A numerical scalar. P-value selection criterion. A value of 1 provides the full model without selection. |
| cat.predictors | A single string or a vector of strings to define the categorical variables. Default is NULL (no categorical predictors). |
| spline.predictors | A single string or a vector of strings to define the (restricted cubic) spline variables. Default is NULL (no spline predictors). See "Details". |
| int.predictors | A single string or a vector of strings with the names of the variables that form an interaction pair, separated by a ":" symbol. |
| keep.predictors | A single string or a vector of strings including the variables that are forced in the model during predictor selection. All types of variables are allowed. |
| nknots | A numerical vector that defines the number of knots for each spline predictor separately. |
| model_type | A character vector. If "binomial", a logistic regression model is used (default); if "linear", a linear regression model is used. |
A typical formula object has the form Outcome ~ terms. Categorical variables have to
be defined as Outcome ~ factor(variable), restricted cubic spline variables as
Outcome ~ rcs(variable, 3). Interaction terms can be defined as
Outcome ~ variable1*variable2 or Outcome ~ variable1 + variable2 + variable1:variable2.
All variables in the terms part have to be separated by a "+".
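The two interaction notations can be compared in base R without fitting a model. The sketch below is only an illustration: the variable names (Outcome, variable1, variable2, group) are hypothetical, and it inspects how R itself expands the formula terms rather than how bw_single processes them.

```r
# Hypothetical variable names, used only to inspect how R expands a formula.
f_star  <- Outcome ~ variable1 * variable2 + factor(group)
f_colon <- Outcome ~ variable1 + variable2 + variable1:variable2 + factor(group)

# "term.labels" lists the terms after the "*" operator has been expanded.
terms_star  <- attr(terms(f_star), "term.labels")
terms_colon <- attr(terms(f_colon), "term.labels")

print(terms_star)
# TRUE when both notations describe the same set of terms.
print(setequal(terms_star, terms_colon))
```

Both formulas yield the two main effects, the factor term and the variable1:variable2 interaction, which is why either notation can be used.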
An object of class smods (single models) from
which the following objects can be extracted: original dataset as data, final selected
model as RR_model_final, model at each selection step as RR_model_step,
p-values at final step according to selection method as multiparm_final, and
at each step as multiparm_step, formula object at final step as formula_final,
and at each step as formula_step and for start model as formula_initial,
predictors included at each selection step as predictors_in, predictors excluded
at each step as predictors_out, and Outcome, anova_test, p.crit, call,
model_type, predictors_final for names of predictors in final selection step and
predictors_initial for names of predictors in start model.
Martijn Heymans, 2020
http://missingdatasolutions.rbind.io/
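To make the selection idea concrete, here is a minimal base R sketch of backward selection driven by likelihood-ratio tests. It is not the package's implementation: the helper backward_lrt, the simulated data and the stopping rule (remove the predictor with the largest p-value until all p-values are at or below p.crit) are assumptions chosen for illustration, and only plain main-effect predictors are handled.

```r
# Simulated data (assumption): only x1 is related to the outcome.
set.seed(1)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$x1))

# Hypothetical helper: backward selection for a logistic model.
backward_lrt <- function(data, outcome, predictors, p.crit = 0.05) {
  current <- predictors
  while (length(current) > 0) {
    fit <- glm(reformulate(current, response = outcome),
               data = data, family = binomial)
    # Likelihood-ratio test for dropping each term in turn.
    lrt <- drop1(fit, test = "Chisq")
    # The first row of the table is "<none>", so it is skipped.
    pvals <- setNames(lrt[["Pr(>Chi)"]][-1], rownames(lrt)[-1])
    # Stop when every remaining predictor meets the criterion.
    if (max(pvals) <= p.crit) break
    # Otherwise remove the predictor with the largest p-value.
    current <- setdiff(current, names(pvals)[which.max(pvals)])
  }
  current
}

selected <- backward_lrt(dat, "y", c("x1", "x2", "x3"), p.crit = 0.05)
print(selected)
```

With p.crit = 1 the loop stops immediately and all predictors are kept, which mirrors the documented behaviour that a value of 1 returns the model without selection.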
Data on the concentration of β2-microglobulin in urine as an indicator of possible kidney damage
data(chlrform)
A data frame with 348 observations on the following 5 variables.
pt_id: continuous
sport: categorical, 0 = football player, 1 = outdoor swimmer, 2 = indoor swimmer
gammagt: continuous, liver damage
b2: continuous, beta2-microglobulin in mg per mol
age: continuous, age in years
data(chlrform) ## maybe str(chlrform)
Long dataset of persons from the Amsterdam Growth and Health Longitudinal Study (AGHLS)
data(chol_long)
A data frame with 588 observations on the following 7 variables.
ID: continuous
fitness: continuous
Smoking: dichotomous, 1=yes, 0=no
Sex: dichotomous
Time: categorical
Cholesterol: continuous
SumSkinfolds: continuous
data(chol_long) ## maybe str(chol_long)
Wide dataset of persons from the Amsterdam Growth and Health Longitudinal Study (AGHLS)
data(chol_wide)
A data frame with 147 observations on the following 12 variables.
ID: continuous
Cholesterol1: continuous
SumSkinfolds1: continuous
Cholesterol2: continuous
SumSkinfolds2: continuous
Cholesterol3: continuous
SumSkinfolds3: continuous
Cholesterol4: continuous
SumSkinfolds4: continuous
fitness: continuous
Smoking: dichotomous
Sex: dichotomous
data(chol_wide) ## maybe str(chol_wide)
coxph_bw Backward selection of Cox regression models in a single complete dataset,
using the partial likelihood-ratio statistic as the selection method.
coxph_bw( data, formula = NULL, status = NULL, time = NULL, predictors = NULL, p.crit = 1, cat.predictors = NULL, spline.predictors = NULL, int.predictors = NULL, keep.predictors = NULL, nknots = NULL )
| Argument | Description |
|---|---|
| data | A data frame. |
| formula | A formula object to specify the model as normally used by coxph. See "Details" and "Examples" for how it can be specified. |
| status | The status variable, normally 0=censoring, 1=event. |
| time | Survival time. |
| predictors | Character vector with the names of the predictor variables. At least one predictor variable has to be defined. Give predictors unique names and do not use predictor names that combine a name with numbers, such as age2, gender10, etc. |
| p.crit | A numerical scalar. P-value selection criterion. A value of 1 provides the full model without selection. |
| cat.predictors | A single string or a vector of strings to define the categorical variables. Default is NULL (no categorical predictors). |
| spline.predictors | A single string or a vector of strings to define the (restricted cubic) spline variables. Default is NULL (no spline predictors). See "Details". |
| int.predictors | A single string or a vector of strings with the names of the variables that form an interaction pair, separated by a ":" symbol. |
| keep.predictors | A single string or a vector of strings including the variables that are forced in the model during predictor selection. All types of variables are allowed. |
| nknots | A numerical vector that defines the number of knots for each spline predictor separately. |
A typical formula object has the form Surv(time, status) ~ terms. Categorical variables have to
be defined as Surv(time, status) ~ factor(variable), restricted cubic spline variables as
Surv(time, status) ~ rcs(variable, 3). Interaction terms can be defined as
Surv(time, status) ~ variable1*variable2 or Surv(time, status) ~ variable1 + variable2 +
variable1:variable2. All variables in the terms part have to be separated by a "+".
An object of class smods (single models) from
which the following objects can be extracted: original dataset as data, final selected
model as RR_model_final, model at each selection step RR_model,
p-values at final step multiparm_final, and at each step as multiparm,
formula object at final step as formula_final,
and at each step as formula_step and for start model as formula_initial,
predictors included at each selection step as predictors_in, predictors excluded
at each step as predictors_out, and time, status, p.crit, call,
model_type, predictors_final for names of predictors in final selection step and
predictors_initial for names of predictors in start model and keep.predictors for
variables that are forced in the model during selection.
Martijn Heymans, 2021
http://missingdatasolutions.rbind.io/
lbpmicox1 <- subset(psfmi::lbpmicox, Impnr==1) # extract first imputed dataset
res_single <- coxph_bw(data=lbpmicox1, p.crit = 0.05, formula=Surv(Time, Status) ~ Previous + Radiation + Onset + Age + Tampascale + Pain + JobControl + factor(Satisfaction), spline.predictors = "Function", nknots = 3)
res_single$RR_model_final
res_single$multiparm_final
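The partial likelihood-ratio statistic for a single predictor can be computed by hand from two nested Cox models. The sketch below is a generic illustration using the survival package and simulated data (the data frame sim and its variables are assumptions); it does not reproduce the internals of coxph_bw.

```r
library(survival)

# Simulated data (assumption): x1 raises the hazard, x2 has no effect.
set.seed(42)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
event_time  <- rexp(n, rate = exp(0.8 * sim$x1))
censor_time <- rexp(n, rate = 0.3)
sim$time   <- pmin(event_time, censor_time)
sim$status <- as.integer(event_time <= censor_time)  # 1 = event, 0 = censored

# Nested models: the reduced model leaves out x1.
fit_full    <- coxph(Surv(time, status) ~ x1 + x2, data = sim)
fit_reduced <- coxph(Surv(time, status) ~ x2, data = sim)

# loglik[2] is the partial log-likelihood at the fitted coefficients.
lr_stat <- 2 * (fit_full$loglik[2] - fit_reduced$loglik[2])
df      <- length(coef(fit_full)) - length(coef(fit_reduced))
p_value <- pchisq(lr_stat, df = df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = df, p.value = p_value))
```

In a backward selection step, a p-value of this kind would be compared with p.crit to decide whether the predictor stays in the model.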
coxph_fw Forward selection of Cox regression models in a single complete
dataset, using the partial likelihood-ratio statistic as the selection method.
coxph_fw( data, formula = NULL, status = NULL, time = NULL, predictors = NULL, p.crit = 1, cat.predictors = NULL, spline.predictors = NULL, int.predictors = NULL, keep.predictors = NULL, nknots = NULL )
| Argument | Description |
|---|---|
| data | A data frame. |
| formula | A formula object to specify the model as normally used by coxph. See "Details" and "Examples" for how it can be specified. |
| status | The status variable, normally 0=censoring, 1=event. |
| time | Survival time. |
| predictors | Character vector with the names of the predictor variables. At least one predictor variable has to be defined. Give predictors unique names and do not use predictor names that combine a name with numbers, such as age2, gender10, etc. |
| p.crit | A numerical scalar. P-value selection criterion. A value of 1 provides the full model without selection. |
| cat.predictors | A single string or a vector of strings to define the categorical variables. Default is NULL (no categorical predictors). |
| spline.predictors | A single string or a vector of strings to define the (restricted cubic) spline variables. Default is NULL (no spline predictors). See "Details". |
| int.predictors | A single string or a vector of strings with the names of the variables that form an interaction pair, separated by a ":" symbol. |
| keep.predictors | A single string or a vector of strings including the variables that are forced in the model during predictor selection. All types of variables are allowed. |
| nknots | A numerical vector that defines the number of knots for each spline predictor separately. |
A typical formula object has the form Surv(time, status) ~ terms. Categorical variables have to
be defined as Surv(time, status) ~ factor(variable), restricted cubic spline variables as
Surv(time, status) ~ rcs(variable, 3). Interaction terms can be defined as
Surv(time, status) ~ variable1*variable2 or Surv(time, status) ~ variable1 + variable2 +
variable1:variable2. All variables in the terms part have to be separated by a "+".
An object of class smods (single models) from
which the following objects can be extracted: original dataset as data, final selected
model as RR_model_final, model at each selection step RR_model,
p-values at final step multiparm_final, and at each step as multiparm,
formula object at final step as formula_final,
and at each step as formula_step and for start model as formula_initial,
predictors included at each selection step as predictors_in, predictors excluded
at each step as predictors_out, and time, status, p.crit, call,
model_type, predictors_final for names of predictors in final selection step and
predictors_initial for names of predictors in start model and keep.predictors for
variables that are forced in the model during selection.
Martijn Heymans, 2021
http://missingdatasolutions.rbind.io/
lbpmicox1 <- subset(psfmi::lbpmicox, Impnr==1) # extract first imputed dataset
res_single <- coxph_fw(data=lbpmicox1, p.crit = 0.05, formula=Surv(Time, Status) ~ Previous + Radiation + Onset + Age + Tampascale + Pain + JobControl + factor(Satisfaction), spline.predictors = "Function", nknots = 3)
res_single$RR_model_final
res_single$multiparm_final
Dataset of low back pain patients with missing values in 2 variables
data(day2_dataset4_mi)
A data frame with 100 observations on the following 8 variables.
ID: continuous, unique patient numbers
Pain: continuous, Pain intensity
Tampa: continuous, Fear of Movement scale
Function: continuous, Functional Status
JobSocial: continuous
FAB: continuous, Fear Avoidance Beliefs
Gender: dichotomous, 1 = male, 0 = female
Radiation: dichotomous, 1 = yes, 0 = no
data(day2_dataset4_mi) ## maybe str(day2_dataset4_mi)
glm_bw Backward selection of linear and logistic regression
models in a single dataset, using the likelihood-ratio test as the selection method.
glm_bw( data, formula = NULL, Outcome = NULL, predictors = NULL, p.crit = 1, cat.predictors = NULL, spline.predictors = NULL, int.predictors = NULL, keep.predictors = NULL, nknots = NULL, model_type = "binomial" )
| Argument | Description |
|---|---|
| data | A data frame. |
| formula | A formula object to specify the model as normally used by glm. See "Details" and "Examples" for how it can be specified. |
| Outcome | Character vector containing the name of the outcome variable. |
| predictors | Character vector with the names of the predictor variables. At least one predictor variable has to be defined. Give predictors unique names and do not use predictor names that combine a name with numbers, such as age2, gender10, etc. |
| p.crit | A numerical scalar. P-value selection criterion. A value of 1 provides the full model without selection. |
| cat.predictors | A single string or a vector of strings to define the categorical variables. Default is NULL (no categorical predictors). |
| spline.predictors | A single string or a vector of strings to define the (restricted cubic) spline variables. Default is NULL (no spline predictors). See "Details". |
| int.predictors | A single string or a vector of strings with the names of the variables that form an interaction pair, separated by a ":" symbol. |
| keep.predictors | A single string or a vector of strings including the variables that are forced in the model during predictor selection. All types of variables are allowed. |
| nknots | A numerical vector that defines the number of knots for each spline predictor separately. |
| model_type | A character vector. If "binomial", a logistic regression model is used (default); if "linear", a linear regression model is used. |
A typical formula object has the form Outcome ~ terms. Categorical variables have to
be defined as Outcome ~ factor(variable), restricted cubic spline variables as
Outcome ~ rcs(variable, 3). Interaction terms can be defined as
Outcome ~ variable1*variable2 or Outcome ~ variable1 + variable2 + variable1:variable2.
All variables in the terms part have to be separated by a "+".
An object of class smods (single models) from
which the following objects can be extracted: original dataset as data,
model at each selection step RR_model, final selected model as RR_model_final,
p-values at final step multiparm_final, and at each step as multiparm,
formula object at final step as formula_final,
and at each step as formula_step and for start model as formula_initial,
predictors included at each selection step as predictors_in, predictors excluded
at each step as predictors_out, and Outcome, p.crit, call,
model_type, predictors_final for names of predictors in final selection step and
predictors_initial for names of predictors in start model and keep.predictors for
variables that are forced in the model during selection.
Martijn Heymans, 2021
http://missingdatasolutions.rbind.io/
data1 <- subset(psfmi::lbpmilr, Impnr==1) # extract first imputed dataset
res_single <- glm_bw(data=data1, p.crit = 0.05, formula=Chronic ~ Tampascale + Smoking + factor(Satisfaction), model_type="binomial")
res_single$RR_model_final
res_single <- glm_bw(data=data1, p.crit = 0.05, formula=Pain ~ Tampascale + Smoking + factor(Satisfaction), model_type="linear")
res_single$RR_model_final
glm_fw Forward selection of linear and logistic regression
models in a single dataset, using the likelihood-ratio test statistic as the selection method.
glm_fw( data, formula = NULL, Outcome = NULL, predictors = NULL, p.crit = 1, cat.predictors = NULL, spline.predictors = NULL, int.predictors = NULL, keep.predictors = NULL, nknots = NULL, model_type = "binomial" )
| Argument | Description |
|---|---|
| data | A data frame. |
| formula | A formula object to specify the model as normally used by glm. See "Details" and "Examples" for how it can be specified. |
| Outcome | Character vector containing the name of the outcome variable. |
| predictors | Character vector with the names of the predictor variables. At least one predictor variable has to be defined. Give predictors unique names and do not use predictor names that combine a name with numbers, such as age2, gender10, etc. |
| p.crit | A numerical scalar. P-value selection criterion. A value of 1 provides the full model without selection. |
| cat.predictors | A single string or a vector of strings to define the categorical variables. Default is NULL (no categorical predictors). |
| spline.predictors | A single string or a vector of strings to define the (restricted cubic) spline variables. Default is NULL (no spline predictors). See "Details". |
| int.predictors | A single string or a vector of strings with the names of the variables that form an interaction pair, separated by a ":" symbol. |
| keep.predictors | A single string or a vector of strings including the variables that are forced in the model during predictor selection. All types of variables are allowed. |
| nknots | A numerical vector that defines the number of knots for each spline predictor separately. |
| model_type | A character vector. If "binomial", a logistic regression model is used (default); if "linear", a linear regression model is used. |
A typical formula object has the form Outcome ~ terms. Categorical variables have to
be defined as Outcome ~ factor(variable), restricted cubic spline variables as
Outcome ~ rcs(variable, 3). Interaction terms can be defined as
Outcome ~ variable1*variable2 or Outcome ~ variable1 + variable2 + variable1:variable2.
All variables in the terms part have to be separated by a "+".
An object of class smods (single models) from
which the following objects can be extracted: original dataset as data,
model at each selection step RR_model, final selected model as RR_model_final,
p-values at final step multiparm_final, and at each step as multiparm,
formula object at final step as formula_final,
and at each step as formula_step and for start model as formula_initial,
predictors included at each selection step as predictors_in, predictors excluded
at each step as predictors_out, and Outcome, p.crit, call,
model_type, predictors_final for names of predictors in final selection step and
predictors_initial for names of predictors in start model and keep.predictors for
variables that are forced in the model during selection.
Martijn Heymans, 2021
http://missingdatasolutions.rbind.io/
data1 <- subset(psfmi::lbpmilr, Impnr==1) # extract first imputed dataset
res_single <- glm_fw(data=data1, p.crit = 0.05, formula=Chronic ~ Tampascale + Smoking + factor(Satisfaction), model_type="binomial")
res_single$RR_model_final
res_single <- glm_fw(data=data1, p.crit = 0.05, formula=Pain ~ Tampascale + Smoking + factor(Satisfaction), model_type="linear")
res_single$RR_model_final
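For comparison, forward selection by likelihood-ratio test can be sketched in base R with add1. This is an illustration only, not the code behind glm_fw: the helper forward_lrt, the simulated data and the entry rule (add the candidate with the smallest p-value as long as it is below p.crit) are assumptions.

```r
# Simulated data (assumption): only x1 is related to the outcome.
set.seed(2)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$x1))

# Hypothetical helper: forward selection for a logistic model.
forward_lrt <- function(data, outcome, predictors, p.crit = 0.05) {
  current <- character(0)
  repeat {
    candidates <- setdiff(predictors, current)
    if (length(candidates) == 0) break
    # Start from the intercept-only model when nothing is selected yet.
    rhs <- if (length(current) == 0) "1" else current
    fit <- glm(reformulate(rhs, response = outcome),
               data = data, family = binomial)
    # Likelihood-ratio test for adding each remaining candidate.
    lrt <- add1(fit, scope = reformulate(predictors), test = "Chisq")
    # The first row of the table is "<none>", so it is skipped.
    pvals <- setNames(lrt[["Pr(>Chi)"]][-1], rownames(lrt)[-1])
    # Stop when no candidate meets the entry criterion.
    if (min(pvals) >= p.crit) break
    current <- c(current, names(pvals)[which.min(pvals)])
  }
  current
}

selected <- forward_lrt(dat, "y", c("x1", "x2", "x3"), p.crit = 0.05)
print(selected)
```

Because candidates enter in order of their p-value, the strongest predictor is added first; with p.crit = 1 every candidate eventually enters, giving the full model.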
Original dataset of elderly patients with a hip fracture
data(hipstudy)
A data frame with 426 observations on the following 18 variables.
pat_id: continuous, unique patient numbers
Gender: dichotomous, 1 = male, 0 = female
Age: continuous, Years
Mobility: categorical, 1 = No tools, 2 = Stick / walker, 3 = Wheelchair / bed
Dementia: dichotomous, 2=yes, 1=no
Home: categorical, 1 = Independent, 2 = Elderly house, 3 = Nursing home
Comorbidity: continuous, Number of Co-morbidities (0-4)
ASA: continuous, ASA score (1-4)
Hemoglobine: continuous, Hemoglobin preoperative
Leucocytes: continuous, Leucocytes preoperative
Thrombocytes: continuous, Thrombocytes preoperative
CRP: continuous, C-reactive protein (CRP) preoperative
Creatinine: continuous, Creatinine preoperative
Urea: continuous, Urea preoperative
Albumine: continuous, Albumin preoperative
Fracture: dichotomous, 1 = per or subtrochanter fracture, 0 = collum fracture
Delay: continuous, time till operation in days
Mortality: dichotomous, 1 = yes, 0 = no
data(hipstudy) ## maybe str(hipstudy)
External dataset of elderly patients with a hip fracture
data(hipstudy_external)
A data frame with 381 observations on the following 17 variables.
Gender: dichotomous, 1 = male, 0 = female
Age: continuous, Years
Mobility: categorical, 1 = No tools, 2 = Stick / walker, 3 = Wheelchair / bed
Dementia: dichotomous, 2=yes, 1=no
Home: categorical, 1 = Independent, 2 = Elderly house, 3 = Nursing home
Comorbidity: continuous, Number of Co-morbidities
ASA: continuous, ASA score
Hemoglobine: continuous, Hemoglobin preoperative
Leucocytes: continuous, Leucocytes preoperative
Thrombocytes: continuous, Thrombocytes preoperative
CRP: continuous, C-reactive protein (CRP) preoperative
Creatinine: continuous, Creatinine preoperative
Urea: continuous, Urea preoperative
Albumine: continuous, Albumin preoperative
Fracture: dichotomous, 1 = per or subtrochanter fracture, 0 = collum fracture
Delay: continuous, time till operation in days
Mortality: dichotomous, 1 = yes, 0 = no
data(hipstudy_external) ## maybe str(hipstudy_external)
Dataset of the Hoorn Study
data(hoorn_basic)
A data frame with 250 observations on the following 12 variables.
patnr: continuous
sbldsys1: continuous, Systolic Blood Pressure 1
sbldsys2: continuous, Systolic Blood Pressure 2
sbldds1: continuous, Diastolic Blood Pressure 1
sbldds2: continuous, Diastolic Blood Pressure 2
sex: dichotomous, 1=male, 2=female
sfructo: continuous, fructosamine level in the blood
sglucn: continuous
dmknown: dichotomous, 0=no, 1=yes
dmdiet: dichotomous, 0=no, 1=yes
infarct: dichotomous, 0=no, 1=yes
hypten: dichotomous, 0=no, 1=yes
data(hoorn_basic) ## maybe str(hoorn_basic)
hoslem_test The Hosmer and Lemeshow goodness-of-fit test.
hoslem_test(y, yhat, g = 10)
| Argument | Description |
|---|---|
| y | A vector of observations (0/1). |
| yhat | A vector of predicted probabilities. |
| g | Number of groups tested. Default is 10. Cannot be less than 3. |
The Chi-squared test statistic, the p-value, the observed and expected frequencies.
Martijn Heymans, 2021
Kleinman K and Horton NJ. (2014). SAS and R: Data Management, Statistical Analysis, and Graphics. 2nd Edition. Chapman & Hall/CRC.
fit <- glm(Mortality ~ Dementia + factor(Mobility) + ASA + Gender + Age, data=hipstudy, family=binomial)
pred <- predict(fit, type = "response")
hoslem_test(fit$y, pred)
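The test statistic itself can be sketched in a few lines of base R. The function hl_sketch below is a hypothetical re-implementation of the textbook Hosmer and Lemeshow procedure (groups formed by quantiles of the predicted probabilities, chi-squared reference distribution with g - 2 degrees of freedom); it is not the source of hoslem_test, and the grouping details are assumptions.

```r
# Hypothetical helper: textbook Hosmer-Lemeshow statistic.
hl_sketch <- function(y, yhat, g = 10) {
  stopifnot(g >= 3, length(y) == length(yhat))
  # Group subjects by quantiles of the predicted probabilities.
  # Note: cut() fails if tied predictions make the breaks non-unique.
  breaks <- quantile(yhat, probs = seq(0, 1, length.out = g + 1))
  grp <- cut(yhat, breaks = breaks, include.lowest = TRUE)
  n_g  <- tapply(y, grp, length)
  obs1 <- tapply(y, grp, sum)      # observed events per group
  exp1 <- tapply(yhat, grp, sum)   # expected events per group
  obs0 <- n_g - obs1               # observed non-events
  exp0 <- n_g - exp1               # expected non-events
  chisq <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
  df <- g - 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df = df, lower.tail = FALSE),
       observed = cbind(y0 = obs0, y1 = obs1),
       expected = cbind(yhat0 = exp0, yhat1 = exp1))
}

# Simulated data (assumption) with a correctly specified logistic model.
set.seed(3)
x <- rnorm(400)
y <- rbinom(400, 1, plogis(0.3 + x))
fit_sim <- glm(y ~ x, family = binomial)
res <- hl_sketch(fit_sim$y, fitted(fit_sim), g = 10)
print(res$chisq)
print(res$p.value)
```

The degrees of freedom g - 2 explain why fewer than 3 groups cannot be used.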
Data of a patient-control study regarding the relationship between MI and smoking
data(infarct)
A data frame with 420 observations on the following 10 variables.
ppnr: continuous
infarct: dichotomous, 1=yes, 0=no
smoking: dichotomous, 1=yes, 0=no
alcohol: categorical
active: dichotomous, 1=active, 0=inactive
sex: dichotomous, 1=male, 0=female
profession: categorical, 1=epidemiologist, 2=statistician, 3=other
bmi: continuous, body mass index
sys: continuous, systolic blood pressure
dias: continuous, diastolic blood pressure
data(infarct) ## maybe str(infarct)
5 imputed datasets of the first 10 centres of the IPDNa dataset in the micemd package.
data(ipdna_md)
A data frame with 13390 observations on the following 13 variables.
.imp: a numeric vector
.id: a numeric vector
centre: cluster variable
gender: dichotomous
bmi: continuous
age: continuous
sbp: continuous
dbp: continuous
hr: continuous
lvef: dichotomous
bnp: categorical
afib: continuous
bmi_cat: categorical
data(ipdna_md)
## maybe str(ipdna_md)
#summary per study
by(ipdna_md, ipdna_md$centre, summary)
km_estimates Kaplan-Meier risk estimates for Net Reclassification Index analysis
for Cox Regression Models
km_estimates(data, p0, p1, time, status, t_risk, cutoff)
| Argument | Description |
|---|---|
| data | Data frame with relevant predictors. |
| p0 | Risk outcome probabilities for the reference model. |
| p1 | Risk outcome probabilities for the new model. |
| time | Character vector. Name of time variable. |
| status | Character vector. Name of status variable. |
| t_risk | Follow-up value to calculate cases, controls. See "Details". |
| cutoff | A numerical vector that defines the outcome probability cutoff values. |
t_risk is the follow-up time at which cases and controls are determined. For subjects censored
before this follow-up time, the expected risk of being a case is derived from the Kaplan-Meier
estimate, which gives the expected number of cases. These expected numbers are used to calculate
the NRI proportions. (They are not shown by the function nricens.)
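The idea of an expected number of cases can be illustrated with a small base R sketch. The helper km_at is hypothetical and implements the standard product-limit estimator; the toy data and the final step (group size multiplied by one minus the Kaplan-Meier survival at t_risk) are assumptions used for illustration, not the internals of km_estimates.

```r
# Hypothetical helper: Kaplan-Meier survival probability at t_risk.
km_at <- function(time, status, t_risk) {
  # Distinct event times up to and including t_risk.
  event_times <- sort(unique(time[status == 1 & time <= t_risk]))
  surv <- 1
  for (t in event_times) {
    n_at_risk <- sum(time >= t)               # still under observation at t
    d <- sum(time == t & status == 1)         # events at t
    surv <- surv * (1 - d / n_at_risk)
  }
  surv
}

# Toy data (assumption): 5 subjects, status 1 = event, 0 = censored.
time   <- c(1, 2, 3, 4, 5)
status <- c(1, 0, 1, 1, 0)

# Survival at t_risk = 3 equals (1 - 1/5) * (1 - 1/3).
s3 <- km_at(time, status, t_risk = 3)

# Expected number of cases by t_risk, allowing for the censored subject.
expected_cases <- length(time) * (1 - s3)
print(c(survival = s3, expected_cases = expected_cases))
```

Counting only observed events by time 3 would give 2 cases, whereas the Kaplan-Meier based expectation is slightly higher because the subject censored at time 2 contributes a partial risk.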
An object from which the following objects can be extracted:
data: dataset.
prob_orig: outcome risk probabilities at t_risk for reference model.
prob_new: outcome risk probabilities at t_risk for new model.
time: name of time variable.
status: name of status variable.
cutoff: cutoff value for survival probability.
t_risk: follow-up time used to calculate outcome (risk) probabilities.
reclass_totals: table with total reclassification numbers.
reclass_cases: table with reclassification numbers for cases.
reclass_controls: table with reclassification numbers for controls.
totals: totals of controls, cases, censored cases.
km_est: totals of cases calculated using Kaplan-Meier risk estimates.
nri_est: reclassification measures.
Martijn Heymans, 2023
Cook NR, Ridker PM. Advances in measuring the effect of individual predictors of cardiovascular risk: the role of reclassification measures. Ann Intern Med. 2009;150(11):795-802.
Steyerberg EW, Pencina MJ. Reclassification calculations for persons with incomplete follow-up. Ann Intern Med. 2010;152(3):195-6 (author reply 196-7).
Pencina MJ, D'Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30(1):11-21
Inoue E (2018). nricens: NRI for Risk Prediction Models with Time to Event and Binary Response Data. R package version 1.6, <https://CRAN.R-project.org/package=nricens>.
library(survival)
lbpmicox1 <- subset(psfmi::lbpmicox, Impnr==1) # extract dataset
fit_cox0 <- coxph(Surv(Time, Status) ~ Duration + Pain, data=lbpmicox1, x=TRUE)
fit_cox1 <- coxph(Surv(Time, Status) ~ Duration + Pain + Function + Radiation, data=lbpmicox1, x=TRUE)
p0 <- risk_coxph(fit_cox0, t_risk=80)
p1 <- risk_coxph(fit_cox1, t_risk=80)
res_km <- km_estimates(data=lbpmicox1, p0=p0, p1=p1, time = "Time", status = "Status", cutoff=0.45, t_risk=80)
Kaplan-Meier (KM) estimate at specific time point
km_fit(time, status, t_risk)
time |
Character vector. Name of time variable. |
status |
Character vector. Name of status variable. |
t_risk |
Follow-up value to calculate cases, controls. See details. |
KM estimate at specific time point
Martijn Heymans, 2023
Pencina MJ, D'Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30(1):11-21
Inoue E (2018). nricens: NRI for Risk Prediction Models with Time to Event and Binary Response Data. R package version 1.6, <https://CRAN.R-project.org/package=nricens>.
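The idea of a Kaplan-Meier estimate at one follow-up time can be sketched in base R as follows. This is a minimal illustration only: `km_at` is a hypothetical helper, not the package's `km_fit`, it takes numeric vectors rather than variable names, and the data are made up. Whether `km_fit` reports the survival probability or its complement is not stated in this section, so both are computed.

```r
# Kaplan-Meier survival estimate at time t_risk (hypothetical helper, base R only)
km_at <- function(time, status, t_risk) {
  # distinct event times up to and including t_risk
  event_times <- sort(unique(time[status == 1 & time <= t_risk]))
  surv <- 1
  for (tj in event_times) {
    n_risk  <- sum(time >= tj)                 # persons still at risk just before tj
    n_event <- sum(time == tj & status == 1)   # events observed at tj
    surv <- surv * (1 - n_event / n_risk)      # product-limit update
  }
  surv
}

# Made-up follow-up data: status 1 = event, 0 = censored
time   <- c(5, 8, 12, 12, 20, 25, 30, 41)
status <- c(1, 0, 1, 1, 0, 1, 0, 1)

surv_20 <- km_at(time, status, t_risk = 20)   # survival probability at t = 20
risk_20 <- 1 - surv_20                        # corresponding event risk
```

Censored observations contribute to the risk set until they leave it, which is why they affect the estimate without counting as events.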
Original dataset with missing values
data(lbp_orig)
A data frame with 159 observations on the following 15 variables.
Chronic dichotomous
Gender dichotomous
Carrying categorical
Pain continuous
Tampascale continuous
Function continuous
Radiation dichotomous
Age continuous
Smoking dichotomous
Satisfaction categorical
JobControl continuous
JobDemands continuous
SocialSupport continuous
Duration continuous
BMI continuous
data(lbp_orig)
## maybe str(lbp_orig)
Five multiply imputed datasets
lbpmi_extval
A data frame with 400 rows and 17 variables.
Impnr a numeric vector
ID a numeric vector
Chronic dichotomous
Gender dichotomous
Carrying categorical
Pain continuous
Tampascale continuous
Function continuous
Radiation dichotomous
Age continuous
Smoking dichotomous
Satisfaction categorical
JobControl continuous
JobDemands continuous
SocialSupport continuous
Duration continuous
BMI continuous
data(lbpmi_extval)
## maybe str(lbpmi_extval)
10 imputed datasets
data(lbpmicox)
A data frame with 2650 observations on the following 18 variables.
Impnr a numeric vector
patnr a numeric vector
Status dichotomous event
Time continuous follow-up time variable
Duration continuous
Previous dichotomous
Radiation dichotomous
Onset dichotomous
Age continuous
Tampascale continuous
Pain continuous
Function continuous
Satisfaction categorical
JobControl continuous
JobDemand continuous
Social continuous
Expectation a numeric vector
Expect_cat categorical
data(lbpmicox)
## maybe str(lbpmicox)
10 imputed datasets
data(lbpmilr)
A data frame with 1590 observations on the following 17 variables.
Impnr a numeric vector
ID a numeric vector
Chronic dichotomous
Gender dichotomous
Carrying categorical
Pain continuous
Tampascale continuous
Function continuous
Radiation dichotomous
Age continuous
Smoking dichotomous
Satisfaction categorical
JobControl continuous
JobDemands continuous
SocialSupport continuous
Duration continuous
BMI continuous
data(lbpmilr)
## maybe str(lbpmilr)
1 development dataset
data(lbpmilr_dev)
A data frame with 108 observations on the following 16 variables.
ID a numeric vector
Chronic dichotomous
Gender dichotomous
Carrying categorical
Pain continuous
Tampascale continuous
Function continuous
Radiation dichotomous
Age continuous
Smoking dichotomous
Satisfaction categorical
JobControl continuous
JobDemands continuous
SocialSupport continuous
Duration continuous
BMI continuous
data(lbpmilr_dev)
## maybe str(lbpmilr_dev)
Data regarding the development of lung and heart volume of unborn babies from week 18 to week 34 of pregnancy
data(lungvolume)
A data frame with 152 observations on the following 6 variables.
pat_id continuous
week continuous: week of pregnancy
weight continuous: weight in grams
lungvol continuous: lung volume
heartvol continuous: heart volume
Nweek categorical: Percentile Group of week
data(lungvolume)
## maybe str(lungvolume)
Data of a study among women with breast cancer
data(mammaca)
A data frame with 1207 observations on the following 10 variables.
id continuous
time continuous, Time (months)
status dichotomous: 1=yes, 0=no
er Estrogen Receptor Status, 1=positive, 0=negative
age continuous
histgrad categorical
ln_yesno lymph nodes, 0=no, 1=yes
pathsd dichotomous: Pathological Tumor Size
pr dichotomous: Progesterone Receptor Status, 0=negative, 1=positive
data(mammaca)
## maybe str(mammaca)
Data of 613 patients with meningitis
data(men)
A data frame with 420 observations on the following 10 variables.
pt_id continuous
sex dichotomous: 0=male, 1=female
predisp dichotomous: 0=no, 1=yes
mensepsi categorical: disease characteristics at admission, 1=meningitis, 2=sepsis, 3=other
coma dichotomous: coma at admission, 0=no, 1=coma
diastol continuous: diastolic blood pressure at admission
course dichotomous: disease course, 0=alive, 1=deceased
data(men)
## maybe str(men)
mivalext_lr External validation of logistic prediction models
mivalext_lr(
  data.val = NULL,
  data.orig = NULL,
  nimp = 5,
  impvar = NULL,
  formula = NULL,
  lp.orig = NULL,
  cal.plot = FALSE,
  plot.indiv,
  val.check = FALSE,
  g = 10,
  groups_cal = 10,
  plot.method = "mean"
)
data.val |
Data frame with stacked multiply imputed validation datasets. The original dataset that contains missing values must be excluded from the dataset. The imputed datasets must be distinguished by an imputation variable, specified under impvar, with values starting at 1. |
data.orig |
A single data frame containing the original dataset that was used to develop the model. Used to estimate the original regression coefficients in case lp.orig is not provided. |
nimp |
A numerical scalar. Number of imputed datasets. Default is 5. |
impvar |
A character vector. Name of the variable that distinguishes the imputed datasets. |
formula |
A formula object to specify the model as normally used by glm. |
lp.orig |
Numeric vector of the original coefficient values that are externally validated. |
cal.plot |
If TRUE a calibration plot is generated. Default is FALSE. |
plot.indiv |
This argument is deprecated; please use plot.method instead. |
val.check |
logical vector. If TRUE the names of the predictors of the LP are provided and can be used as information for the order of the coefficient values as input for lp.orig. If FALSE (default) validation procedure is executed with coefficient values fitted in the order as used under lp.orig. |
g |
A numerical scalar. Number of groups for the Hosmer and Lemeshow test. Default is 10. |
groups_cal |
A numerical scalar. Number of groups used on the calibration plot. Default is 10. If the range of predicted probabilities is low, fewer than 10 groups can be chosen. |
plot.method |
If "mean" one calibration plot is generated, first taking the mean of the linear predictor values across the multiply imputed datasets (default), if "individual" the calibration plot in each imputed dataset is plotted, if "overlay" calibration plots from each imputed dataset are plotted in one figure. |
The following information about the externally validated model is provided:
calibrate, with pooled_int and pooled_slope, the pooled intercept and slope of
the linear predictor (LP) after the LP is freely estimated in each external imputed
dataset as Outcome ~ a + LP (provides information about miscalibration in intercept
and slope); pooled_offset_int, from Outcome ~ a + offset(LP), and
pooled_offset_slope, from Outcome ~ a + LP + offset(LP), with information
about miscalibration in intercept and slope separately by using an offset procedure
(see Steyerberg, p. 300); coef_pooled, the pooled coefficients when the model
is freely estimated in the imputed datasets; ROC, the pooled ROC curve (back transformed
after pooling log transformed ROC curves); R2, the pooled Nagelkerke R-squared value
(back transformed after pooling Fisher transformed values); and HLtest, the pooled Hosmer
and Lemeshow test (using function pool_D2). In addition, information is provided about
nimp, impvar, formula, val_ckeck, g and coef_check.
When the external validation is very poor, the R2 can become negative due to the poor fit of
the model in the external dataset (in that case you may report an R2 of zero).
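The three calibration models named above can be illustrated in a single dataset with base R. This is a sketch on simulated data, not the package's pooled procedure: the variable names and the simulation settings are assumptions, and no pooling across imputed datasets is done here.

```r
set.seed(1)
n  <- 500
lp <- rnorm(n, mean = -0.5, sd = 1.2)        # linear predictor of the original model (toy)
y  <- rbinom(n, 1, plogis(0.3 + 0.8 * lp))   # outcome in the validation data (toy)

# Outcome ~ a + LP: intercept and slope freely estimated
fit_free <- glm(y ~ lp, family = binomial)

# Outcome ~ a + offset(LP): slope fixed at 1, intercept shows calibration-in-the-large
fit_off_int <- glm(y ~ offset(lp), family = binomial)

# Outcome ~ a + LP + offset(LP): coefficient of LP is the deviation of the slope from 1
fit_off_slope <- glm(y ~ lp + offset(lp), family = binomial)

cal_slope     <- unname(coef(fit_free)["lp"])
cal_int_large <- unname(coef(fit_off_int)["(Intercept)"])
slope_dev     <- unname(coef(fit_off_slope)["lp"])
```

A calibration slope below 1 (equivalently, a negative `slope_dev`) suggests predictions that are too extreme in the validation data; an intercept far from 0 in the offset model suggests systematic over- or underestimation.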
A mivalext_lr object from which the following objects
can be extracted: calibrate with information about
mis-calibration in intercept and slope with and without offset procedure,
coef_pooled, coefficients pooled, ROC results as ROC,
R squared results as R2, Hosmer and Lemeshow test as HL_test,
nimp, formula, impvar, val.check, g,
coef.check and groups_cal.
F. Harrell. Regression Modeling Strategies. With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd Edition. Springer, New York, NY, 2015.
EW. Steyerberg (2019). Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating (2nd edition). Springer Nature Switzerland AG.
Van Buuren S. (2018). Flexible Imputation of Missing Data. 2nd Edition. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton.
http://missingdatasolutions.rbind.io/
mivalext_lr(data.val=lbpmilr, nimp=5, impvar="Impnr",
  formula = Chronic ~ Gender + factor(Carrying) + Function + Tampascale + Age,
  lp.orig=c(-10, -0.35, 1.00, 1.00, -0.04, 0.26, -0.01),
  cal.plot=TRUE, val.check = FALSE)
nri_cox Net Reclassification Index for Cox Regression Models
nri_cox(data, formula0, formula1, t_risk, cutoff, B = FALSE, nboot = 10)
data |
Data frame with relevant predictors |
formula0 |
A formula object to specify the reference model as normally used by glm. See under "Details" and "Examples" how these can be specified. |
formula1 |
A formula object to specify the new model as normally used by glm. |
t_risk |
Follow-up value to calculate cases, controls. See details. |
cutoff |
A numerical vector that defines the outcome probability cutoff values. |
B |
A logical scalar. If TRUE bootstrap confidence intervals are calculated, if FALSE only the NRI estimates are reported. |
nboot |
A numerical scalar. Number of bootstrap samples to derive the percentile bootstrap confidence intervals. Default is 10. |
A typical formula object has the form Outcome ~ terms. Categorical variables have to
be defined as Outcome ~ factor(variable), restricted cubic spline variables as
Outcome ~ rcs(variable, 3). Interaction terms can be defined as
Outcome ~ variable1*variable2 or Outcome ~ variable1 + variable2 + variable1:variable2.
All variables in the terms part have to be separated by a "+". If a formula
object is used, set predictors, cat.predictors, spline.predictors and int.predictors
to the default value of NULL.
t_risk is the follow-up time at which cases and controls are determined. For persons censored
before this follow-up time, the expected risk of being a case is calculated from the Kaplan-Meier
estimate, which gives the expected number of cases. These expected numbers are used to calculate
the NRI proportions but are not shown by function nricens.
An object from which the following objects can be extracted:
data dataset.
prob_orig outcome risk probabilities at t_risk for reference model.
prob_new outcome risk probabilities at t_risk for new model.
time name of time variable.
status name of status variable.
cutoff cutoff value for survival probability.
t_risk follow-up time used to calculate outcome (risk) probabilities.
reclass_totals table with total reclassification numbers.
reclass_cases table with reclassification numbers for cases.
reclass_controls table with reclassification numbers for controls.
totals totals of controls, cases, censored cases.
km_est totals of cases calculated using Kaplan-Meier risk estimates.
nri_est reclassification measures.
Martijn Heymans, 2023
Cook NR, Ridker PM. Advances in measuring the effect of individual predictors of cardiovascular risk: the role of reclassification measures. Ann Intern Med. 2009;150(11):795-802.
Steyerberg EW, Pencina MJ. Reclassification calculations for persons with incomplete follow-up. Ann Intern Med. 2010;152(3):195-6; author reply 196-7.
Pencina MJ, D'Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30(1):11-21
Inoue E (2018). nricens: NRI for Risk Prediction Models with Time to Event and Binary Response Data. R package version 1.6, <https://CRAN.R-project.org/package=nricens>.
library(survival)
lbpmicox1 <- subset(psfmi::lbpmicox, Impnr==1) # extract one dataset
risk_est <- nri_cox(data=lbpmicox1,
  formula0 = Surv(Time, Status) ~ Duration + Pain,
  formula1 = Surv(Time, Status) ~ Duration + Pain + Function + Radiation,
  t_risk = 80, cutoff=c(0.45), B=TRUE, nboot=10)
nri_est Calculation of proportion of Reclassified persons and NRI for Cox
Regression Models
nri_est(data, p0, p1, time, status, t_risk, cutoff)
data |
Data frame with relevant predictors |
p0 |
risk outcome probabilities for reference model. |
p1 |
risk outcome probabilities for new model. |
time |
Character vector. Name of time variable. |
status |
Character vector. Name of status variable. |
t_risk |
Follow-up value to calculate cases, controls. See details. |
cutoff |
A numerical vector that defines the outcome probability cutoff values. |
t_risk is the follow-up time at which cases and controls are determined. For persons censored
before this follow-up time, the expected risk of being a case is calculated from the Kaplan-Meier
estimate, which gives the expected number of cases. These expected numbers are used to calculate
the NRI proportions but are not shown by function nricens.
An object from which the following objects can be extracted:
prop_up_case proportion of cases reclassified upwards.
prop_down_case proportion of cases reclassified downwards.
prop_up_ctr proportion of controls reclassified upwards.
prop_down_ctr proportion of controls reclassified downwards.
nri_plus proportion reclassified for events.
nri_min proportion reclassified for nonevents.
nri net reclassification improvement.
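How the listed proportions combine into the NRI can be sketched for the simplest case, a binary outcome without censoring and a single cutoff. This is an illustration under assumptions: the data are made up, a risk at or above the cutoff is treated as "high", and the Kaplan-Meier correction for censored persons that `nri_est` applies is left out.

```r
# Made-up predicted risks from a reference model (p0) and a new model (p1)
p0    <- c(0.30, 0.50, 0.40, 0.60, 0.20, 0.48, 0.55, 0.35)
p1    <- c(0.50, 0.55, 0.30, 0.70, 0.10, 0.40, 0.40, 0.47)
event <- c(1,    1,    1,    1,    0,    0,    0,    0)   # 1 = case, 0 = control
cutoff <- 0.45

high0 <- p0 >= cutoff          # high-risk category under the reference model
high1 <- p1 >= cutoff          # high-risk category under the new model
up    <- !high0 & high1        # reclassified upwards
down  <- high0 & !high1        # reclassified downwards

prop_up_case   <- mean(up[event == 1])
prop_down_case <- mean(down[event == 1])
prop_up_ctr    <- mean(up[event == 0])
prop_down_ctr  <- mean(down[event == 0])

nri_plus <- prop_up_case - prop_down_case    # net improvement among cases
nri_min  <- prop_down_ctr - prop_up_ctr      # net improvement among controls
nri      <- nri_plus + nri_min
```

Upward moves count as improvements for cases and downward moves as improvements for controls, which is why the two differences have opposite orientation.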
Martijn Heymans, 2023
Cook NR, Ridker PM. Advances in measuring the effect of individual predictors of cardiovascular risk: the role of reclassification measures. Ann Intern Med. 2009;150(11):795-802.
Steyerberg EW, Pencina MJ. Reclassification calculations for persons with incomplete follow-up. Ann Intern Med. 2010;152(3):195-6; author reply 196-7.
Pencina MJ, D'Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30(1):11-21
Inoue E (2018). nricens: NRI for Risk Prediction Models with Time to Event and Binary Response Data. R package version 1.6, <https://CRAN.R-project.org/package=nricens>.
library(survival)
lbpmicox1 <- subset(psfmi::lbpmicox, Impnr==1) # extract dataset
fit_cox0 <- coxph(Surv(Time, Status) ~ Duration + Pain, data=lbpmicox1, x=TRUE)
fit_cox1 <- coxph(Surv(Time, Status) ~ Duration + Pain + Function + Radiation, data=lbpmicox1, x=TRUE)
p0 <- risk_coxph(fit_cox0, t_risk=80)
p1 <- risk_coxph(fit_cox1, t_risk=80)
nri <- nri_est(data=lbpmicox1, p0=p0, p1=p1, time = "Time", status = "Status", cutoff=0.45, t_risk=80)
pool_auc Calculates the pooled C-statistic and 95% confidence interval
by using Rubin's Rules. The C-statistic values are log transformed before pooling.
pool_auc(est_auc, est_se, nimp = 5, log_auc = TRUE)
est_auc |
A list of C-statistic (AUC/ROC) values estimated in Multiply Imputed datasets. |
est_se |
A list of standard errors of C-statistic values estimated in Multiply Imputed datasets. |
nimp |
A numerical scalar. Number of imputed datasets. Default is 5. |
log_auc |
If TRUE natural logarithmic transformation is applied before pooling and finally back transformed. If FALSE the raw values are pooled. |
The pooled C-statistic value and the 95% confidence interval.
Martijn Heymans, 2021
psfmi_perform, pool_performance
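Pooling a C-statistic on a transformed scale and back transforming the result might look like the sketch below. It rests on assumptions: the transformation is taken to be the logit with delta-method standard errors, the interval uses a normal quantile, and the input values are invented. The transformation and reference distribution used inside pool_auc may differ.

```r
est_auc <- c(0.78, 0.80, 0.76, 0.81, 0.79)        # invented per-imputation C-statistics
est_se  <- c(0.030, 0.028, 0.033, 0.029, 0.031)   # invented standard errors
m <- length(est_auc)

# Transform to the logit scale; standard errors follow from the delta method
logit_auc <- qlogis(est_auc)
logit_se  <- est_se / (est_auc * (1 - est_auc))

q_bar <- mean(logit_auc)                  # pooled estimate on the logit scale
w_var <- mean(logit_se^2)                 # within-imputation variance
b_var <- var(logit_auc)                   # between-imputation variance
t_var <- w_var + (1 + 1 / m) * b_var      # total variance according to Rubin's Rules

ci_logit <- q_bar + c(-1, 1) * qnorm(0.975) * sqrt(t_var)

# Back transform estimate and limits to the C-statistic scale
pooled <- plogis(c(ci_logit[1], q_bar, ci_logit[2]))
names(pooled) <- c("lower", "auc", "upper")
```

Working on a transformed scale keeps the back-transformed limits inside the interval from 0 to 1, which pooling the raw values does not guarantee.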
pool_compare_models Compares the fit and performance of prediction models
in multiply imputed data sets by using clinically important performance measures
pool_compare_models(
  pobj,
  compare.predictors = NULL,
  compare.group = NULL,
  cutoff = 0.5,
  boot_auc = FALSE,
  nboot = 1000
)
pobj |
An object of class |
compare.predictors |
Character vector with the names of the predictors that are compared. See details. |
compare.group |
Character vector with the names of the group of predictors that are compared. See details. |
cutoff |
A numerical scalar. Cutoff used for the categorical NRI value. More than one cutoff value can be used. |
boot_auc |
If TRUE the standard error of the AUC is calculated with stratified bootstrapping. If FALSE (is default), the standard error is calculated with De Long's method. |
nboot |
A numerical scalar. The number of bootstrap samples for the AUC standard error, used when boot_auc is TRUE. Default is 1000. |
The fit of the models is compared by using the D3 method for pooling likelihood ratio
statistics (method of Meng and Rubin). The pooled AIC difference is calculated according to
the formula AIC = D - 2*p, where D is the pooled likelihood ratio tests of
constrained models (numerator in D3 statistic) and p is the difference in number of parameters
between the full and restricted models that are compared. The pooled AUC difference
is calculated, after the standard error is obtained in each imputed data set by method
DeLong or bootstrapping. The NRI categorical and continuous and IDI are calculated in each
imputed data set and pooled.
An object from which the following objects can be extracted:
DR_stats p-value of the D3 statistic, the D3 statistic, LRT fixed is the
likelihood Ratio test value of the constrained models.
stats_compare Mean of LogLik0, LogLik1, AIC0, AIC1, AIC_diff values of the
restricted (containing a 0) and full models (containing a 1).
NRI pooled values for the categorical and continuous Net Reclassification
improvement values and the Integrated Discrimination improvement.
AUC_stats Pooled Area Under the Curve of restricted and full models.
AUC_diff Pooled difference in AUC.
formula_test regression formula of full model.
cutoff Cutoff value used for reclassification values.
formula_null regression formula of null model
compare_predictors Predictors used in full model.
compare_group group of predictors used in full model.
Eekhout I, van de Wiel MA, Heymans MW. Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis. BMC Med Res Methodol. 2017;17(1):129.
Consentino F, Claeskens G. Order selection tests with multiply imputed data. Computational Statistics and Data Analysis. 2010;54:2284-2295.
pool_lr <- psfmi_lr(data=lbpmilr, p.crit = 1, direction="FW", nimp=10, impvar="Impnr",
  Outcome="Chronic", predictors=c("Radiation"), cat.predictors = ("Satisfaction"),
  int.predictors = NULL, spline.predictors="Tampascale", nknots=3, method="D1")
res_compare <- pool_compare_models(pool_lr, compare.predictors = c("Pain", "Duration", "Function"), cutoff = 0.4)
res_compare
pool_D2 The D2 statistic to combine the Chi square values
across Multiply Imputed datasets.
pool_D2(dw, v)
dw |
a vector of Chi square values obtained after multiple imputation. |
v |
single value for the degrees of freedom of the Chi square statistic. |
The pooled chi square values as the D2 statistic, the p-value, the numerator, df1 and denominator, df2 degrees of freedom for the F-test.
Martijn Heymans, 2021
Eekhout I, van de Wiel MA, Heymans MW. Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis. BMC Med Res Methodol. 2017;17(1):129.
Van Buuren S. (2018). Flexible Imputation of Missing Data. 2nd Edition. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton.
pool_D2(c(2.25, 3.95, 6.24, 5.27, 2.81), 4)
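For readers who want to see the arithmetic behind a D2-type statistic, the sketch below follows the textbook formulas described in Van Buuren (2018). `d2_sketch` is a hypothetical helper written for illustration; it is not the package's pool_D2 and its output layout is an assumption.

```r
# D2 statistic for m chi-square values dw, each with v degrees of freedom
d2_sketch <- function(dw, v) {
  m   <- length(dw)
  r   <- (1 + 1 / m) * var(sqrt(dw))                        # relative increase in variance
  D2  <- (mean(dw) / v - (m + 1) / (m - 1) * r) / (1 + r)   # pooled F-type statistic
  df2 <- v^(-3 / m) * (m - 1) * (1 + 1 / r)^2               # denominator degrees of freedom
  p   <- pf(D2, df1 = v, df2 = df2, lower.tail = FALSE)
  c(D2 = D2, p = p, df1 = v, df2 = df2)
}

res_d2 <- d2_sketch(c(2.25, 3.95, 6.24, 5.27, 2.81), 4)
```

The statistic uses only the chi-square values themselves, so it can be applied when the per-imputation covariance matrices or likelihoods are not available.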
pool_D4 The D4 statistic to combine the likelihood ratio tests (LRT)
across Multiply Imputed datasets according to method D4.
pool_D4(data, nimp, impvar, fm0, fm1, robust = TRUE, model_type = "binomial")
data |
Data frame with stacked multiply imputed datasets. The original dataset that contains missing values must be excluded from the dataset. The imputed datasets must be distinguished by an imputation variable, specified under impvar, with values starting at 1. |
nimp |
A numerical scalar. Number of imputed datasets. Default is 5. |
impvar |
A character vector. Name of the variable that distinguishes the imputed datasets. |
fm0 |
the null model. |
fm1 |
the (nested) model to compare. Must be larger than the null model. |
robust |
if TRUE a robust LRT is used (algorithm 1 in Chan and Meng), otherwise algorithm 2 is used. |
model_type |
If "binomial" (default) a logistic regression model is fitted, otherwise a linear regression model is used. |
The D4 statistic, the numerator, df1 and denominator, df2 degrees of freedom for the F-test.
Martijn Heymans, 2021
Chan, K. W., & Meng, X.-L. (2019). Multiple improvements of multiple imputation likelihood ratio tests. ArXiv:1711.08822 [Math, Stat]. https://arxiv.org/abs/1711.08822
Grund, Simon, Oliver Lüdtke, and Alexander Robitzsch. 2021. “Pooling Methods for Likelihood Ratio Tests in Multiply Imputed Data Sets.” PsyArXiv. January 29. doi:10.31234/osf.io/d459g.
fm0 <- Chronic ~ BMI + factor(Carrying) + Satisfaction + SocialSupport + Smoking
fm1 <- Chronic ~ BMI + factor(Carrying) + Satisfaction + SocialSupport + Smoking + Radiation
psfmi::pool_D4(data=lbpmilr, nimp=10, impvar="Impnr", fm0=fm0, fm1=fm1, robust = TRUE)
pool_intadj Provides pooled adjusted intercept after shrinkage of the pooled coefficients
in multiply imputed datasets for models selected with the psfmi_lr function and
internally validated with the psfmi_perform function.
pool_intadj(pobj, shrinkage_factor)
pobj |
An object of class |
shrinkage_factor |
A numerical scalar. Shrinkage factor value as a result of internal validation
with the |
The function provides the pooled adjusted intercept after shrinkage of pooled regression coefficients in multiply imputed datasets. The function is only available for logistic regression models without random effects.
A pool_intadj object from which the following objects can be extracted: int_adj,
the adjusted intercept value, coef_shrink_pooled, the pooled regression coefficients
after shrinkage, coef_orig_pooled, the (original) pooled regression coefficients before
shrinkage and nimp, the number of imputed datasets.
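The principle of shrinking coefficients and then re-estimating the intercept can be shown in one complete dataset with base R. This is a sketch under assumptions: the data are simulated, the shrinkage factor of 0.9 is arbitrary, and no pooling over imputed datasets is performed, so it does not reproduce pool_intadj itself.

```r
set.seed(2)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.4 + 0.9 * x1 - 0.6 * x2))   # simulated binary outcome

fit <- glm(y ~ x1 + x2, family = binomial)
shrinkage_factor <- 0.9                      # arbitrary value, e.g. from internal validation

coef_orig   <- coef(fit)[-1]                 # regression coefficients without the intercept
coef_shrink <- shrinkage_factor * coef_orig  # coefficients after shrinkage

# Linear predictor based on the shrunken coefficients only
lp_shrink <- as.vector(cbind(x1, x2) %*% coef_shrink)

# Re-estimate the intercept with the shrunken linear predictor held fixed as offset
fit_adj <- glm(y ~ offset(lp_shrink), family = binomial)
int_adj <- unname(coef(fit_adj))
```

Re-estimating the intercept keeps the average predicted probability equal to the observed outcome proportion after the slopes have been shrunk.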
F. Harrell. Regression Modeling Strategies. With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (2nd edition). Springer, New York, NY, 2015.
EW. Steyerberg (2019). Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating (2nd edition). Springer Nature Switzerland AG.
http://missingdatasolutions.rbind.io/
res_psfmi <- psfmi_lr(data=lbpmilr, nimp=5, impvar="Impnr", Outcome="Chronic",
  predictors=c("Gender", "Pain","Tampascale","Smoking","Function", "Radiation", "Age"),
  p.crit = 1, method="D1", direction="BW")
res_psfmi$RR_Model

## Not run:
set.seed(100)
res_val <- psfmi_perform(res_psfmi, method = "MI_boot", nboot=10, int_val = TRUE, p.crit=1,
  cal.plot=FALSE, plot.indiv=FALSE)
res_val$intval

res <- pool_intadj(res_psfmi, shrinkage_factor = 0.9774058)
res$int_adj
res$coef_shrink_pooled
## End(Not run)
pool_performance Pooling performance measures for logistic
and Cox regression models.
pool_performance(
  data,
  formula,
  nimp,
  impvar,
  plot.indiv,
  model_type = "binomial",
  cal.plot = TRUE,
  plot.method = "mean",
  groups_cal = 10
)
data |
Data frame with stacked multiple imputed datasets. The original dataset that contains missing values must be excluded from the dataset. |
formula |
A formula object to specify the model as normally used by glm or coxph. See details. |
nimp |
A numerical scalar. Number of imputed datasets. Default is 5. |
impvar |
A character vector. Name of the variable that distinguishes the imputed datasets. |
plot.indiv |
This argument is deprecated; please use plot.method instead. |
model_type |
If "binomial" (default), performance measures are calculated for logistic regression models, if "survival" for Cox regression models. See details. |
cal.plot |
If TRUE a calibration plot is generated. Default is TRUE. model_type must be "binomial". |
plot.method |
If "mean" one calibration plot is generated, first taking the mean of the linear predictor across the multiply imputed datasets (default), if "individual" the calibration plot of each imputed dataset is plotted, if "overlay" calibration plots from each imputed datasets are plotted in one figure. |
groups_cal |
A numerical scalar. Number of groups used on the calibration plot and for the Hosmer and Lemeshow test. Default is 10. If the range of predicted probabilities is low, fewer than 10 groups can be chosen, but not < 3. |
A typical formula object for logistic regression models has the form
formula = Outcome ~ terms. For Cox regression models the formula object must
be defined as Surv(time, status) ~ terms. For Cox models calibration curves
cannot be generated.
perf <- pool_performance(data=lbpmilr, nimp=5, impvar="Impnr",
  formula = Chronic ~ Gender + Pain + Tampascale + Smoking + Function + Radiation + Age + factor(Carrying),
  cal.plot=TRUE, plot.method="mean", groups_cal=10, model_type="binomial")
perf$ROC_pooled
perf$R2_pooled
pool_reclassification Function to pool categorical and continuous NRI
and IDI over Multiply Imputed datasets
pool_reclassification(datasets, cutoff = cutoff)
datasets |
a list of data frames corresponding to the multiply imputed datasets. In each data frame the first column contains the predicted probabilities of model 1, the second column those of model 2 and the third column the observed outcomes coded as '0' and '1'. |
cutoff |
cutoff value for the categorical NRI, must lie between 0 and 1. |
This function is called by the function pool_compare_models.
Martijn Heymans, 2020
pool_RR Rubin's Rules
pool_RR(est, se, conf.level = 0.95, n, k)
est |
A vector of multiple parameter estimates |
se |
A vector of multiple standard error estimates |
conf.level |
desired confidence limits |
n |
sample size in completed dataset |
k |
number of parameters to pool |
Martijn Heymans, 2021
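Rubin's Rules for a single parameter can be written out in a few lines of base R. The sketch below is illustrative only: `rubin_sketch` is a hypothetical helper, the input values are invented, and the use of `n` and `k` for the adjusted degrees of freedom (Barnard and Rubin, as described in Van Buuren 2018) is an assumption about what those arguments are for in pool_RR.

```r
rubin_sketch <- function(est, se, conf.level = 0.95, n, k) {
  m      <- length(est)
  q_bar  <- mean(est)                       # pooled estimate
  w_var  <- mean(se^2)                      # within-imputation variance
  b_var  <- var(est)                        # between-imputation variance
  t_var  <- w_var + (1 + 1 / m) * b_var     # total variance
  lambda <- (1 + 1 / m) * b_var / t_var     # proportion of variance due to missingness

  # Adjusted degrees of freedom for small samples
  df_old <- (m - 1) / lambda^2
  df_obs <- (n - k + 1) / (n - k + 3) * (n - k) * (1 - lambda)
  df_adj <- df_old * df_obs / (df_old + df_obs)

  half <- qt(1 - (1 - conf.level) / 2, df_adj) * sqrt(t_var)
  c(estimate = q_bar, se = sqrt(t_var),
    lower = q_bar - half, upper = q_bar + half, df = df_adj)
}

# Invented estimates and standard errors from five imputed datasets
res_rr <- rubin_sketch(est = c(0.52, 0.47, 0.55, 0.50, 0.49),
                       se  = c(0.11, 0.12, 0.10, 0.11, 0.12),
                       n = 159, k = 1)
```

The pooled standard error is always at least as large as the average within-imputation standard error, because the between-imputation variance is added on top.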
psfmi_coxr Pooling and backward or forward selection of Cox regression
prediction models in multiply imputed data using selection methods D1, D2 and MPR.
psfmi_coxr(
  data,
  formula = NULL,
  nimp = 5,
  impvar = NULL,
  time,
  status,
  predictors = NULL,
  cat.predictors = NULL,
  spline.predictors = NULL,
  int.predictors = NULL,
  keep.predictors = NULL,
  strata.variable = NULL,
  nknots = NULL,
  p.crit = 1,
  method = "RR",
  direction = NULL
)
data |
Data frame with stacked multiply imputed datasets. The original dataset that contains missing values must be excluded from the dataset. The imputed datasets must be distinguished by an imputation variable, specified under impvar, with values starting at 1. |
formula |
A formula object to specify the model as normally used by coxph. See under "Details" and "Examples" how these can be specified. If a formula object is used set predictors, cat.predictors, spline.predictors or int.predictors at the default value of NULL. |
nimp |
A numerical scalar. Number of imputed datasets. Default is 5. |
impvar |
A character vector. Name of the variable that distinguishes the imputed datasets. |
time |
Survival time. |
status |
The status variable, normally 0=censoring, 1=event. |
predictors |
Character vector with the names of the predictor variables. At least one predictor variable has to be defined. Give predictors unique names and do not use predictor name combinations with numbers, such as age2, gender10, etc. |
cat.predictors |
A single string or a vector of strings to define the categorical variables. Default is NULL (no categorical predictors). |
spline.predictors |
A single string or a vector of strings to define the (restricted cubic) spline variables. Default is NULL (no spline predictors). See details. |
int.predictors |
A single string or a vector of strings with the names of the variables that form an interaction pair, separated by a “:” symbol. |
keep.predictors |
A single string or a vector of strings including the variables that are forced in the model during predictor selection. Categorical and interaction variables are allowed. |
strata.variable |
A single string including the strata variable. See under "Details" and "Examples" how such a variable can be specified. |
nknots |
A numerical vector that defines the number of knots for each spline predictor separately. |
p.crit |
A numerical scalar. P-value selection criterion. A value of 1 provides the pooled model without selection. |
method |
A character vector to indicate the pooling method for p-values to pool the total model or used during predictor selection. This can be "RR", "D1", "D2", or "MPR". See details for more information. Default is "RR". |
direction |
The direction of predictor selection, "BW" means backward selection and "FW" means forward selection. |
The basic pooling procedure to derive pooled coefficients, standard errors, 95% confidence intervals and p-values is Rubin's Rules (RR). However, RR is only possible when the model includes continuous or dichotomous variables. Specific procedures are available when the model also includes categorical (> 2 categories) or restricted cubic spline variables. These pooling methods are: "D1", pooling of the total covariance matrix; "D2", pooling of Chi-square values; and "MPR", pooling of median p-values (MPR rule). Spline regression coefficients are defined by using the rcs function for restricted cubic splines of the rms package. A minimum number of 3 knots, as defined under nknots, is required.
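As an illustration of Rubin's Rules for a single coefficient, the following base R sketch pools estimates and standard errors from five imputed datasets. The helper pool_rr and the input values are hypothetical and not part of psfmi; the degrees of freedom shown are the classical large-sample version, and the package may apply a different adjustment.

```r
# Pool one regression coefficient across m imputed datasets with Rubin's Rules.
# est: coefficient estimates, se: standard errors (one value per imputed dataset).
# pool_rr is a hypothetical helper for illustration, not a psfmi function.
pool_rr <- function(est, se) {
  m <- length(est)
  qbar <- mean(est)                 # pooled estimate
  ubar <- mean(se^2)                # within-imputation variance
  b <- var(est)                     # between-imputation variance
  total <- ubar + (1 + 1 / m) * b   # total variance
  r <- (1 + 1 / m) * b / ubar       # relative increase in variance
  df <- (m - 1) * (1 + 1 / r)^2     # classical large-sample degrees of freedom
  tval <- qbar / sqrt(total)
  crit <- qt(0.975, df)
  c(estimate = qbar, se = sqrt(total),
    lower = qbar - crit * sqrt(total), upper = qbar + crit * sqrt(total),
    p.value = 2 * pt(-abs(tval), df))
}

# Made-up estimates from five imputed datasets
est <- c(0.52, 0.48, 0.55, 0.50, 0.45)
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)
res <- pool_rr(est, se)
res
```

The pooled standard error is always at least as large as the average within-imputation standard error, because the between-imputation variance is added on top.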
A typical formula object has the form Surv(time, status) ~ terms. Categorical variables have to
be defined as Surv(time, status) ~ factor(variable), restricted cubic spline variables as
Surv(time, status) ~ rcs(variable, 3). Interaction terms can be defined as
Surv(time, status) ~ variable1*variable2 or Surv(time, status) ~ variable1 + variable2 +
variable1:variable2. All variables in the terms part have to be separated by a "+". If a formula
object is used, set predictors, cat.predictors, spline.predictors and int.predictors
to the default value of NULL. For Cox models a strata variable can also be included in
the formula as Surv(time, status) ~ strata(variable) + terms.
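A formula of the form described above can also be assembled from character vectors, which may be convenient when the set of predictors is generated programmatically. This is a minimal sketch in base R; the variable names are illustrative assumptions and the object fm is only constructed here, not passed to any fitting function.

```r
# Illustrative variable names (assumptions, not taken from a specific dataset)
predictors  <- c("Pain", "Age")
categorical <- "Satisfaction"   # wrapped in factor()
splines     <- "Tampascale"     # wrapped in rcs() with 3 knots
strata_var  <- "Radiation"      # wrapped in strata()

# All terms are joined by "+", as required for the terms part of the formula
terms_rhs <- c(
  paste0("strata(", strata_var, ")"),
  predictors,
  paste0("factor(", categorical, ")"),
  paste0("rcs(", splines, ", 3)")
)
fm <- as.formula(paste("Surv(Time, Status) ~", paste(terms_rhs, collapse = " + ")))
fm

# Variables referenced by the formula
all.vars(fm)
```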
An object of class pmods (multiply imputed models) from
which the following objects can be extracted:
data imputed datasets
RR_model pooled model at each selection step
RR_model_final final selected pooled model
multiparm pooled p-values at each step according to pooling method
multiparm_final pooled p-values at final step according to pooling method
multiparm_out (only when direction = "FW") pooled p-values of removed predictors
formula_step formula object at each step
formula_final formula object at final step
formula_initial formula object of the start model
predictors_in predictors included at each selection step
predictors_out predictors excluded at each step
impvar name of variable used to distinguish imputed datasets
nimp number of imputed datasets
status name of the status variable
time name of the time variable
method selection method
p.crit p-value selection criterion
call function call
model_type type of regression model used
direction direction of predictor selection
predictors_final names of predictors in final selection step
predictors_initial names of predictors in start model
keep.predictors names of predictors that were forced in the model
strata.variable name of the strata variable in the model
https://mwheymans.github.io/psfmi/articles/psfmi_CoxModels.html
Martijn Heymans, 2020
Eekhout I, van de Wiel MA, Heymans MW. Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis. BMC Med Res Methodol. 2017;17(1):129.
Enders CK (2010). Applied missing data analysis. New York: The Guilford Press.
van de Wiel MA, Berkhof J, van Wieringen WN. Testing the prediction error difference between 2 predictors. Biostatistics. 2009;10:550-60.
Marshall A, Altman DG, Holder RL, Royston P. Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines. BMC Med Res Methodol. 2009;9:57.
Van Buuren S. (2018). Flexible Imputation of Missing Data. 2nd Edition. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton.
Steyerberg EW (2019). Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating (2nd edition). Springer Nature Switzerland AG.
http://missingdatasolutions.rbind.io/
pool_coxr <- psfmi_coxr(formula = Surv(Time, Status) ~ Pain + Tampascale +
  Radiation + Radiation*Pain + Age + Duration + Previous,
  data=lbpmicox, p.crit = 0.05, direction="BW", nimp=5, impvar="Impnr",
  keep.predictors = "Radiation*Pain", method="D1")
pool_coxr$RR_model_final

pool_coxr <- psfmi_coxr(formula = Surv(Time, Status) ~ Pain + Tampascale +
  Previous + strata(Radiation), data=lbpmicox, p.crit = 0.05,
  direction="BW", nimp=5, impvar="Impnr", method="D1")
pool_coxr$RR_model_final
psfmi_lm Pooling and backward or forward selection of Linear regression
models in multiply imputed data using selection methods RR, D1, D2 and MPR.
psfmi_lm(data, formula = NULL, nimp = 5, impvar = NULL, Outcome = NULL,
  predictors = NULL, cat.predictors = NULL, spline.predictors = NULL,
  int.predictors = NULL, keep.predictors = NULL, nknots = NULL,
  p.crit = 1, method = "RR", direction = NULL)
data |
Data frame with stacked multiply imputed datasets. The original dataset that contains missing values must be excluded from the dataset. The imputed datasets must be distinguished by an imputation variable, specified under impvar, and starting at 1. |
formula |
A formula object to specify the model as normally used by glm. See under "Details" and "Examples" how these can be specified. If a formula object is used set predictors, cat.predictors, spline.predictors or int.predictors at the default value of NULL. |
nimp |
A numerical scalar. Number of imputed datasets. Default is 5. |
impvar |
A character vector. Name of the variable that distinguishes the imputed datasets. |
Outcome |
Character vector containing the name of the continuous outcome variable. |
predictors |
Character vector with the names of the predictor variables. At least one predictor variable has to be defined. Give predictors unique names and do not use predictor names that combine a name with numbers, such as age2, gender10, etc. |
cat.predictors |
A single string or a vector of strings to define the categorical variables. Default is NULL (no categorical predictors). |
spline.predictors |
A single string or a vector of strings to define the (restricted cubic) spline variables. Default is NULL (no spline predictors). See details. |
int.predictors |
A single string or a vector of strings with the names of the variables that form an interaction pair, separated by a ":" symbol. |
keep.predictors |
A single string or a vector of strings including the variables that are forced in the model during predictor selection. All types of variables are allowed. |
nknots |
A numerical vector that defines the number of knots for each spline predictor separately. |
p.crit |
A numerical scalar. P-value selection criterion. A value of 1 provides the pooled model without selection. |
method |
A character vector to indicate the method used to pool p-values, either for the total model or during predictor selection. This can be "RR", "D1", "D2", "D3" or "MPR". See details for more information. Default is "RR". |
direction |
The direction of predictor selection, "BW" means backward selection and "FW" means forward selection. |
The basic pooling procedure to derive pooled coefficients, standard errors, 95% confidence intervals and p-values is Rubin's Rules (RR). However, RR is only possible when the model includes continuous or dichotomous variables. Specific procedures are available when the model also includes categorical (> 2 categories) or restricted cubic spline variables. These pooling methods are: "D1", pooling of the total covariance matrix; "D2", pooling of Chi-square values; and "MPR", pooling of median p-values (MPR rule). Spline regression coefficients are defined by using the rcs function for restricted cubic splines of the rms package. A minimum number of 3 knots, as defined under nknots, is required.
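The median p-value rule (MPR) mentioned above is the simplest of these methods to show in code. The sketch below uses made-up p-values for one multi-parameter test, for example a categorical predictor with three categories, obtained separately in each imputed dataset. Whether the comparison with p.crit is strict is an assumption made here for illustration.

```r
# p-values of the same multi-parameter test in each of m = 5 imputed datasets
# (made-up values for illustration)
p_imp <- c(0.031, 0.048, 0.027, 0.060, 0.039)

# Median P Rule: the pooled p-value is the median across imputed datasets
p_mpr <- median(p_imp)
p_mpr

# Selection decision against a p-value criterion (strict "<" is an assumption)
p.crit <- 0.05
keep <- p_mpr < p.crit
keep
```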
A typical formula object has the form Outcome ~ terms. Categorical variables have to
be defined as Outcome ~ factor(variable), restricted cubic spline variables as
Outcome ~ rcs(variable, 3). Interaction terms can be defined as
Outcome ~ variable1*variable2 or Outcome ~ variable1 + variable2 + variable1:variable2.
All variables in the terms part have to be separated by a "+". If a formula
object is used, set predictors, cat.predictors, spline.predictors and int.predictors
to the default value of NULL.
An object of class pmods (multiply imputed models) from
which the following objects can be extracted:
data imputed datasets
RR_model pooled model at each selection step
RR_model_final final selected pooled model
multiparm pooled p-values at each step according to pooling method
multiparm_final pooled p-values at final step according to pooling method
multiparm_out (only when direction = "FW") pooled p-values of removed predictors
formula_step formula object at each step
formula_final formula object at final step
formula_initial formula object of the start model
predictors_in predictors included at each selection step
predictors_out predictors excluded at each step
impvar name of variable used to distinguish imputed datasets
nimp number of imputed datasets
Outcome name of the outcome variable
method selection method
p.crit p-value selection criterion
call function call
model_type type of regression model used
direction direction of predictor selection
predictors_final names of predictors in final selection step
predictors_initial names of predictors in start model
keep.predictors names of predictors that were forced in the model
Martijn Heymans, 2021
Enders CK (2010). Applied missing data analysis. New York: The Guilford Press.
van de Wiel MA, Berkhof J, van Wieringen WN. Testing the prediction error difference between 2 predictors. Biostatistics. 2009;10:550-60.
Marshall A, Altman DG, Holder RL, Royston P. Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines. BMC Med Res Methodol. 2009;9:57.
Van Buuren S. (2018). Flexible Imputation of Missing Data. 2nd Edition. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton.
Steyerberg EW (2019). Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating (2nd edition). Springer Nature Switzerland AG.
http://missingdatasolutions.rbind.io/
pool_lm <- psfmi_lm(data=lbpmilr, formula = Pain ~ factor(Satisfaction) +
  rcs(Tampascale,3) + Radiation + Radiation*factor(Satisfaction) + Age +
  Duration + BMI, p.crit = 0.05, direction="FW", nimp=5, impvar="Impnr",
  keep.predictors = c("Radiation*factor(Satisfaction)", "Age"), method="D1")
pool_lm$RR_model_final
psfmi_lr Pooling and backward or forward selection of Logistic regression
models across multiply imputed data using selection methods RR, D1, D2, D3, D4 and MPR.
psfmi_lr(data, formula = NULL, nimp = 5, impvar = NULL, Outcome = NULL,
  predictors = NULL, cat.predictors = NULL, spline.predictors = NULL,
  int.predictors = NULL, keep.predictors = NULL, nknots = NULL,
  p.crit = 1, method = "RR", direction = NULL)
data |
Data frame with stacked multiply imputed datasets. The original dataset that contains missing values must be excluded from the dataset. The imputed datasets must be distinguished by an imputation variable, specified under impvar, and starting at 1. |
formula |
A formula object to specify the model as normally used by glm. See under "Details" and "Examples" how these can be specified. If a formula object is used set predictors, cat.predictors, spline.predictors or int.predictors at the default value of NULL. |
nimp |
A numerical scalar. Number of imputed datasets. Default is 5. |
impvar |
A character vector. Name of the variable that distinguishes the imputed datasets. |
Outcome |
Character vector containing the name of the outcome variable. |
predictors |
Character vector with the names of the predictor variables. At least one predictor variable has to be defined. Give predictors unique names and do not use predictor names that combine a name with numbers, such as age2, gender10, etc. |
cat.predictors |
A single string or a vector of strings to define the categorical variables. Default is NULL (no categorical predictors). |
spline.predictors |
A single string or a vector of strings to define the (restricted cubic) spline variables. Default is NULL (no spline predictors). See details. |
int.predictors |
A single string or a vector of strings with the names of the variables that form an interaction pair, separated by a ":" symbol. |
keep.predictors |
A single string or a vector of strings including the variables that are forced in the model during predictor selection. All types of variables are allowed. |
nknots |
A numerical vector that defines the number of knots for each spline predictor separately. |
p.crit |
A numerical scalar. P-value selection criterion. A value of 1 provides the pooled model without selection. |
method |
A character vector to indicate the method used to pool p-values, either for the total model or during predictor selection. This can be "RR", "D1", "D2", "D3", "D4" or "MPR". See details for more information. Default is "RR". |
direction |
The direction of predictor selection, "BW" means backward selection and "FW" means forward selection. |
The basic pooling procedure to derive pooled coefficients, standard errors, 95% confidence intervals and p-values is Rubin's Rules (RR). However, RR is only possible when the model includes continuous or dichotomous variables. Specific procedures are available when the model also includes categorical (> 2 categories) or restricted cubic spline variables. These pooling methods are: "D1", pooling of the total covariance matrix; "D2", pooling of Chi-square values; "D3" and "D4", pooling of likelihood ratio statistics (method of Meng and Rubin); and "MPR", pooling of median p-values (MPR rule). Spline regression coefficients are defined by using the rcs function for restricted cubic splines of the rms package. A minimum number of 3 knots, as defined under nknots, is required.
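To make the D2 idea of pooling Chi-square values concrete, the following base R sketch applies the commonly cited D2 statistic to made-up Wald chi-square values from five imputed datasets. The helper pool_d2 is hypothetical and written from the textbook formula; it is not the implementation used by psfmi, which may differ in detail.

```r
# Pool m chi-square (Wald) statistics, each with k degrees of freedom (D2).
# pool_d2 is a hypothetical helper for illustration, not a psfmi function.
pool_d2 <- function(chisq, k) {
  m <- length(chisq)
  r <- (1 + 1 / m) * var(sqrt(chisq))           # relative increase in variance
  d2 <- (mean(chisq) / k - (m + 1) / (m - 1) * r) / (1 + r)
  df2 <- k^(-3 / m) * (m - 1) * (1 + 1 / r)^2   # denominator degrees of freedom
  p <- pf(d2, k, df2, lower.tail = FALSE)       # reference distribution F(k, df2)
  c(D2 = d2, df1 = k, df2 = df2, p.value = p)
}

# Made-up chi-square values for a predictor with 3 categories (k = 2)
chisq <- c(7.8, 6.5, 9.1, 7.2, 8.0)
res <- pool_d2(chisq, k = 2)
res
```

The pooled statistic is smaller than the plain average of the chi-square values divided by k, which reflects the penalty for variation between the imputed datasets.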
A typical formula object has the form Outcome ~ terms. Categorical variables have to
be defined as Outcome ~ factor(variable), restricted cubic spline variables as
Outcome ~ rcs(variable, 3). Interaction terms can be defined as
Outcome ~ variable1*variable2 or Outcome ~ variable1 + variable2 + variable1:variable2.
All variables in the terms part have to be separated by a "+". If a formula
object is used, set predictors, cat.predictors, spline.predictors and int.predictors
to the default value of NULL.
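Because the data argument expects stacked imputed datasets without the original incomplete data and with an imputation variable starting at 1, a quick check before calling the function can prevent mistakes. The sketch below uses toy data; coding the original dataset as 0 in the imputation variable is an assumption, following the long format produced by the mice package.

```r
# Toy stacked data: original data (Impnr = 0) followed by 3 imputed datasets
stacked <- data.frame(Impnr = rep(0:3, each = 4),
                      Pain  = rep(c(5, 3, 8, 6), times = 4))

impvar <- "Impnr"
nimp <- 3

# Drop the original dataset with missing values, if it is present
# (assumes it is coded as 0 in the imputation variable)
stacked <- stacked[stacked[[impvar]] != 0, ]

# The imputation variable should now take the values 1, ..., nimp
ids <- sort(unique(stacked[[impvar]]))
ok <- identical(as.integer(ids), seq_len(nimp))
ok
```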
An object of class pmods (multiply imputed models) from
which the following objects can be extracted:
data imputed datasets
RR_model pooled model at each selection step
RR_model_final final selected pooled model
multiparm pooled p-values at each step according to pooling method
multiparm_final pooled p-values at final step according to pooling method
multiparm_out (only when direction = "FW") pooled p-values of removed predictors
formula_step formula object at each step
formula_final formula object at final step
formula_initial formula object of the start model
predictors_in predictors included at each selection step
predictors_out predictors excluded at each step
impvar name of variable used to distinguish imputed datasets
nimp number of imputed datasets
Outcome name of the outcome variable
method selection method
p.crit p-value selection criterion
call function call
model_type type of regression model used
direction direction of predictor selection
predictors_final names of predictors in final selection step
predictors_initial names of predictors in start model
keep.predictors names of predictors that were forced in the model
https://mwheymans.github.io/psfmi/articles/psfmi_LogisticModels.html
Martijn Heymans, 2020
Eekhout I, van de Wiel MA, Heymans MW. Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis. BMC Med Res Methodol. 2017;17(1):129.
Enders CK (2010). Applied missing data analysis. New York: The Guilford Press.
Meng X-L, Rubin DB. Performing likelihood ratio tests with multiply-imputed data sets. Biometrika. 1992;79:103-11.
van de Wiel MA, Berkhof J, van Wieringen WN. Testing the prediction error difference between 2 predictors. Biostatistics. 2009;10:550-60.
Marshall A, Altman DG, Holder RL, Royston P. Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines. BMC Med Res Methodol. 2009;9:57.
Van Buuren S. (2018). Flexible Imputation of Missing Data. 2nd Edition. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton.
Steyerberg EW (2019). Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating (2nd edition). Springer Nature Switzerland AG.
http://missingdatasolutions.rbind.io/
pool_lr <- psfmi_lr(data=lbpmilr, formula = Chronic ~ Pain +
  factor(Satisfaction) + rcs(Tampascale,3) + Radiation +
  Radiation*factor(Satisfaction) + Age + Duration + BMI, p.crit = 0.05,
  direction="FW", nimp=5, impvar="Impnr",
  keep.predictors = c("Radiation*factor(Satisfaction)", "Age"), method="D1")
pool_lr$RR_model_final
psfmi_mm Pooling and backward selection for 2 level (generalized)
linear mixed models in multiply imputed datasets using different selection methods.
psfmi_mm(data, nimp = 5, impvar = NULL, clusvar = NULL, Outcome,
  predictors = NULL, random.eff = NULL, family = "linear", p.crit = 1,
  cat.predictors = NULL, spline.predictors = NULL, int.predictors = NULL,
  keep.predictors = NULL, nknots = NULL, method = "RR",
  print.method = FALSE)
data |
Data frame with stacked multiply imputed datasets. The original dataset that contains missing values must be excluded from the dataset. The imputed datasets must be distinguished by an imputation variable, specified under impvar and starting at 1, and the clusters must be distinguished by a cluster variable, specified under clusvar. |
nimp |
A numerical scalar. Number of imputed datasets. Default is 5. |
impvar |
A character vector. Name of the variable that distinguishes the imputed datasets. |
clusvar |
A character vector. Name of the variable that distinguishes the clusters. |
Outcome |
Character vector containing the name of the outcome variable. |
predictors |
Character vector with the names of the predictor variables. At least one predictor variable has to be defined. |
random.eff |
Character vector to specify the random effects as used by the
|
family |
Character vector to specify the type of model, "linear" is used to
call the |
p.crit |
A numerical scalar. P-value selection criterion. A value of 1 provides the pooled model without selection. |
cat.predictors |
A single string or a vector of strings to define the categorical variables. Default is NULL (no categorical predictors). |
spline.predictors |
A single string or a vector of strings to define the (restricted cubic) spline variables. Default is NULL (no spline predictors). See details. |
int.predictors |
A single string or a vector of strings with the names of the variables that form an interaction pair, separated by a “:” symbol. |
keep.predictors |
A single string or a vector of strings including the variables that are forced in the model during predictor selection. Categorical and interaction variables are allowed. |
nknots |
A numerical vector that defines the number of knots for each spline predictor separately. |
method |
A character vector to indicate the pooling method for p-values to pool the total model or used during predictor selection. This can be "D1", "D2", "D3" or "MPR". See details for more information. |
print.method |
Logical. If TRUE, the full matrix with p-values of all variables according to the chosen method (under method) is shown. If FALSE (default), p-values for categorical variables are shown according to method, and Rubin's Rules are used for continuous and dichotomous predictors. |
The basic pooling procedure to derive pooled coefficients, standard errors, 95%
confidence intervals and p-values is Rubin's Rules (RR). Specific procedures are
available to derive pooled p-values for categorical (> 2 categories) and spline variables.
The method argument selects one of these pooling methods: D1, D2, D3 or MPR (pooling of
median p-values, MPR rule). The D1, D2 and D3 methods are called from the package mitml.
For logistic multilevel models (that are estimated using the glmer function), the D3 method
is not yet available. Spline regression coefficients are defined by using the rcs function for
restricted cubic splines of the rms package. A minimum number of 3 knots, as defined under nknots, is required.
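To show how the separate character arguments relate to a mixed model formula, the sketch below combines an outcome, fixed-effect predictors, a categorical predictor and a random intercept into one formula. The variable names mirror the example for this function; how psfmi_mm builds its formula internally is not documented here, so this construction is an assumption for illustration only.

```r
# Argument values in the style of this function's example
Outcome        <- "dbp"
predictors     <- c("gender", "afib", "sbp")
cat.predictors <- "bmi_cat"
random.eff     <- "( 1 | centre)"   # random intercept per cluster

# Fixed effects, categorical predictor as factor(), then the random effect
rhs <- c(predictors, paste0("factor(", cat.predictors, ")"), random.eff)
fm <- as.formula(paste(Outcome, "~", paste(rhs, collapse = " + ")))
fm

# Variables referenced by the formula, including the cluster variable
all.vars(fm)
```

A formula of this shape is what mixed model fitting functions such as glmer, mentioned above, accept as input.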
An object of class smodsmi (selected models in multiply imputed datasets) from
which the following objects can be extracted: imputed datasets as data, selected
pooled model as RR_model, pooled p-values according to pooling method as multiparm,
random effects as random.eff, predictors included at each selection step as predictors_in,
predictors excluded at each step as predictors_out, and family, impvar, clusvar,
nimp, Outcome, method, p.crit, predictors, cat.predictors,
keep.predictors, int.predictors, spline.predictors, knots, print.method,
model_type, call, predictors_final for names of predictors in final step and
fit.formula is the regression formula of start model.
Eekhout I, van de Wiel MA, Heymans MW. Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis. BMC Med Res Methodol. 2017;17(1):129.
Enders CK (2010). Applied missing data analysis. New York: The Guilford Press.
Meng X-L, Rubin DB. Performing likelihood ratio tests with multiply-imputed data sets. Biometrika. 1992;79:103-11.
van de Wiel MA, Berkhof J, van Wieringen WN. Testing the prediction error difference between 2 predictors. Biostatistics. 2009;10:550-60.
mitml package https://cran.r-project.org/web/packages/mitml/index.html
Van Buuren S. (2018). Flexible Imputation of Missing Data. 2nd Edition. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton.
http://missingdatasolutions.rbind.io/
## Not run: 
pool_mm <- psfmi_mm(data=ipdna_md, nimp=5, impvar=".imp", family="linear",
  predictors=c("gender", "afib", "sbp"), clusvar = "centre",
  random.eff="( 1 | centre)", Outcome="dbp", cat.predictors = "bmi_cat",
  p.crit=0.15, method="D1", print.method = FALSE)
pool_mm$RR_model
pool_mm$multiparm

## End(Not run)
psfmi_mm_multiparm Function to pool according to D1, D2 and D3 methods
psfmi_mm_multiparm(data, nimp, impvar, Outcome, P, p.crit, family,
  random.eff, method, print.method)
data |
Data frame with stacked multiply imputed datasets. The original dataset that contains missing values must be excluded from the dataset. The imputed datasets must be distinguished by an imputation variable, specified under impvar and starting at 1, and the clusters must be distinguished by a cluster variable, specified under clusvar. |
nimp |
A numerical scalar. Number of imputed datasets. Default is 5. |
impvar |
A character vector. Name of the variable that distinguishes the imputed datasets. |
Outcome |
Character vector containing the name of the outcome variable. |
P |
Character vector with the names of the predictor variables. At least one predictor variable has to be defined. |
p.crit |
A numerical scalar. P-value selection criterion. A value of 1 provides the pooled model without selection. |
family |
Character vector to specify the type of model, "linear" is used to
call the |
random.eff |
Character vector to specify the random effects as used by the
|
method |
A character vector to indicate the pooling method for p-values to pool the total model or used during predictor selection. This can be "D1", "D2", "D3" or "MPR". See details for more information. |
print.method |
Logical. If TRUE, the full matrix with p-values of all variables according to the chosen method (under method) is shown. If FALSE (default), p-values for categorical variables are shown according to method, and Rubin's Rules are used for continuous and dichotomous predictors. |
## Not run: 
psfmi_mm_multiparm(data=ipdna_md, nimp=5, impvar=".imp", family="linear",
  P=c("gender", "bnp", "dbp", "lvef", "bmi_cat"),
  random.eff="( 1 | centre)", Outcome="sbp", p.crit=0.05, method="D1",
  print.method = FALSE)

## End(Not run)
psfmi_perform Evaluate Performance of logistic regression models selected with
the psfmi_lr function of the psfmi package by using cross-validation
or bootstrapping.
psfmi_perform(pobj, val_method = NULL, data_orig = NULL, int_val = TRUE,
  nboot = 10, folds = 3, nimp_cv = 5, nimp_mice = 5, p.crit = 1,
  BW = FALSE, direction = NULL, cv_naive_appt = FALSE, cal.plot = FALSE,
  plot.method = "mean", groups_cal = 5, miceImp, ...)
pobj |
An object of class |
val_method |
Method for internal validation. MI_boot first applies multiple imputation and then bootstrapping in each imputed dataset; boot_MI first applies bootstrapping and then multiple imputation in each bootstrap sample; cv_MI, cv_MI_RR and MI_cv_naive combine cross-validation and multiple imputation. To use cv_MI, cv_MI_RR and boot_MI, data_orig has to be specified. See details for more information. |
data_orig |
dataframe of original dataset that contains missing data for methods cv_MI, cv_MI_RR and boot_MI. |
int_val |
If TRUE internal validation is conducted using bootstrapping or cross-validation. Default is TRUE. If FALSE only apparent performance measures are calculated. |
nboot |
The number of bootstrap resamples, default is 10. Used for methods boot_MI and MI_boot. |
folds |
The number of folds, default is 3. Used for methods cv_MI, cv_MI_RR and MI_cv_naive. |
nimp_cv |
Numerical scalar. Number of (multiple) imputation runs for method cv_MI. |
nimp_mice |
Numerical scalar. Number of imputed datasets for method cv_MI_RR and boot_MI.
When not defined, the number of multiply imputed datasets is used of the
previous call to the function |
p.crit |
A numerical scalar. P-value selection criterion used for backward or forward selection during validation. When set at 1, pooling and internal validation are done without predictor selection. |
BW |
Only used for methods cv_MI, cv_MI_RR and MI_cv_naive. If TRUE backward selection is conducted within cross-validation. Default is FALSE. |
direction |
Can be used together with val_methods boot_MI and MI_boot. The direction of predictor selection, "BW" is for backward selection and "FW" for forward selection. |
cv_naive_appt |
Can be used in combination with val_method MI_cv_naive. If TRUE, the cross-validation apparent (train) and test results are shown. If FALSE (default), only test results are given. |
cal.plot |
If TRUE a calibration plot is generated. Default is FALSE. Can be used in combination with int_val = FALSE. |
plot.method |
If "mean", one calibration plot is generated by first taking the mean of the linear predictor across the multiply imputed datasets (default); if "individual", the calibration plot of each imputed dataset is plotted; if "overlay", the calibration plots of all imputed datasets are plotted in one figure. |
groups_cal |
A numerical scalar. Number of groups used for the calibration plot and for the Hosmer and Lemeshow test. Default is 5. If the range of predicted probabilities is low, fewer than 10 groups can be chosen, but not fewer than 3. |
miceImp |
Wrapper function around the |
... |
Arguments as predictorMatrix, seed, maxit, etc that can be adjusted for
the |
For internal validation five methods can be used: cv_MI, cv_MI_RR, MI_cv_naive,
MI_boot and boot_MI. Method cv_MI uses imputation within each cross-validation fold definition.
By repeating this in several imputation runs, multiply imputed datasets are generated. Method
cv_MI_RR uses multiple imputation within the cross-validation definition. MI_cv_naive applies
cross-validation within each imputed dataset. MI_boot draws for each bootstrap step the same
cases in all imputed datasets. With boot_MI, bootstrap samples are first drawn from the original
dataset with missing values and then multiple imputation is applied. For multiple imputation
the mice function from the mice package is used. It is recommended to use a minimum
of 100 imputation runs for method cv_MI or 100 bootstrap samples for method boot_MI or MI_boot.
Methods cv_MI, cv_MI_RR and MI_cv_naive can be combined with backward selection during
cross-validation, and with methods boot_MI and MI_boot backward and forward selection can
be used. For methods cv_MI and cv_MI_RR the outcome in the original dataset has to be complete.
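The MI_boot idea of drawing the same cases in all imputed datasets at each bootstrap step can be sketched in base R as follows. The toy data, the id column and the object names are assumptions for illustration; the package's own resampling code may be organised differently.

```r
set.seed(123)
n <- 6      # number of cases
nimp <- 3   # number of imputed datasets

# Toy stacked data: nimp imputed copies of the same n cases (made-up values)
stacked <- data.frame(Impnr = rep(1:nimp, each = n),
                      id    = rep(1:n, times = nimp),
                      x     = rnorm(n * nimp))

# One bootstrap step: draw case ids once, with replacement ...
boot_ids <- sample(1:n, size = n, replace = TRUE)

# ... and select exactly those cases in every imputed dataset
boot_stacked <- do.call(rbind, lapply(split(stacked, stacked$Impnr), function(d) {
  d[match(boot_ids, d$id), ]
}))

# Each imputed dataset now holds the identical sequence of cases
ids_by_imp <- split(boot_stacked$id, boot_stacked$Impnr)
ids_by_imp
```

Drawing the ids once per step keeps the bootstrap samples comparable across imputed datasets, so that only the imputed values differ between them.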
A psfmi_perform object from which the following objects can be extracted: res_boot,
result of pooled performance (in multiply imputed datasets) at each bootstrap step of ROC app (pooled
ROC), ROC test (pooled ROC after bootstrap model is applied in original multiply imputed datasets),
same for R2 app (Nagelkerke's R2), R2 test, Scaled Brier app and Scaled Brier test. Information is also provided
about testing the Calibration slope at each bootstrap step as interc test and Slope test.
The performance measures are pooled by a call to the function pool_performance. Another
object that can be extracted is intval, with information of the AUC, R2, Scaled Brier score and
Calibration slope averaged over the bootstrap samples, in terms of: Orig (original datasets),
Apparent (models applied in bootstrap samples), Test (bootstrap models are applied in original datasets),
Optimism (difference between apparent and test) and Corrected (original corrected for optimism).
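The relation between the quantities reported in intval can be written out as simple arithmetic. The AUC values below are made up, and subtracting the optimism from the original performance is the usual bootstrap optimism correction, assumed here to match what the function reports.

```r
# Made-up AUC values in the roles described for the intval object
orig     <- 0.82   # model performance in the original (imputed) datasets
apparent <- 0.85   # bootstrap models evaluated in their own bootstrap samples
test     <- 0.80   # bootstrap models evaluated in the original datasets

optimism  <- apparent - test    # difference between apparent and test
corrected <- orig - optimism    # original performance corrected for optimism

c(Orig = orig, Apparent = apparent, Test = test,
  Optimism = optimism, Corrected = corrected)
```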
Martijn Heymans, 2020
Heymans MW, van Buuren S, Knol DL, van Mechelen W, de Vet HC. Variable selection under multiple imputation using the bootstrap in a prognostic study. BMC Med Res Methodol. 2007;7:33.
F. Harrell. Regression Modeling Strategies. With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (2nd edition). Springer, New York, NY, 2015.
Van Buuren S. (2018). Flexible Imputation of Missing Data. 2nd Edition. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton.
Harel, O. (2009). The estimation of R2 and adjusted R2 in incomplete data sets using multiple imputation. Journal of Applied Statistics, 36(10), 1109-1118.
Musoro JZ, Zwinderman AH, Puhan MA, ter Riet G, Geskus RB. Validation of prediction models based on lasso regression with multiply imputed data. BMC Med Res Methodol. 2014;14:116.
Wahl S, Boulesteix AL, Zierer A, Thorand B, van de Wiel MA. Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation. BMC Med Res Methodol. 2016;16(1):144.
EW. Steyerberg (2019). Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating (2nd edition). Springer Nature Switzerland AG.
http://missingdatasolutions.rbind.io/
psfmi_stab Stability analysis of predictors and prediction models selected with
the psfmi_lr, psfmi_coxr or psfmi_mm functions of the psfmi package.
psfmi_stab(pobj, boot_method = NULL, nboot = 20, p.crit = 0.05, start_model = TRUE, direction = NULL)
pobj |
An object of class |
boot_method |
A single string to define the bootstrap method. Use "single" after a call to
|
nboot |
A numerical scalar. Number of bootstrap samples to evaluate the stability. Default is 20. |
p.crit |
A numerical scalar. Used as P-value selection criterion during bootstrap model selection. |
start_model |
If TRUE the bootstrap evaluation takes place from the start model of object pobj, if FALSE the final model is used for the evaluation. |
direction |
The direction of predictor selection, "BW" for backward selection and "FW" for forward selection. |
The function evaluates predictor selection frequency in stratified or cluster bootstrap samples.
The stratification factor is the variable that separates the imputed datasets. The same bootstrap cases
are drawn in each bootstrap sample. It uses as input an object of class pmods as a result of a
previous call to the psfmi_lr, psfmi_coxr or psfmi_mm functions.
In combination with the psfmi_mm function a cluster bootstrap method is used where bootstrapping
is used on the level of the clusters only (and not also within the clusters).
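Resampling on the level of the clusters only can be sketched in base R as follows. This is an illustration under assumptions (toy data, a hypothetical cluster column), not the code used by psfmi_stab.

```r
set.seed(2)
# Toy clustered data: 4 clusters of unequal size ('cluster' is a hypothetical name).
dat <- data.frame(
  cluster = rep(c("a", "b", "c", "d"), times = c(3, 2, 4, 1)),
  y       = rnorm(10)
)

clusters <- unique(dat$cluster)

# Resample whole clusters with replacement; rows within a cluster are kept as-is.
drawn <- sample(clusters, size = length(clusters), replace = TRUE)

boot_dat <- do.call(rbind, lapply(seq_along(drawn), function(i) {
  d <- dat[dat$cluster == drawn[i], ]
  d$boot_cluster <- i  # new id so a cluster drawn twice stays distinct
  d
}))
```

Each drawn cluster enters the bootstrap sample with all of its rows, so the within-cluster structure is left untouched.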
A psfmi_stab object from which the following objects can be extracted: bootstrap
inclusion (selection) frequency of each predictor as bif, total number of times each predictor is
included in the bootstrap samples as bif_total, percentage of bootstrap samples in which a predictor
is selected as bif_perc and number of times a prediction model is selected in
the bootstrap samples as model_stab.
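How totals and percentages of this kind relate to per-sample selection indicators can be shown with a small base R sketch. The indicator matrix is hypothetical and the object names only mirror the ones described above; the layout of the real psfmi_stab output may differ.

```r
# Hypothetical selection indicators: rows = bootstrap samples, columns = predictors
# (1 = predictor selected in that sample, 0 = not selected).
sel <- rbind(
  c(Pain = 1, Age = 0, Smoking = 1),
  c(Pain = 1, Age = 1, Smoking = 0),
  c(Pain = 1, Age = 0, Smoking = 0),
  c(Pain = 0, Age = 0, Smoking = 1)
)
nboot <- nrow(sel)

bif_total <- colSums(sel)             # number of samples in which each predictor was selected
bif_perc  <- 100 * bif_total / nboot  # the same, as a percentage of the bootstrap samples
```

In this toy matrix Pain is selected in 3 of 4 samples (75%), Age in 1 (25%) and Smoking in 2 (50%).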
https://mwheymans.github.io/psfmi/articles/psfmi_StabilityAnalysis.html
Heymans MW, van Buuren S, et al. Variable selection under multiple imputation using the bootstrap in a prognostic study. BMC Med Res Methodol. 2007;7:33.
Eekhout I, van de Wiel MA, Heymans MW. Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis. BMC Med Res Methodol. 2017;17(1):129.
Sauerbrei W, Schumacher M. A bootstrap resampling procedure for model building: application to the Cox regression model. Stat Med. 1992;11:2093–109.
Royston P, Sauerbrei W (2008) Multivariable model-building – a pragmatic approach to regression analysis based on fractional polynomials for modelling continuous variables. Chapter 8, Model Stability. Wiley, Chichester.
Heinze G, Wallisch C, Dunkler D. Variable selection - A review and recommendations for the practicing statistician. Biom J. 2018;60(3):431-449.
http://missingdatasolutions.rbind.io/
pool_lr <- psfmi_coxr(formula = Surv(Time, Status) ~ Pain + factor(Satisfaction) +
  rcs(Tampascale,3) + Radiation + Radiation*factor(Satisfaction) + Age +
  Duration + Previous + Radiation*rcs(Tampascale, 3), data=lbpmicox, p.crit = 0.157,
  direction="FW", nimp=5, impvar="Impnr", keep.predictors = NULL, method="D1")
pool_lr$RR_model
pool_lr$multiparm

## Not run:
stab_res <- psfmi_stab(pool_lr, direction="FW", start_model = TRUE,
  boot_method = "single", nboot=20, p.crit=0.05)
stab_res$bif
stab_res$bif_perc
stab_res$model_stab
## End(Not run)
psfmi_validate Evaluate Performance of logistic regression models selected with
the psfmi_lr function of the psfmi package by using cross-validation
or bootstrapping.
psfmi_validate(pobj, val_method = NULL, data_orig = NULL, int_val = TRUE, nboot = 10,
  folds = 3, nimp_cv = 5, nimp_mice = 5, p.crit = 1, BW = FALSE, direction = NULL,
  cv_naive_appt = FALSE, cal.plot = FALSE, plot.method = "mean", groups_cal = 5,
  miceImp, ...)
pobj |
An object of class |
val_method |
Method for internal validation. MI_boot for first multiple imputation and then bootstrapping in each imputed dataset, boot_MI for first bootstrapping and then multiple imputation in each bootstrap sample, and cv_MI, cv_MI_RR and MI_cv_naive for the combinations of cross-validation and multiple imputation. To use cv_MI, cv_MI_RR and boot_MI, data_orig has to be specified. See details for more information. |
data_orig |
dataframe of original dataset that contains missing data for methods cv_MI, cv_MI_RR and boot_MI. |
int_val |
If TRUE internal validation is conducted using bootstrapping or cross-validation. Default is TRUE. If FALSE only apparent performance measures are calculated. |
nboot |
The number of bootstrap resamples, default is 10. Used for methods boot_MI and MI_boot. |
folds |
The number of folds, default is 3. Used for methods cv_MI, cv_MI_RR and MI_cv_naive. |
nimp_cv |
Numerical scalar. Number of (multiple) imputation runs for method cv_MI. |
nimp_mice |
Numerical scalar. Number of imputed datasets for method cv_MI_RR and boot_MI.
When not defined, the number of multiply imputed datasets is used of the
previous call to the function |
p.crit |
A numerical scalar. P-value selection criterion used for backward or forward selection during validation. When set at 1, pooling and internal validation are done without backward selection. |
BW |
Only used for methods cv_MI, cv_MI_RR and MI_cv_naive. If TRUE backward selection is conducted within cross-validation. Default is FALSE. |
direction |
Can be used together with val_methods boot_MI and MI_boot. The direction of predictor selection, "BW" is for backward selection and "FW" for forward selection. |
cv_naive_appt |
Can be used in combination with val_method MI_cv_naive. If TRUE, the cross-validation apparent (train) and test results are shown; if FALSE, only the test results are given. Default is FALSE, as in the usage. |
cal.plot |
If TRUE a calibration plot is generated. Default is FALSE. Can be used in combination with int_val = FALSE. |
plot.method |
If "mean" one calibration plot is generated, first taking the mean of the linear predictor across the multiply imputed datasets (default), if "individual" the calibration plot of each imputed dataset is plotted, if "overlay" calibration plots from each imputed datasets are plotted in one figure. |
groups_cal |
A numerical scalar. Number of groups used on the calibration plot and for the Hosmer and Lemeshow test. Default is 5, as in the usage. If the range of predicted probabilities is low, fewer groups can be chosen, but not fewer than 3. |
miceImp |
Wrapper function around the |
... |
Arguments as predictorMatrix, seed, maxit, etc that can be adjusted for
the |
For internal validation five methods can be used: cv_MI, cv_MI_RR, MI_cv_naive,
MI_boot and boot_MI. Method cv_MI uses imputation within each cross-validation fold definition.
By repeating this in several imputation runs, multiply imputed datasets are generated. Method
cv_MI_RR uses multiple imputation within the cross-validation definition. MI_cv_naive applies
cross-validation within each imputed dataset. MI_boot draws for each bootstrap step the same
cases in all imputed datasets. With boot_MI, bootstrap samples are first drawn from the original
dataset with missing values and then multiple imputation is applied. For multiple imputation
the mice function from the mice package is used. It is recommended to use a minimum
of 100 imputation runs for method cv_MI or 100 bootstrap samples for method boot_MI or MI_boot.
Methods cv_MI, cv_MI_RR and MI_cv_naive can be combined with backward selection during
cross-validation, and with methods boot_MI and MI_boot backward and forward selection can
be used. For methods cv_MI and cv_MI_RR the outcome in the original dataset has to be complete.
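For the cross-validation methods that start from the original incomplete data, the folds are defined before imputation. A minimal base R sketch of such a fold definition follows; the toy data are an assumption, and the imputation step is only indicated by a comment rather than executed.

```r
set.seed(3)
n <- 10
folds <- 3

# Incomplete toy data: the outcome 'y' is complete, predictor 'x' has missing values.
dat <- data.frame(y = rbinom(n, 1, 0.5), x = c(rnorm(n - 2), NA, NA))

# Define the folds once on the original (incomplete) data.
fold_id <- sample(rep(seq_len(folds), length.out = n))

# Build a train/test split per fold. An imputation step (for example with mice)
# would be applied per fold before fitting on 'train' and evaluating on 'test'.
splits <- lapply(seq_len(folds), function(k) {
  list(train = dat[fold_id != k, ], test = dat[fold_id == k, ])
})
```

Every subject ends up in exactly one test set, which is why the fold definition can be fixed before any values are imputed.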
A psfmi_perform object from which the following objects can be extracted: res_boot,
result of pooled performance (in multiply imputed datasets) at each bootstrap step of ROC app (pooled
ROC), ROC test (pooled ROC after bootstrap model is applied in original multiply imputed datasets),
same for R2 app (Nagelkerke's R2), R2 test, Scaled Brier app and Scaled Brier test. Information is also provided
about testing the Calibration slope at each bootstrap step as interc test and Slope test.
The performance measures are pooled by a call to the function pool_performance. Another
object that can be extracted is intval, with information of the AUC, R2, Scaled Brier score and
Calibration slope averaged over the bootstrap samples, in terms of: Orig (original datasets),
Apparent (models applied in bootstrap samples), Test (bootstrap models are applied in original datasets),
Optimism (difference between apparent and test) and Corrected (original corrected for optimism).
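The relation between Apparent, Test, Optimism and Corrected described above can be written out as simple arithmetic. The numbers below are hypothetical AUC values, not package output.

```r
# Hypothetical pooled AUC values (assumptions for illustration only).
orig     <- 0.80                       # model evaluated in the original datasets
apparent <- c(0.84, 0.82, 0.85, 0.83)  # bootstrap models evaluated in their own bootstrap sample
test     <- c(0.78, 0.79, 0.77, 0.80)  # the same bootstrap models evaluated in the original datasets

# Optimism is the average drop from apparent to test performance.
optimism  <- mean(apparent) - mean(test)

# The optimism-corrected estimate subtracts that drop from the original performance.
corrected <- orig - optimism
```

With these values the optimism is 0.05 and the corrected AUC is 0.75.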
Martijn Heymans, 2020
Heymans MW, van Buuren S, Knol DL, van Mechelen W, de Vet HC. Variable selection under multiple imputation using the bootstrap in a prognostic study. BMC Med Res Methodol. 2007;7:33.
F. Harrell. Regression Modeling Strategies. With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (2nd edition). Springer, New York, NY, 2015.
Van Buuren S. (2018). Flexible Imputation of Missing Data. 2nd Edition. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton.
Harel, O. (2009). The estimation of R2 and adjusted R2 in incomplete data sets using multiple imputation. Journal of Applied Statistics, 36(10), 1109-1118.
Musoro JZ, Zwinderman AH, Puhan MA, ter Riet G, Geskus RB. Validation of prediction models based on lasso regression with multiply imputed data. BMC Med Res Methodol. 2014;14:116.
Wahl S, Boulesteix AL, Zierer A, Thorand B, van de Wiel MA. Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation. BMC Med Res Methodol. 2016;16(1):144.
EW. Steyerberg (2019). Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating (2nd edition). Springer Nature Switzerland AG.
http://missingdatasolutions.rbind.io/
pool_lr <- psfmi_lr(data=lbpmilr, formula = Chronic ~ Pain + JobDemands + rcs(Tampascale, 3) +
  factor(Satisfaction) + Smoking, p.crit = 1, direction="FW",
  nimp=5, impvar="Impnr", method="D1")
pool_lr$RR_model

res_perf <- psfmi_validate(pool_lr, val_method = "cv_MI", data_orig = lbp_orig, folds=3,
  nimp_cv = 2, p.crit=0.05, BW=TRUE, miceImp = miceImp, printFlag = FALSE)
res_perf

## Not run:
set.seed(200)
res_val <- psfmi_validate(pool_lr, val_method = "boot_MI", data_orig = lbp_orig, nboot = 5,
  p.crit=0.05, BW=TRUE, miceImp = miceImp, nimp_mice = 5, printFlag = FALSE, direction = "FW")
res_val$stats_val
## End(Not run)
Risk calculation at specific time point for Cox model
risk_coxph(mod, t_risk)
mod |
a Cox regression model object. |
t_risk |
Follow-up value to calculate cases, controls. See details. |
Cox regression Risk estimates at specific time point.
Martijn Heymans, 2023
Pencina MJ, D'Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011;30(1):11-21
Inoue E (2018). nricens: NRI for Risk Prediction Models with Time to Event and Binary Response Data. R package version 1.6, <https://CRAN.R-project.org/package=nricens>.
Nagelkerke's R-square calculation for logistic regression / glm models
rsq_nagel(fitobj)
fitobj |
a logistic regression model object of class "glm" |
The value for the explained variance.
Martijn Heymans, 2020
psfmi_perform, pool_performance
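As an illustration of what Nagelkerke's R-square measures, the following base R sketch computes it from the deviances of a fitted glm on simulated data. It uses the standard textbook definition for a 0/1 outcome and is not taken from the rsq_nagel source; all object names are assumptions.

```r
set.seed(4)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

# For a 0/1 outcome the deviance equals -2 * log-likelihood, so the
# likelihood-based definitions can be written in terms of deviances.
r2_cs  <- 1 - exp((fit$deviance - fit$null.deviance) / n)  # Cox-Snell R-square
r2_max <- 1 - exp(-fit$null.deviance / n)                  # its maximum attainable value
r2_nagelkerke <- r2_cs / r2_max                            # rescaled to the 0-1 range
```

Dividing by the maximum attainable value is what distinguishes Nagelkerke's version from the Cox-Snell R-square, which cannot reach 1 for binary outcomes.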
R-square calculation for Cox regression models
rsq_surv(fitobj)
fitobj |
a Cox regression model object of class "coxph" |
The value for the explained variance.
Martijn Heymans, 2021
F. Harrell. Regression Modeling Strategies. With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd Edition. Springer, New York, NY, 2015.
Dataset with blood pressure measurements
data(sbp_age)
A data frame with 30 observations on the following 3 variables.
pat_id (continuous)
sbp (continuous): systolic blood pressure
age (continuous): age (years)
data(sbp_age)
## maybe str(sbp_age)
Dataset with blood pressure measurements
data(sbp_qas)
A data frame with 32 observations on the following 5 variables.
pat_id (continuous)
sbp (continuous): systolic blood pressure
bmi (continuous): body mass index
age (continuous): age (years)
smk (dichotomous): 0 = no, 1 = yes
data(sbp_qas)
## maybe str(sbp_qas)
Calculates the scaled Brier score
scaled_brier(obs, pred)
obs |
Observed outcomes. |
pred |
Predicted outcomes in the form of probabilities. |
The value for the scaled Brier score.
Martijn Heymans, 2020
psfmi_perform, pool_performance
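A commonly used definition of the scaled Brier score divides the Brier score by the score of a non-informative model that always predicts the outcome prevalence. The base R sketch below follows that definition; the function name scaled_brier_sketch is hypothetical, and the package's own scaled_brier may differ in detail.

```r
# Hypothetical helper, not the package function.
scaled_brier_sketch <- function(obs, pred) {
  brier     <- mean((obs - pred)^2)         # mean squared difference
  brier_max <- mean(obs) * (1 - mean(obs))  # Brier score when predicting the prevalence
  1 - brier / brier_max
}

obs <- c(0, 0, 1, 1)

# Predicting the prevalence for everyone gives a scaled score of 0.
sb_prevalence <- scaled_brier_sketch(obs, rep(mean(obs), length(obs)))

# Perfect predictions give a scaled score of 1.
sb_perfect <- scaled_brier_sketch(obs, obs)
```

The scaling makes the score comparable across datasets with a different outcome prevalence, with higher values indicating better predictions.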
Survival data about smoking
data(smoking)
A data frame with 20 observations on the following 3 variables.
smoking (dichotomous): 1=yes, 0=no
time (continuous): Survival time in years
death (dichotomous): Status at end of study
data(smoking)
## maybe str(smoking)
stab_single Stability analysis of predictors and prediction models selected with
the glm_bw function.
stab_single(pobj, nboot = 20, p.crit = 0.05, start_model = TRUE)
pobj |
An object of class |
nboot |
A numerical scalar. Number of bootstrap samples to evaluate the stability. Default is 20. |
p.crit |
A numerical scalar. Used as P-value selection criterion during bootstrap model selection. |
start_model |
If TRUE the bootstrap evaluation takes place from the start model of object pobj, if FALSE the final model is used for the evaluation. |
The function evaluates predictor selection frequency in bootstrap samples.
It uses as input an object of class smods as a result of a
previous call to the glm_bw function.
A psfmi_stab object from which the following objects can be extracted: bootstrap
inclusion (selection) frequency of each predictor as bif, total number of times each predictor is
included in the bootstrap samples as bif_total, percentage of bootstrap samples in which a predictor
is selected as bif_perc and number of times a prediction model is selected in
the bootstrap samples as model_stab.
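Counting how often each complete model is selected, as opposed to each single predictor, can be sketched in base R by turning every bootstrap sample's selected predictors into a label and tabulating the labels. The indicator matrix is hypothetical and the layout of the real model_stab output may differ.

```r
# Hypothetical selection indicators: rows = bootstrap samples, columns = predictors.
sel <- rbind(
  c(Pain = 1, Age = 0, Smoking = 1),
  c(Pain = 1, Age = 0, Smoking = 1),
  c(Pain = 1, Age = 1, Smoking = 0),
  c(Pain = 1, Age = 0, Smoking = 1),
  c(Pain = 0, Age = 0, Smoking = 1)
)

# Label each bootstrap sample by the set of predictors selected in it.
model_id <- apply(sel, 1, function(r) paste(colnames(sel)[r == 1], collapse = " + "))

# Tabulate the labels: how many times was each model selected?
model_stab <- sort(table(model_id), decreasing = TRUE)
```

Here the model with Pain and Smoking is selected in 3 of the 5 samples, and the other two models once each.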
Heymans MW, van Buuren S, et al. Variable selection under multiple imputation using the bootstrap in a prognostic study. BMC Med Res Methodol. 2007;7:33.
Sauerbrei W, Schumacher M. A bootstrap resampling procedure for model building: application to the Cox regression model. Stat Med. 1992;11:2093–109.
Royston P, Sauerbrei W (2008) Multivariable model-building – a pragmatic approach to regression analysis based on fractional polynomials for modelling continuous variables. Chapter 8, Model Stability. Wiley, Chichester.
Heinze G, Wallisch C, Dunkler D. Variable selection - A review and recommendations for the practicing statistician. Biom J. 2018;60(3):431-449.
http://missingdatasolutions.rbind.io/
model_lr <- glm_bw(formula = Radiation ~ Pain + factor(Satisfaction) +
  rcs(Tampascale,3) + Age + Duration + JobControl + JobDemands + SocialSupport,
  data=lbpmilr_dev, p.crit = 0.05)

## Not run:
stab_res <- stab_single(model_lr, start_model = TRUE, nboot=20, p.crit=0.05)
stab_res$bif
stab_res$bif_perc
stab_res$model_stab
## End(Not run)
Dataset of persons from the Amsterdam Growth and Health Longitudinal Study (AGHLS)
data(weight)
A data frame with 450 observations on the following 7 variables.
ID (continuous)
SBP (continuous): Systolic Blood Pressure
LDL (continuous): Cholesterol
Glucose (continuous)
HDL (continuous): Cholesterol
Gender (dichotomous): 1=male, 0=female
Weight (continuous): bodyweight
data(weight)
## maybe str(weight)