`smdi` dataset background

To get acquainted with the functionality and usage of the smdi package, the package includes a simulated example dataset. The dataset is an exemplary low-dimensional electronic health records (EHR) dataset depicting a cohort of 2,500 lung cancer patients. The dataset follows the general one-row-per-patient structure, in which one row stands for an individual patient and the columns represent the variables.

Exposure and outcome

Let’s assume that we are interested in studying the comparative effectiveness of two antineoplastic systemic drug treatment regimens (exposure (0/1)) on the time to death due to any reason as the outcome (overall survival). The anticipated time of follow-up is truncated to 5 years.

The desired strength of effectiveness of the exposure of interest is defined with a hazard ratio (HR) of 1.0, i.e. there is no difference in overall survival among patients who are treated with the exposure of interest as compared to the comparator regimen. The proportional hazards assumption is fulfilled for this dataset.

Confounders

We further assume that there are a some confounders which we need to specify to estimate our outcome model. Most of the covariates are associated with both the probability of treatment initiation and the outcome but there are also some that are not predictive of the exposure and just the outcome or not associated with any of the exposure or outcome, whatsoever.

Despite the low dimensionality, the dataset is simulated as realistically as possible with varying strengths of associations between covariates and treatment initiation and the outcome.

Missingness

Most importantly for testing the functionality of this package, some of the above mentioned confounders are just partially observed according to the missingness mechanisms and proportions specified below in the table below.

Overview covariates/confounder structure

To get an overview of the dataset, this table provides a summary of the different covariate-exposure-outcome-missingness correlations.

Variable	Description	Associated with exposure/outcome	Missingness [%]
age_num	Age at baseline (continuous)	Yes/Yes
female_cat	Female gender (binary)	Yes/Yes
ecog_cat	ECOG performance score 0/1 or >1 (binary)	Yes/Yes	MCAR [35%]
smoking_cat	Smoker vs. non-smoker at baseline (binary)	Yes/Yes
physical_cat	Physically active vs not active (binary)	Yes/Yes
egfr_cat	EGFR alteration (binary)	Yes/Yes	MAR [40%]
alk_cat	ALK translocation (binary)	No/Yes
pdl1_num	PD-L1 expression in % (continuous)	Yes/Yes	MNAR(value) [20%]
histology_cat	Tumor histology squamous vs nonsquamous (binary)	No/Yes
ses_cat	Socio-economic status (multi-categorical)	No/No
copd_cat	History of COPD (binary)	No/No	Auxiliary to smoking

Simulation of covariates and exposure

set.seed(seed_value)

# start with basic dataframe, covariates and their association with exposure
sim_covar <- tibble::tibble(
  exposure = rbinom(n = n, size = 1, prob = 0.4),
  age_num = rnorm(n, mean = 64 - 7.5*exposure, sd = 13.7),
  female_cat = rbinom(n, size = 1, prob = 0.39 - 0.05*exposure),
  ecog_cat = rbinom(n, size = 1, prob = 0.63 - 0.04*exposure),
  smoking_cat = rbinom(n, size = 1, prob = 0.45 + 0.1*exposure), 
  physical_cat = rbinom(n, size = 1, prob = 0.35 + 0.02*exposure),
  egfr_cat = rbinom(n, size = 1, prob = 0.20 + 0.07*exposure), 
  alk_cat = rbinom(n, size = 1, prob = 0.03),
  pdl1_num = rnorm(n, mean = 40 + 10*exposure, sd = 10.5),
  histology_cat = rbinom(n, size = 1, prob = 0.2),
  ses_cat = sample(x = c("1_low", "2_middle", "3_high"), size = n, replace = TRUE, prob = c(0.2 , 0.4, 0.4)),
  copd_cat = rbinom(n, size = 1, prob = 0.3 + 0.5*smoking_cat)
  ) %>%  
  # bring data in right format
  dplyr::mutate(across(ends_with("num"), as.numeric)) %>% 
  dplyr::mutate(across(ends_with("num"), function(x) round(x, digits = 2)))

In the first step, we create a dataset with 2,500 patients and 12 variables with the following distributions.

sim_covar %>% 
  gtsummary::tbl_summary(by = "exposure") %>% 
  gtsummary::add_difference()

Characteristic	0, N = 1,502¹	1, N = 998¹	Difference²	95% CI^2,3	p-value²
age_num	63 (54, 73)	57 (48, 66)	6.5	5.4, 7.6	<0.001
female_cat	588 (39%)	334 (33%)	5.7%	1.8%, 9.6%	0.004
ecog_cat	923 (61%)	578 (58%)	3.5%	-0.48%, 7.5%	0.084
smoking_cat	686 (46%)	567 (57%)	-11%	-15%, -7.1%	<0.001
physical_cat	502 (33%)	380 (38%)	-4.7%	-8.6%, -0.73%	0.019
egfr_cat	286 (19%)	255 (26%)	-6.5%	-10%, -3.1%	<0.001
alk_cat	39 (2.6%)	30 (3.0%)	-0.41%	-1.8%, 1.0%	0.6
pdl1_num	40 (33, 47)	50 (43, 57)	-10	-11, -9.2	<0.001
histology_cat	288 (19%)	220 (22%)	-2.9%	-6.2%, 0.47%	0.090
ses_cat			0.05	-0.03, 0.13
1_low	302 (20%)	222 (22%)
2_middle	592 (39%)	377 (38%)
3_high	608 (40%)	399 (40%)
copd_cat	738 (49%)	600 (60%)	-11%	-15%, -7.0%	<0.001
¹ Median (IQR); n (%)
² Welch Two Sample t-test; Two sample test for equality of proportions; Standardized Mean Difference
³ CI = Confidence Interval

The following table illustrates the odds of exposure assignment.

exposure_form <- as.formula(paste("exposure ~ ", paste(colnames(sim_covar %>% dplyr::select(-exposure)), collapse = " + ")))

exposure_fit <- stats::glm(
  exposure_form,
  data = sim_covar,
  family = "binomial"
  )

exposure_fit %>% 
  gtsummary::tbl_regression(exponentiate = T)

Characteristic	OR¹	95% CI¹	p-value
age_num	0.97	0.96, 0.97	<0.001
female_cat	0.73	0.60, 0.88	0.001
ecog_cat	0.83	0.69, 1.00	0.054
smoking_cat	1.36	1.10, 1.69	0.005
physical_cat	1.27	1.05, 1.54	0.015
egfr_cat	1.47	1.18, 1.84	<0.001
alk_cat	1.36	0.77, 2.36	0.3
pdl1_num	1.10	1.08, 1.11	<0.001
histology_cat	1.15	0.91, 1.44	0.2
ses_cat
1_low	—	—
2_middle	0.88	0.69, 1.13	0.3
3_high	0.85	0.66, 1.09	0.2
copd_cat	1.39	1.12, 1.72	0.003
¹ OR = Odds Ratio, CI = Confidence Interval

Fitting a generalized linear model and assessing the probability of treatment assignment, the above constellation of odds results in the following simulated distributions depicting propensities of treatment initiation (aka propensity scores).

# compute propensity score
exposure_plot <- sim_covar %>% 
  dplyr::mutate(ps = fitted(exposure_fit))

# plot density
exposure_plot %>% 
  ggplot2::ggplot(ggplot2::aes(x = ps, fill = factor(exposure))) +
  ggplot2::geom_density(alpha = .5) +
  ggplot2::theme_bw() +
  ggplot2::labs(
    x = "Pr(exposure)",
    y = "Density",
    fill = "Exposed"
  )

Treatment assignment probabilities.

Simulate time-to-event

Next, we simulate a time-to-event outcome for overall survival. For this, the simsurv package is used with the following assumptions:

Parametric event times, following an exponential distribution
Max event times: 5 years of follow-up
Event times depend on some (time-fixed/baseline) covariate effects as listed in the table above
beta coefficients for the outcome model as defined in the following

betas_os <- c(
  exposure = log(1),
  age_num = log(1.05),
  female_cat = log(0.94),
  ecog_cat = log(1.25),
  smoking_cat = log(1.3),
  physical_cat = log(0.79),
  egfr_cat = log(0.5),
  alk_cat = log(0.91),
  pdl1_num = log(0.98),
  histology_cat = log(1.15)
  )

betas_os %>% 
  as.data.frame() %>% 
  dplyr::transmute(logHR = round(`.`, 2)) %>% 
  tibble::rownames_to_column(var = "Covariate") %>% 
  dplyr::mutate(HR = round(exp(logHR), 2)) %>% 
  gt::gt()

Covariate	logHR	HR
exposure	0.00	1.00
age_num	0.05	1.05
female_cat	-0.06	0.94
ecog_cat	0.22	1.25
smoking_cat	0.26	1.30
physical_cat	-0.24	0.79
egfr_cat	-0.69	0.50
alk_cat	-0.09	0.91
pdl1_num	-0.02	0.98
histology_cat	0.14	1.15

set.seed(seed_value)

sim_df <- sim_covar %>% dplyr::bind_cols(
  simsurv::simsurv(
    dist = "exponential",
    lambdas = 0.05,
    betas = betas_os,
    x = sim_covar,
    maxt = 5 
    )
  ) %>% 
  dplyr::select(-id)

Kaplan-Meier estimates

The simulation resulted in the following crude Kaplan-Meier estimates for 1-yr and 5-yr survival. Unadjusted estimates belown are shown overall and by exposure for overall survival.

km_overall <- survival::survfit(survival::Surv(eventtime, status) ~ 1, data = sim_df)
km_exposure <- survival::survfit(survival::Surv(eventtime, status) ~ exposure, data = sim_df)

gtsummary::tbl_survfit(
  list(km_overall, km_exposure),
  times = c(1, 5),
  label_header = "**{time} Years**"
  )

Characteristic	1 Years	5 Years
Overall	61% (60%, 63%)	19% (18%, 21%)
exposure
0	56% (54%, 59%)	15% (13%, 17%)
1	69% (66%, 72%)	26% (23%, 29%)

Given, that the true exposure effect is null, the crude model is severely biased as we can see even more clearly in the crude Kaplan-Meier curve.

km_exposure <- survival::survfit(survival::Surv(eventtime, status) ~ exposure, data = sim_df)

survminer::ggsurvplot(
  km_exposure, 
  data = sim_df,
  conf.int = TRUE,
  surv.median.line = "hv",
  palette = "jco",
  xlab = "Time [Years]",
  legend.labs = c("Comparator", "Exposure of interest")
  )

Cox proportional hazards

After adjusting, the simulated data results in the following hazard ratio (HR) estimates.

cox_lhs <- "survival::Surv(eventtime, status)"
cox_rhs <- paste(colnames(sim_covar), collapse = " + ")
cox_form = as.formula(paste(cox_lhs, "~ exposure +", cox_rhs))
  
cox_fit <- survival::coxph(cox_form, data = sim_df)

cox_fit %>% 
  gtsummary::tbl_regression(exponentiate = T)

Characteristic	HR¹	95% CI¹	p-value
exposure	1.01	0.91, 1.12	0.8
age_num	1.05	1.04, 1.05	<0.001
female_cat	0.92	0.84, 1.01	0.087
ecog_cat	1.15	1.05, 1.26	0.002
smoking_cat	1.44	1.30, 1.60	<0.001
physical_cat	0.84	0.77, 0.92	<0.001
egfr_cat	0.52	0.46, 0.58	<0.001
alk_cat	0.88	0.66, 1.16	0.4
pdl1_num	0.98	0.98, 0.98	<0.001
histology_cat	1.17	1.05, 1.30	0.004
ses_cat
1_low	—	—
2_middle	1.03	0.92, 1.16	0.6
3_high	1.04	0.92, 1.17	0.5
copd_cat	0.90	0.81, 1.00	0.042
¹ HR = Hazard Ratio, CI = Confidence Interval

Export `smdi_data_complete`

To provide flexibility to play with the complete data before introducing missingness, we offer two datasets:

smdi_data_complete: complete data
smdi_data: data with certain covariates missing (see next chapter)

smdi_data_complete <- sim_df
usethis::use_data(smdi_data_complete, overwrite = TRUE)

Introduce missingness

In many different quantitative disciplines from classic epidemiology to machine and deep learning there is an increasing interest in utilizing electronic health records (EHR) to develop prognostic/predictive models or study the comparative effectiveness and safety of medical interventions. Especially information on variables which are not readily available in other datasets (e.g. administrative claims) are of high interest, including vital signs, biomarkers and lab data. However, these covariates are often just partially observed for various reasons:

Physician did simply not perform/order a certain test
Certain measurements are just collected for particularly sick patients
Information is ‘hiding’ in unstructured records, e.g. clinical notes

To illustrate smdi's main functions using this dataset, we introduce some missingness to relevant covariates, which are critical confounders of the causal exposure-outcome relationship in the smdi_data dataset.

In order to introduce missingness following different missingness mechanisms, we use the ampute function of the mice package. An excellent tutorial on this very flexible and elegant function can be found here.

In brief, we introduce missingness by…

Determining a missingess pattern
Creating a weight vector, using fitted linear models for the probability of observations becoming missing
Specifying a type of logistic probability distribution for the missingness weights
Defining an overall missingness proportion

# prepare a placeholder df for missing simulation
# we do not consider ses_cat
tmp <- smdi_data_complete %>% 
  dplyr::select(-c(ses_cat))

# determine missingness pattern template
miss_pattern <- rep(1, ncol(tmp))

Missing complete at random

ecog_cat¹ will be set to missing according to the following specification:

Missingness pattern: Only ecog_cat will be set to missing
Weight vector: There are no weights, since the missingness is not dependent on any other observed or unobserved covariates
Type: Not applicable

# specify missingness pattern
# (0 = set to missing, 1 = remains complete)
mcar_col <- which(colnames(tmp)=="ecog_cat")
miss_pattern_mcar <- replace(miss_pattern, mcar_col, 0)

miss_prop_mcar <- .35

set.seed(42)
smdi_data_mcar <- mice::ampute(
  data = tmp,
  prop = miss_prop_mcar,
  mech = "MCAR",
  patterns = miss_pattern_mcar,
  bycases = TRUE
  )$amp %>% 
  dplyr::select(ecog_cat)

smdi_data_mcar %>% 
  dplyr::select(ecog_cat) %>% 
  dplyr::mutate(ecog_cat = forcats::fct_na_value_to_level(factor(ecog_cat), level = "missing")) %>% 
  gtsummary::tbl_summary()

Characteristic	N = 2,500¹
ecog_cat
0	629 (25%)
1	972 (39%)
missing	899 (36%)
¹ n (%)

Missing at random

egfr_cat will be set to missing according to the following specification:

Missingness pattern: Only egfr_cat will be set to missing
Weight vector: all covariates (except egfr_cat itself and ses_cat) contribute with equal weights as linear predictors for the probability of observations becoming missing
Type: higher weighted sum scores (i.e. higher odds of having an egfr_cat mutation) will have a larger probability of becoming incomplete
Define an overall missingness proportion of ~40%

# specify missingness pattern
# (0 = set to missing, 1 = remains complete)
mar_col <- which(colnames(tmp)=="egfr_cat")
miss_pattern_mar <- replace(miss_pattern, mar_col, 0)

# weights to compute missingness probability 
# by assigning a non-zero value
miss_weights_mar <- rep(1, ncol(tmp))
miss_weights_mar <- replace(miss_weights_mar, mar_col, 0)

miss_prop_mar <- .4

set.seed(42)
smdi_data_mar <- mice::ampute(
  data = tmp,
  prop = miss_prop_mar,
  mech = "MAR",
  patterns = miss_pattern_mar,
  weights = miss_weights_mar,
  bycases = TRUE,
  type = "RIGHT"
  )$amp

smdi_data_mar %>% 
  dplyr::select(egfr_cat) %>% 
  dplyr::mutate(egfr_cat = forcats::fct_na_value_to_level(factor(egfr_cat), level = "missing")) %>%
  gtsummary::tbl_summary()

Characteristic	N = 2,500¹
egfr_cat
0	1,175 (47%)
1	310 (12%)
missing	1,015 (41%)
¹ n (%)

Missing not at random - value

Missingness pattern: Only pdl1_num will be set to missing
Weight vector: Only pdl1_num itself (by a non-zero value) is the linear predictor for the probability of observations becoming missing
Type: Specify a type of logistic probability distribution for the missingness weights, so that cases with low weighted sum scores (i.e. lower pdl1_num expression) will have a larger probability of becoming incomplete
Define an overall missingness proportion of ~20%

# determine missingness pattern
mnar_v_col <- which(colnames(tmp)=="pdl1_num")
miss_pattern <- rep(1, ncol(tmp))
miss_pattern_mnar_v <- replace(miss_pattern, mnar_v_col, 0)

# weights to compute missingness probability 
# by assigning a non-zero value
# MNAR_v: covariate itself is only predictor
miss_weights_mnar_v <- rep(0, ncol(tmp))
miss_weights_mnar_v <- replace(miss_weights_mnar_v, mnar_v_col, 1)

miss_prop_mnar_v <- .2

set.seed(42)
smdi_data_mnar_v <- mice::ampute(
  data = tmp,
  prop = miss_prop_mnar_v,
  mech = "MNAR",
  patterns = miss_pattern_mnar_v,
  weights = miss_weights_mnar_v,
  bycases = TRUE,
  type = "LEFT"
  )$amp

smdi_data_mnar_v %>% 
  dplyr::select(pdl1_num) %>% 
  gtsummary::tbl_summary()

Characteristic	N = 2,500¹
pdl1_num	46 (39, 53)
Unknown	517
¹ Median (IQR)

Assemble final dataset

smdi_data <- smdi_data_complete %>% 
  dplyr::select(-c(ecog_cat, egfr_cat, pdl1_num)) %>% 
  dplyr::bind_cols(ecog_cat = smdi_data_mcar$ecog_cat, egfr_cat = smdi_data_mar$egfr_cat, pdl1_num = smdi_data_mnar_v$pdl1_num) %>% 
  mutate(across(ends_with("cat"), as.factor))

Overview final simulated dataset

Let’s have a look at the final dataset.

smdi_data %>% 
  gtsummary::tbl_summary(by = "exposure") %>% 
  gtsummary::add_overall() %>% 
  gtsummary::add_difference()

Characteristic	Overall, N = 2,500¹	0, N = 1,502¹	1, N = 998¹	Difference²	95% CI^2,3	p-value²
age_num	61 (51, 70)	63 (54, 73)	57 (48, 66)	6.5	5.4, 7.6	<0.001
female_cat				0.12	0.04, 0.20
0	1,578 (63%)	914 (61%)	664 (67%)
1	922 (37%)	588 (39%)	334 (33%)
smoking_cat				0.22	0.14, 0.30
0	1,247 (50%)	816 (54%)	431 (43%)
1	1,253 (50%)	686 (46%)	567 (57%)
physical_cat				0.10	0.02, 0.18
0	1,618 (65%)	1,000 (67%)	618 (62%)
1	882 (35%)	502 (33%)	380 (38%)
alk_cat				0.02	-0.06, 0.10
0	2,431 (97%)	1,463 (97%)	968 (97%)
1	69 (2.8%)	39 (2.6%)	30 (3.0%)
histology_cat				0.07	-0.01, 0.15
0	1,992 (80%)	1,214 (81%)	778 (78%)
1	508 (20%)	288 (19%)	220 (22%)
ses_cat				0.05	-0.03, 0.13
1_low	524 (21%)	302 (20%)	222 (22%)
2_middle	969 (39%)	592 (39%)	377 (38%)
3_high	1,007 (40%)	608 (40%)	399 (40%)
copd_cat				0.22	0.14, 0.30
0	1,162 (46%)	764 (51%)	398 (40%)
1	1,338 (54%)	738 (49%)	600 (60%)
eventtime	1.57 (0.52, 3.95)	1.28 (0.43, 3.13)	2.18 (0.75, 5.00)	-0.64	-0.78, -0.49	<0.001
status	2,017 (81%)	1,277 (85%)	740 (74%)	11%		<0.001
ecog_cat				0.04	-0.06, 0.14
0	629 (39%)	368 (39%)	261 (40%)
1	972 (61%)	586 (61%)	386 (60%)
Unknown	899	548	351
egfr_cat				0.20	0.09, 0.30
0	1,175 (79%)	823 (82%)	352 (74%)
1	310 (21%)	184 (18%)	126 (26%)
Unknown	1,015	495	520
pdl1_num	46 (39, 53)	42 (35, 49)	51 (45, 58)	-9.1	-10, -8.2	<0.001
Unknown	517	376	141
¹ Median (IQR); n (%)
² Welch Two Sample t-test; Standardized Mean Difference; 3-sample test for equality of proportions without continuity correction
³ CI = Confidence Interval

The Eastern Cooperative Oncology Group (ECOG) performance score is a clinical measure for how the cancer disease affects the daily living abilities of the patient and is often used as a patient inclusion criterion for clinical trials. The scale ranges from 0 to 5 and typically only patients with 0 and 1 are eligible. Let’s assume for our example, we already have a subset of such a pre-selected clinical trial-like cohort, but we still want to adjust for a baseline ECOG of 0 and 1 (definitions are taken from https://ecog-acrin.org/resources/ecog-performance-status/↩︎

Data generation

A simulated dataset for package functionality demonstration

Janick Weberpals

`smdi` dataset background

Exposure and outcome

Confounders

Missingness

Overview covariates/confounder structure

Simulation of covariates and exposure

Simulate time-to-event

Kaplan-Meier estimates

Cox proportional hazards

Export `smdi_data_complete`

Introduce missingness

Missing complete at random

Missing at random

Missing not at random - value

Assemble final dataset

Overview final simulated dataset

Export `smdi_data`

Data generation

A simulated dataset for package functionality demonstration

Janick Weberpals

smdi dataset background

Exposure and outcome

Confounders

Missingness

Overview covariates/confounder structure

Simulation of covariates and exposure

Simulate time-to-event

Kaplan-Meier estimates

Cox proportional hazards

Export smdi_data_complete

Introduce missingness

Missing complete at random

Missing at random

Missing not at random - value

Assemble final dataset

Overview final simulated dataset

Export smdi_data

`smdi` dataset background

Export `smdi_data_complete`

Export `smdi_data`