Using Survey Weights with nhanesA

Robert Gentleman and Teresa Filshtein Sonmez

2024-04-23

NHANES and Survey Weights

NHANES aims to produce national estimates on a range of health, nutrition, and other factors that accurately represent the non-institutionalized civilian U.S. population. The data is gathered using a complex, four-stage sampling design. When analyzing this data, it’s crucial to use specialized survey analysis techniques that incorporate specific weights. These weights account for the intricate sampling strategy employed during data collection.

You can find the NHANES sample weights as part of the downloadable data. For a detailed understanding of these weights and their correct usage, the CDC provides extensive documentation: CDC NHANES Tutorials.

It’s worth noting that the methodologies and strategies for NHANES have evolved over time. Reasons for these changes range from methodological enhancements, external events like the COVID epidemic, to a recent shift towards a more longitudinal perspective. Such changes can significantly influence any analysis. Thus, analysts should approach this data with diligence and care.

There are numerous online resources that offer worked examples for further understanding:

  1. The official resource for the R survey package: R-Survey.
  2. An up-to-date resource by the Advanced Research Computing group at UCLA: Survey Data Analysis with R.
  3. A growing collection of presentations on platforms like YouTube.

This vignette’s intent is not to serve as an exhaustive guide. Instead, we aim to showcase some fundamental functions and point you towards more in-depth online resources necessary for specialized analyses. Here we will cover the basics of setting up the survey design and performing basic tasks. For those interested in diving deeper, we encourage you to develop and share more comprehensive documents.

Example

Using the nhanesA and survey packages, we will explore the average blood pressure of individuals over 40 years of age, segmented by their reported ethnicity, during the 2017-2018 cycle. To achieve this, we’ll merge the demographic (DEMO_J) and blood pressure (BXP_J) datasets. Our examination will cover average blood pressure distributions across different sexes and ethnicities, and we’ll conduct a regression analysis to understand the relationship between blood pressure and age. This is a basic demonstration; comprehensive analyses would need to consider various risk factors and align with specific epidemiological queries.

Pull in the relevant data

library("nhanesA")

demoj = nhanes("DEMO_J")
dim(demoj)
## [1] 9254   46
## merge DEMO_J and BXP_J using SEQN.  
bpxj = nhanes("BPX_J")
data = merge(demoj, bpxj, by="SEQN")
dim(data)
## [1] 8704   66

Create survey design object

To generate accurate estimates, it’s essential to establish a survey design object that integrates the weights into our analysis. This design object should be created before any subsetting or manipulation of the data. By doing this, we ensure that intricate survey design elements, like stratification and clustering, are fully captured and applied across the dataset.

It is also crucial to choose the appropriate weighting scheme for your analysis. NHANES offers various weights, such as interview, examination, fasting, and MEC laboratory weights, to name a few. The selection hinges on the specific dataset components and analyses you’re focusing on. Moreover, when analyzing data across multiple cycles, it will be necessary to modify these weights to ensure accurate representation and inference.

For a comprehensive understanding, the CDC provides in-depth guidance on its website: CDC NHANES Weighting Tutorial.

library("survey")
nhanesDesign <- svydesign(id = ~SDMVPSU,  # Primary Sampling Units (PSU)
                          strata  = ~SDMVSTRA, # Stratification used in the survey
                          weights = ~WTMEC2YR,   # Survey weights
                          nest    = TRUE,      # Whether PSUs are nested within strata
                          data    = data)

Subset the survey design object

Next, we subset the data to focus on individuals over 40 years of age. At this stage, it’s imperative to use the tools in the survey package for any data manipulations. This procedure guarantees that weight adjustments are correctly applied, ensuring the validity of subsequent analyses. For comparison purposes, we also create an additional subset without the survey design framework, which allows for easy examination of unadjusted values.

# subset survey design object
dfsub = subset(nhanesDesign, data$RIDAGEYR>=40)

# subset the original dataset
datasub = data[data$RIDAGEYR>=40,]

Basic data exploration

Next, we’ll explore the variable RIDRETH1 from the DEMO_J table, which represents reported ethnicity. We’ll focus on diastolic blood pressure for our analysis. For simplicity in our presentation, we’ll utilize only the first measurement, represented by the variable BPXDI1 in the BPX_J table.

Calculating means

First we split by ethnicity and calculate mean blood pressure for each group on the unweighted data.

## mean on total data set

mean(datasub$BPXDI1, na.rm = TRUE)
## [1] 73.04455
##split the data by ethnicity and calculate mean of the unweighted data

unweighted_means <- sapply(split(datasub$BPXDI1, datasub$RIDRETH1), mean, na.rm=TRUE)
unweighted_means 
##                    Mexican American                      Other Hispanic 
##                            72.41000                            72.97611 
##                  Non-Hispanic White                  Non-Hispanic Black 
##                            70.84130                            75.71466 
## Other Race - Including Multi-Racial 
##                            74.41311

Now, we perform the same analysis using the survey weights.

adjmns = svymean(~BPXDI1, dfsub, na.rm=TRUE)
adjmns
##          mean     SE
## BPXDI1 73.237 0.5597
# By ethnicity
adjmnsbyEth = svyby(~BPXDI1, ~RIDRETH1, dfsub, svymean, na.rm=TRUE)

weighted_means <- as.numeric(adjmnsbyEth$BPXDI1)

combined_data <- cbind(unweighted_means, weighted_means)
colnames(combined_data) <- c("Unweighted", "Weighted")
combined_data
##                                     Unweighted Weighted
## Mexican American                      72.41000 74.03194
## Other Hispanic                        72.97611 74.73756
## Non-Hispanic White                    70.84130 72.45422
## Non-Hispanic Black                    75.71466 75.71874
## Other Race - Including Multi-Racial   74.41311 74.60215

We can do this with gender and any other categorical variable:

# By Gender 
mns = sapply(split(datasub$BPXDI1, datasub$RIAGENDR), mean, na.rm=TRUE)

adjmnsbyGen = svyby(~BPXDI1, ~RIAGENDR, dfsub, svymean, na.rm=TRUE)

combined_data <- cbind(mns, adjmnsbyGen$BPXDI1)
colnames(combined_data) <- c("Unweighted", "Weighted")
combined_data
##        Unweighted Weighted
## Male     74.67330 75.31134
## Female   71.43385 71.31453

In addition to computing the mean, various other summary statistics, such as quantiles and variance, can be readily calculated using the survey design object. Importantly, each of these summary statistics will come with a corresponding standard error that accounts for the weights, providing insights into the precision of the estimates.

Calculating quantiles
# For the unweighted data

quantile(datasub$BPXDI1, c(0.25,0.5,.75), na.rm = TRUE)
## 25% 50% 75% 
##  66  74  82
# For the survey weighted data
svyquantile(~BPXDI1, dfsub, quantiles = c(0.25,0.5,0.75), na.rm=TRUE)
## $BPXDI1
##      quantile ci.2.5 ci.97.5        se
## 0.25       66     66      68 0.4691643
## 0.5        74     74      76 0.4691643
## 0.75       80     80      82 0.4691643
## 
## attr(,"hasci")
## [1] TRUE
## attr(,"class")
## [1] "newsvyquantile"
# By Gender 
svyby(~BPXDI1, ~RIAGENDR, dfsub, svyquantile, quantiles = c(0.5), na.rm=TRUE)
##        RIAGENDR BPXDI1 se.BPXDI1
## Male       Male     76 0.9383286
## Female   Female     72 0.4691643

Note: The columns ci.2.5 and ci.97.5 in the output represent percentile intervals, not the traditional confidence intervals. Specifically, for a given quantile (like the 25th percentile), the values in these columns denote the range between the 2.5th percentile and the 97.5th percentile of that particular quantile estimate, when considering the survey weights.

Calculating variances
# For the entire dataset
svyvar(~BPXDI1, dfsub, quantiles = c(0.25,0.5,0.75), na.rm=TRUE)
##        variance     SE
## BPXDI1   173.68 12.279
# By ethnicity
svyby(~BPXDI1, ~RIDRETH1, dfsub, svyvar, na.rm=TRUE)
##                                                                RIDRETH1
## Mexican American                                       Mexican American
## Other Hispanic                                           Other Hispanic
## Non-Hispanic White                                   Non-Hispanic White
## Non-Hispanic Black                                   Non-Hispanic Black
## Other Race - Including Multi-Racial Other Race - Including Multi-Racial
##                                       BPXDI1       se
## Mexican American                    200.8781 45.43074
## Other Hispanic                      160.0157 23.79938
## Non-Hispanic White                  168.3504 18.30355
## Non-Hispanic Black                  202.3565 37.12593
## Other Race - Including Multi-Racial 156.9968 17.26037
# By Gender 
svyby(~BPXDI1, ~RIAGENDR, dfsub, svyvar, na.rm=TRUE)
##        RIAGENDR   BPXDI1       se
## Male       Male 141.2704 10.52796
## Female   Female 196.1199 20.65706

Conducting Simple Regression Analysis

The svyglm function allows us to perform regression analyses while accounting for the survey’s design. In this section, we will demonstrate how to model diastolic blood pressure levels based on age and ethnicity.

Syntax for svyglm

Using svyglm() is akin to utilizing the glm() function from base R. The key difference is that svyglm() is specifically tailored to work with survey objects, considering the intricate sampling design and weights.

To model diastolic blood pressure (noted as BPXDI1 in our dataset) as a function of age (RIDAGEYR) and ethnicity (RIDRETH3), we can construct our model as outlined below. The resulting output will furnish regression coefficients, standard errors, and significance tests that have been adjusted for the complex survey design. When interpreting these results, it’s crucial to remember that they resemble outputs from a generalized linear model (glm). However, the estimates provided by svyglm() have been weighted to produce nationally representative conclusions.

weighted_model <- svyglm(BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub, family = gaussian())
summary(weighted_model)
## 
## Call:
## svyglm(formula = BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub, 
##     family = gaussian())
## 
## Survey design:
## subset(nhanesDesign, data$RIDAGEYR >= 40)
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                 90.96516    1.24720  72.935
## RIDAGEYR                                    -0.31522    0.01853 -17.012
## RIDRETH1Other Hispanic                       1.43894    1.25518   1.146
## RIDRETH1Non-Hispanic White                   0.49810    0.73132   0.681
## RIDRETH1Non-Hispanic Black                   2.92332    1.09594   2.667
## RIDRETH1Other Race - Including Multi-Racial  1.38532    0.63780   2.172
##                                             Pr(>|t|)    
## (Intercept)                                 5.73e-15 ***
## RIDAGEYR                                    1.04e-08 ***
## RIDRETH1Other Hispanic                        0.2783    
## RIDRETH1Non-Hispanic White                    0.5113    
## RIDRETH1Non-Hispanic Black                    0.0236 *  
## RIDRETH1Other Race - Including Multi-Racial   0.0550 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 162.6653)
## 
## Number of Fisher Scoring iterations: 2

Visualizing differences in weighted and unweighted survey data

Visualizing the differences between weighted and unweighted analyses can offer a clear, intuitive understanding of the impact of weights. We can use bar plots to compare mean diastolic blood pressures across different ethnicities. This plot could be extended to other metrics as well.

library(ggplot2)

## recalculating means (same as above) 

unweighted_means <- sapply(split(datasub$BPXDI1, datasub$RIDRETH1), mean, na.rm=TRUE)
weighted_means <- as.numeric(adjmnsbyEth$BPXDI1)


plot_data <- data.frame(
  Ethnicity = factor(rep(names(unweighted_means), 2),
                     levels = names(unweighted_means)),
  Means = c(unweighted_means, weighted_means),
  Type = factor(c(rep("Unweighted", length(unweighted_means)),
                  rep("Weighted", length(weighted_means))),
               levels = c("Unweighted", "Weighted"))
)

# Creating the plot
ggplot(plot_data, aes(x = Ethnicity, y = Means, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
  labs(title = "Comparison of Diastolic Blood Pressure by Ethnicity",
       y = "Mean Diastolic Blood Pressure") +
  scale_x_discrete(labels = c('Mex-Amer','Other Hisp','White', 'Black', 'Other')) +
  scale_fill_manual(values = c("blue", "red")) +
  theme_minimal() +
  theme(legend.title = element_blank())

Visualizing Regression Results

Visualizing regression coefficients can provide a clear understanding of how different variables impact the outcome, especially when using weighted and unweighted data.

unweighted_model <- glm(BPXDI1 ~ RIDAGEYR + RIDRETH1, data = datasub, family = gaussian())
weighted_model <- svyglm(BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub, family = gaussian())

# For the unweighted model
unweighted_summary <- summary(unweighted_model)
unweighted_coefs <- unweighted_summary$coefficients[, "Estimate"]
unweighted_se <- unweighted_summary$coefficients[, "Std. Error"]

# For the weighted model
weighted_summary <- summary(weighted_model)
weighted_coefs <- as.numeric(weighted_summary$coefficients[, "Estimate"])
weighted_se <- as.numeric(weighted_summary$coefficients[, "Std. Error"])

comparison <- data.frame(
  Variable = names(unweighted_coefs),
  Unweighted = unweighted_coefs,
  Weighted = as.numeric(weighted_coefs)
)

print(comparison)
##                                                                                Variable
## (Intercept)                                                                 (Intercept)
## RIDAGEYR                                                                       RIDAGEYR
## RIDRETH1Other Hispanic                                           RIDRETH1Other Hispanic
## RIDRETH1Non-Hispanic White                                   RIDRETH1Non-Hispanic White
## RIDRETH1Non-Hispanic Black                                   RIDRETH1Non-Hispanic Black
## RIDRETH1Other Race - Including Multi-Racial RIDRETH1Other Race - Including Multi-Racial
##                                             Unweighted   Weighted
## (Intercept)                                 92.2848149 90.9651590
## RIDAGEYR                                    -0.3460552 -0.3152158
## RIDRETH1Other Hispanic                       1.2325637  1.4389441
## RIDRETH1Non-Hispanic White                   0.7553912  0.4981020
## RIDRETH1Non-Hispanic Black                   4.2996735  2.9233165
## RIDRETH1Other Race - Including Multi-Racial  2.0667520  1.3853204
unweighted_df <- data.frame(Variable = names(unweighted_coefs), 
                            Estimate = unweighted_coefs, 
                            SE = unweighted_se, 
                            Type = "Unweighted")

weighted_df <- data.frame(Variable = names(unweighted_coefs), 
                          Estimate = weighted_coefs, 
                          SE = weighted_se, 
                          Type = "Weighted")

plot_data <- rbind(unweighted_df, weighted_df)

ggplot(subset(plot_data, Variable!='(Intercept)'), aes(x = Estimate, y = reorder(Variable, Estimate), color = Type)) +
  geom_point(position = position_dodge(0.5), size = 2.5) +
  geom_errorbarh(aes(xmin = Estimate - SE, xmax = Estimate + SE),
                 height = 0.2, position = position_dodge(0.5)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  labs(title = "Comparison of Regression Coefficients",
       x = "Coefficient Value", y = "Predictors") +
  theme_minimal() +
  scale_color_manual(values = c("Unweighted" = "blue", "Weighted" = "red")) +
  theme(legend.title = element_blank())