What is the Delta Method?

Alex Gold, Nat Olin, Annie Wang

2020-11-21

There are a number of ways to compute the standard errors for the margins of a regression. It might be possible to derive a probability density function for the margin itself, but that is often a huge pain and a closed form might not even exist. It is also possible to use simulation or bootstrapping to create standard errors for the margin.

In this package, we follow Stata’s margins command and use the delta method, which is a semi-parametric method that takes advantage of a closed-form solution to \(\frac{d(\text{link}^{-1}(X \beta))}{d(X \beta)}\) to improve computational time relative to simulation or bootstrap methods.

The Delta Method

The delta method is a general method for deriving the variance of a function of asymptotically normal random variables with known variance. In this case, the delta method takes advantage of the fact that the margin is (usually) an infinitely differentiable function of the data, \(X\), and the vector of \(\beta\)s to derive a closed-form solution for the standard errors of the margin.1

In particular, the delta method uses a Taylor series expansion of the inverse link function of the regression to approximate the margin in the neighborhood of \(X\) and the \(\beta\)s and derive variation near that point.

A Reminder about Taylor Series

The Taylor expansion is a useful tool because it allows us to restate a differentiable function, \(G(x)\), in terms of (an infinite sum of) the derivatives of \(G(x)\). To be more precise, an infinitely differentiable \(G(x)\) evaluated at \(a\) can be written as

\[G(x) = G(a) + \frac{G'(a)}{1!}(x - a) + \frac{G''(a)}{2!}(x-a)^2 + \frac{G'''(a)}{3!}(x-a)^3 + \dots\]

If we cut off the expansion after some number of terms (two is common), we can get a useful approximation of \(G(x)\).
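For instance, here is a small R sketch (the function and the points a and x are arbitrary choices for illustration) that keeps just the first two terms of the expansion of the inverse logit function and compares the result to the exact value at a nearby point:

# First-order (two-term) Taylor approximation of the inverse logit
# G(z) = 1 / (1 + exp(-z)) around a point a, compared to the exact value at x.
G  <- function(z) 1 / (1 + exp(-z))           # G
dG <- function(z) exp(-z) / (1 + exp(-z))^2   # G'

a <- 0.5   # expansion point
x <- 0.6   # nearby point where we want to approximate G(x)

taylor_approx <- G(a) + dG(a) * (x - a)
c(exact = G(x), approx = taylor_approx)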

Taylor Series and the Delta Method

In the case of predicted margins (levels), where the regression model has link function \(\text{link}\)2, the column vector of predicted levels \(P_m\) at covariates \(X_1\) is

\[P_m(X_1 \beta) = \text{link}^{-1}(X_1 \beta)\]

The predicted effects \(P_e\) of that same regression are a function of \(X \beta\) such that

\[P_e(X_1 \beta) = \frac{d(\text{link}^{-1}(X_1 \beta))}{d(X \beta)}\]

Depending on whether the effect is over a continuous or categorical variable, this may be an actual derivative (the instantaneous rate of change) or the subtraction of \(P(X \beta)\) calculated at one value of \(X\) from another (the first difference).
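To make the distinction concrete, here is a tiny R sketch for a logit model (the numbers in beta_demo and the covariate rows are made up for illustration; plogis is R's inverse logit):

# Predicted level: apply the inverse link to X beta.
# Predicted effect of a categorical variable: the first difference of two levels.
beta_demo <- c(-1, 0.5)        # made-up intercept and coefficient
X_treat   <- c(1, 1)           # covariate row with the variable of interest = 1
X_ctrl    <- c(1, 0)           # covariate row with the variable of interest = 0

level_treat <- plogis(sum(X_treat * beta_demo))   # link^{-1}(X_1 beta)
level_ctrl  <- plogis(sum(X_ctrl * beta_demo))    # link^{-1}(X_2 beta)
effect      <- level_treat - level_ctrl           # first difference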

Using the Taylor expansion, we can approximate \(P\), an arbitrary function of the random variable \(X \beta\), around the point \(X_1 \beta\) as3

\[P(X \beta) = P(X_1 \beta) + \frac{d(P(X_1 \beta))}{d(X\beta)}(X\beta - X_1 \beta)\]

Now we can substitute for the different margins we’ll care about. For predicted levels, the Taylor expansion is

\[P_m(X \beta) = \text{link}^{-1}(X_1 \beta) + \frac{d(\text{link}^{-1}(X_1 \beta))}{d(X \beta)}(X\beta - X_1 \beta)\]

For the predicted effects of categorical variables, we are estimating \(P_e\) evaluated at \(X_1 \beta - X_2 \beta\), which gives us

\[\begin{aligned} P_e(X \beta) &= \text{link}^{-1} (X_1 \beta - X_2 \beta) + \frac{d(\text{link}^{-1}(X_1 \beta - X_2 \beta))}{d(X \beta)}(X\beta - (X_1 \beta - X_2 \beta)) \end{aligned} \]

For the predicted effects of continuous variables, the marginal effect is a derivative, so

\[\begin{aligned} P_e(X \beta) &= \frac{d(\text{link}^{-1}(X_1 \beta))}{d(X \beta)} + \frac{d^2(\text{link}^{-1}(X_1 \beta))}{d(X \beta)^2}\left(X\beta - X_1 \beta\right) \end{aligned} \]
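For the logit link in particular, that derivative of the inverse link has a simple closed form, which is exactly what the delta method exploits. A quick R sketch (the point z and step eps are arbitrary) checks the closed form against a finite-difference slope:

# d link^{-1}(z) / dz for the logit link is exp(-z) / (1 + exp(-z))^2,
# which is also R's dlogis(). Check it against a finite-difference slope.
z   <- 1.2
eps <- 1e-6
analytic_slope <- exp(-z) / (1 + exp(-z))^2
numeric_slope  <- (plogis(z + eps) - plogis(z)) / eps
all.equal(analytic_slope, numeric_slope, tolerance = 1e-4)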

How does the Delta Method Work?

The equations above describe how to approximate predicted levels or effects, but why not just calculate our estimate \(P(X\beta)\) directly?

Well, we can do that for the point estimate, but we also want to calculate errors on that estimate, and the variance of \(P(X\beta)\) isn't known. Fortunately, we can calculate the variance of the approximations above. Wikipedia contains a really nice derivation of the multivariate delta method:

\[\text{Var}[P(X \beta)] = \text{Var}\left[P(X_1 \beta) + \frac{d(P(X_1 \beta))}{d(X\beta)}(X\beta - X_1 \beta)\right]\]

Because \(X_1\beta\) is a known point, both \(P(X_1 \beta)\) and the derivative evaluated there are constants, and additive constants have no effect on the variance, so this simplifies to

\[\text{Var}[P(X \beta)] = \text{Var}\left[\frac{d(P(X_1 \beta))}{d(X\beta)} \cdot X\beta\right]\]

The first term in the variance is the vector of partial derivatives of our estimator, also known as our jacobian matrix.4 For any predicted level indexed by \(i\) in a regression, the \(i,j\)th element of the jacobian will be the derivative of predicted level \(i\) with respect to coefficient \(j\). Those are fixed quantities. We also know the variance of \(X\beta\), since that's our variance-covariance matrix \(V\).5

So we want the variance of a random variable multiplied by a fixed matrix, which we can find.6

\[\text{Var}[P(X \beta)] = \left(\frac{d(P(X_1 \beta))}{d(X\beta)}\right)^T \ V \left(\frac{d(P(X_1 \beta))}{d(X\beta)}\right)\]

So practically speaking, to get our variance, we’ll pre- and post-multiply the partial derivatives of the inverse link function by the original variance-covariance matrix from the regression.

In short,

  1. Calculate the jacobian matrix of the inverse link function of \(X \beta\), \(J\).7

  2. Get the variance-covariance matrix, \(V\), from the regression output, or calculate it some other way.8

  3. Sandwich multiply the matrices: \(J V J^{T}\). You'll end up with a \(k \times k\) matrix whose diagonal holds the variances of the \(k\) predicted levels/effects; the standard errors are the square roots of the diagonal. A minimal sketch of these steps follows below.
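To make the recipe concrete, here is a minimal sketch in R (delta_se is just an illustrative helper, not a modmarg function; it assumes you have already built the \(k \times p\) jacobian \(J\) and have the \(p \times p\) variance-covariance matrix \(V\)):

# A minimal sketch of the sandwich step: given a k x p jacobian J and a p x p
# variance-covariance matrix V, return the delta-method standard errors of the
# k predicted levels/effects.
delta_se <- function(J, V) {
  var_mat <- J %*% V %*% t(J)   # k x k variance-covariance matrix of the margins
  sqrt(diag(var_mat))           # standard errors are the square roots of the diagonal
}

In the worked example below, \(J\) ends up being a \(1 \times 4\) row vector, so this step returns a single standard error.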

Example

Say we’re fitting a logistic regression using the margex data, and we’re interested in the predicted outcome for different treatment groups:

library(modmarg)
data(margex)
lg <- glm(outcome ~ treatment * age, data = margex, family = 'binomial')
summary(lg)
## 
## Call:
## glm(formula = outcome ~ treatment * age, family = "binomial", 
##     data = margex)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3200  -0.6033  -0.3212  -0.1580   2.9628  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -7.03092    0.50247 -13.993   <2e-16 ***
## treatment      1.35170    0.62208   2.173   0.0298 *  
## age            0.11060    0.01069  10.347   <2e-16 ***
## treatment:age -0.01046    0.01301  -0.804   0.4216    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2732.1  on 2999  degrees of freedom
## Residual deviance: 2169.4  on 2996  degrees of freedom
## AIC: 2177.4
## 
## Number of Fisher Scoring iterations: 6

where the treatment variable \(\tau \in \{0, 1\}\). What is the average outcome when \(\tau = 0\) or \(\tau = 1\)?

Based on these results alone, we don’t really know what the predicted outcome is for different treatment groups because we can’t directly interpret the coefficients: there’s an interaction term, and the effect of each covariate depends on the levels of the other covariates.

Point estimates

To get an estimate of the average levels of the outcome for different values of treatment, we can set treatment to 0 or 1 for the entire dataset, generate predicted outcomes for the dataset, and average the results (in other words, what would the outcome have been if everybody was in the control / treatment group?).

The linear predictors of our model (in this case, setting \(\tau = t\)) are given by

\[ \hat{y}_i = \alpha + \beta_1 \cdot t + \beta_2 \cdot \text{age}_i + \beta_3 \cdot (t \cdot \text{age}_i) \]

also known as \(\hat{y}_i = X_i\beta\). We can then transform the linear predictors to the scale of the outcome variable (predicted probabilities) with the inverse link function:

\[f(z) = \frac{1}{1 + \exp(-z)}\]

Finally, we can find the average predicted level by taking the average of these predicted probabilities. Combining these, we have

\[ P(X\beta) = \frac{1}{n} \sum_{i = 1}^n \frac{1}{1 + \exp(-X_i\beta)} \]

Now let's do this in R: set treatment to 0 or 1, generate linear predictors, transform them to predicted outcomes, and take the mean.

# Extract the n x k matrix of data
x <- model.matrix(lg)

# Extract the coefficients from the model (a column vector with k entries)
beta <- matrix(lg$coefficients)

# CONTROL:

# Set treatment and treatment:age to 0 for all observations
x[, "treatment"] <- 0
x[, 'treatment:age'] <- x[, 'treatment'] * x[, 'age']

# Get linear predictors
pred_ctl <- x %*% beta

# Apply the inverse link function to get predicted probabilities
pp_ctl <- 1 / (1 + exp(-pred_ctl))

# Get the average predicted probability
mean_pp_ctl <- mean(pp_ctl)

# TREATMENT:

# Set treatment to 1 and treatment:age to age for all observations
x[, "treatment"] <- 1
x[, 'treatment:age'] <- x[, 'treatment'] * x[, 'age']

# Get linear predictors
pred_treat <- x %*% beta

# Apply the inverse link function to get predicted probabilities
pp_treat <- 1 / (1 + exp(-pred_treat))

# Get the average predicted probability
mean_pp_treat <- mean(pp_treat)

# RESULTS:

mean_pp_ctl
## [1] 0.1126685
mean_pp_treat
## [1] 0.208375

Variance

So far this is pretty straightforward, but we want to know the standard error of these estimates.

You might be tempted to just use the R command predict(model, newdata, se.fit = TRUE), fix treatment to 0 or 1, and find the average of the resulting standard errors. Don't do that! You'd be averaging the standard errors of each individual observation's predicted outcome, which is not the same thing as the standard error of the single averaged estimate we're looking for.
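For concreteness, the tempting-but-wrong calculation would look something like this sketch (type = 'response' puts the standard errors on the probability scale; wrong_se is shown only to illustrate the mistake):

# WRONG: this averages each observation's standard error, which is not the
# standard error of the average predicted probability.
newdata_ctl <- transform(margex, treatment = 0)
pred_wrong <- predict(lg, newdata = newdata_ctl, type = 'response', se.fit = TRUE)
wrong_se <- mean(pred_wrong$se.fit)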

As explained above, we can approximate the variance of the predicted margins. Using \(J\) to represent the jacobian (the vector of derivatives of our transformation with respect to each \(\beta\) parameter), we have

\[\text{Var}(P(X\beta)) = J V J^T\]

Going back to our function from before, let’s start with the coefficient on treatment, \(\beta_1\). First we just apply the quotient rule (and the chain rule in the last step):

\[\begin{align*} P(X\beta) &= \frac{1}{n} \sum_{i = 1}^n \frac{1}{1 + \exp(-X_i\beta)}\\ \frac{\partial}{\partial \beta_1} P(X\beta) &= \frac{\partial}{\partial \beta_1} \left[\frac{1}{n} \sum_{i = 1}^n \frac{1}{1 + \exp(-X_i\beta )}\right]\\ &= \frac{1}{n} \sum_{i = 1}^n \frac{\partial}{\partial \beta_1} \left[\frac{1}{1 + \exp(-X_i\beta)}\right]\\ &= \frac{1}{n} \sum_{i = 1}^n \frac{-\frac{\partial}{\partial \beta_1}[1 + \exp(-X_i\beta)]}{(1 + \exp(-X_i\beta))^2} \\ &= \frac{1}{n} \sum_{i = 1}^n \frac{-\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \frac{\partial}{\partial \beta_1} [-X_i\beta] \end{align*}\]

Coming up for air for a second: what’s the last term in that final expression?

\(X_i\beta = \alpha + \beta_1 \tau_i + \beta_2 \text{age}_i + \beta_3 (\tau_i \cdot \text{age}_i)\), so the derivative of that with respect to \(\beta_1\) is just \(\tau_i\). Therefore,

\[\begin{align*} \frac{\partial}{\partial \beta_1} P(X\beta) &= \frac{1}{n} \sum_{i = 1}^n \frac{-\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot (-\tau_i)\\ &= \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \tau_i \end{align*}\] which is the \(\beta_1\) entry of our jacobian.9

What about the other terms? Well, because the \(X_i\beta\) term is additively separable, each entry has the same form, except for a different multiplier at the end. For example, the derivative with respect to \(\beta_2\) is going to be

\[\frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \text{age}_i\].

So the full jacobian is \[ J = \left[ \begin{array}{l} \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot 1\\ \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \tau_i\\ \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \text{age}_i\\ \frac{1}{n} \sum_{i = 1}^n \frac{\exp(-X_i\beta)}{(1 + \exp(-X_i\beta))^2} \cdot \tau_i \cdot \text{age}_i \end{array} \right] \]

Treating \(J\) as a row vector, this can be rewritten as

\[ J = \frac{1}{n} \left[ \begin{array}{cccc} \frac{\exp(-X_1\beta)}{(1 + \exp(-X_1\beta))^2} & \frac{\exp(-X_2\beta)}{(1 + \exp(-X_2\beta))^2} & \cdots & \frac{\exp(-X_n\beta)}{(1 + \exp(-X_n\beta))^2} \end{array} \right] \left[ \begin{array}{cccc} 1 & \tau_1 & \text{age}_1 & \tau_1 \cdot \text{age}_1 \\ 1 & \tau_2 & \text{age}_2 & \tau_2 \cdot \text{age}_2 \\ \vdots & \vdots & \vdots & \vdots\\ 1 & \tau_n & \text{age}_n & \tau_n \cdot \text{age}_n \\ \end{array} \right] \]

The first term applies the derivative of the inverse link function to every linear predictor in the model, and the second term is the covariate matrix from our data! How convenient! That's going to be true for all generalized linear models. So we can rewrite this as

\[J = \frac{\left[\begin{array}{cccc} \frac{\exp(-X_1\beta)}{(1 + \exp(-X_1\beta))^2} & \frac{\exp(-X_2\beta)}{(1 + \exp(-X_2\beta))^2} & \cdots & \frac{\exp(-X_n\beta)}{(1 + \exp(-X_n\beta))^2}\end{array}\right] X}{n}\]
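As a quick sanity check on the algebra (a sketch reusing the lg model from above; xt, b, and b_eps are just scratch names), the \(\beta_1\) entry of the jacobian should match a finite-difference slope of the average predicted probability:

# Sanity check: perturb beta_1 (the treatment coefficient) slightly and compare
# the finite-difference slope of the average predicted probability with the
# analytic jacobian entry derived above. Here treatment is fixed at 1.
xt <- model.matrix(lg)
xt[, 'treatment'] <- 1
xt[, 'treatment:age'] <- xt[, 'treatment'] * xt[, 'age']
b <- matrix(lg$coefficients)

eps <- 1e-6
b_eps <- b
b_eps[2] <- b_eps[2] + eps   # the second coefficient is beta_1 (treatment)

numeric_deriv  <- (mean(1 / (1 + exp(-xt %*% b_eps))) -
                   mean(1 / (1 + exp(-xt %*% b)))) / eps
analytic_deriv <- mean(exp(-xt %*% b) / (1 + exp(-xt %*% b))^2 * xt[, 'treatment'])

all.equal(numeric_deriv, analytic_deriv, tolerance = 1e-4)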

Recalling that our variance is \(J V J^T\), and that we’ve already calculated \(X_i\beta\) for all \(i\), this is relatively simple to do in R:

# Get the data
x <- model.matrix(lg)

# CONTROL ERROR

# Apply the derivative of the inverse link function to the linear predictors
deriv <- as.vector(exp(-pred_ctl) / (1 + exp(-pred_ctl))^2)

# Set treatment to 0
x[, 'treatment'] <- 0
x[, 'treatment:age'] <- x[, 'treatment'] * x[, 'age']

# Complete the chain rule by matrix-multiplying the derivatives by the data;
# now we have the jacobian
j <- deriv %*% x / nrow(x)

# The variance of our estimate is the sandwich product J V J^T of the jacobian
# and the model's variance-covariance matrix
variance <- j %*% vcov(lg) %*% t(j)

# The error is the square root of that
se_ctl <- sqrt(diag(variance))

# TREATMENT ERROR: same logic

deriv <- as.vector(exp(-pred_treat) / (1 + exp(-pred_treat))^2)

x <- model.matrix(lg)

x[, 'treatment'] <- 1
x[, 'treatment:age'] <- x[, 'treatment'] * x[, 'age']

j <- deriv %*% x / nrow(x)

variance <- j %*% vcov(lg) %*% t(j)
se_treat <- sqrt(diag(variance))

se_ctl
## [1] 0.009334252
se_treat
## [1] 0.009089667

Review

OK, we’ve got our estimates and variance:

result <- data.frame(
  Label = c("treatment = 0", "treatment = 1"),
  Margin = c(mean_pp_ctl, mean_pp_treat),
  Standard.Error = c(se_ctl, se_treat)
)

result
##           Label    Margin Standard.Error
## 1 treatment = 0 0.1126685    0.009334252
## 2 treatment = 1 0.2083750    0.009089667

What do we get from modmarg?

marg <- modmarg::marg(mod = lg, var_interest = 'treatment')
marg[[1]][, c("Label", "Margin", "Standard.Error")]
##           Label    Margin Standard.Error
## 1 treatment = 0 0.1126685    0.009334252
## 2 treatment = 1 0.2083750    0.009089667

Hooray!


  1. “Usually” because this statement requires that the canonical link function for the regression has a closed-form derivative. Luckily, this is true for most common generalized linear models.↩︎

  2. The exact form of the link function and its inverse will depend on the type of regression. For example, the logit function is the canonical link function for logistic regression and allows transformations between probabilities and log-odds.↩︎

  3. Note that we treat the input \(X\) as fixed and \(\beta\) as a random variable.↩︎

  4. The Jacobian matrix is just the name of the matrix of all first partial derivatives of a vector-valued function.↩︎

  5. As a quick reminder about the variance-covariance matrix, if your model contains coefficients \(b_0\) to \(b_n\) (each with mean \(\mu_{b_i}\) and standard deviation \(\sigma_{b_i}\)), the \(i,j\)th element of the variance-covariance matrix is \(cov(b_i, b_j)\).↩︎

  6. Quick reminder: for a fixed scalar \(a\) and a random variable \(r\) with variance \(b\), \(\text{Var}(a \cdot r)\) is \(a^2 \cdot b\). The matrix analog, for a fixed matrix \(A\) and a random vector with variance-covariance matrix \(B\), is \(ABA^T\).↩︎

  7. Some practical notes on calculating the jacobian: the example at the bottom of the vignette captures how to calculate the jacobian for predicted levels pretty thoroughly. For predicted effects, the big change is that you are now calculating the variance on predicted effects, so you need to take the second derivative of the inverse link function (instead of the first derivative). For categorical variables, you can take the second derivative by subtracting each of the levels from the base level and returning that as the jacobian. For continuous variables, you need to actually compute the second derivative of the inverse link function and use that in place of the first derivative above. You need to explicitly compute the second derivative because you want an instantaneous rate of change as opposed to the rate of change over a range as with categorical variables above.↩︎

  8. Usually you’ll just want the variance-covariance matrix of the regression, but you’ll need to modify the standard variance-covariance matrix if you want to cluster standard errors or something.↩︎

  9. Note that the first part of the expression is sometimes written \(\frac{1}{1 + \exp(-X_i\beta)} \cdot \frac{1}{1 + \exp(X_i\beta)}\), which is the same quantity and shows that \(\frac{d}{d(X\beta)} \text{logit}^{-1}(X\beta) = \text{logit}^{-1}(X\beta) \cdot \text{logit}^{-1}(-X\beta)\).↩︎