Enriching glm objects

Ioannis Kosmidis

2017-11-05

Introduction

The enrichwith R package provides the enrich method to enrich list-like R objects with new, relevant components. The resulting objects preserve their class, so all methods associated with them still apply.

This vignette is a demo of the available enrichment options for glm objects.

Clotting data set

The following data set is provided in McCullagh and Nelder (1989 Section 8.4) and consists of observations on \(n = 18\) mean clotting times of blood in seconds (time) for each combination of nine percentage concentrations of normal plasma (conc) and two lots of clotting agent (lot).

clotting <- data.frame(conc = c(5,10,15,20,30,40,60,80,100,
                                5,10,15,20,30,40,60,80,100),
                       time = c(118, 58, 42, 35, 27, 25, 21, 19, 18,
                                69, 35, 26, 21, 18, 16, 13, 12, 12),
                       lot = factor(c(rep(1, 9), rep(2, 9))))

McCullagh and Nelder (1989 Section 8.4) fitted a series of nested generalized linear models assuming that the times are realisations of independent Gamma random variables whose mean varies appropriately with concentration and lot. In particular, McCullagh and Nelder (1989) linked the inverse mean of the gamma random variables to the linear predictors ~ 1, ~ log(conc), ~ log(conc) + lot, ~ log(conc) * lot and carried out an analysis of deviance to conclude that the ~ log(u) * lot provides the best explanation of clotting times. The really close fit of that model to the data can be seen in the figure below.

library("ggplot2")
clottingML <- glm(time ~ log(conc) * lot, family = Gamma, data = clotting)
alpha <- 0.01
pr_out <- predict(clottingML, type = "response", se.fit = TRUE)
new_data <- clotting
new_data$time <- pr_out$fit
new_data$type <- "fitted"
clotting$type <- "observed"
all_data <- rbind(clotting, new_data)
new_data <- within(new_data, {
    low <- pr_out$fit - qnorm(1 - alpha/2) * pr_out$se.fit
    upp <- pr_out$fit + qnorm(1 - alpha/2) * pr_out$se.fit
})
ggplot(all_data) +
    geom_point(aes(conc, time, col = type), alpha = 0.8) +
    geom_segment(data = new_data, aes(x = conc, y = low, xend = conc, yend = upp, col = type)) +
    facet_grid(. ~ lot) +
    theme_bw() +
    theme(legend.position = "top")

Key quantities in likelihood inference

Score function

The score function is the gradient of the log-likelihood and is a key object for likelihood inference.

The enrichwith R package provides methods for the enrichment of glm objects with the corresponding score function. This can either be done by enriching the glm object with auxiliary_functions and then extracting the score function

library("enrichwith")
enriched_clottingML <- enrich(clottingML, with = "auxiliary functions")
scores_clottingML <- enriched_clottingML$auxiliary_functions$score

or directly using the get_score_function convenience method

scores_clottingML <- get_score_function(clottingML)

By definition, the score function has to have zero components when evaluated at the maximum likelihood estimates (only numerically zero here).

scores_clottingML()
##    (Intercept)      log(conc)           lot2 log(conc):lot2     dispersion 
##   1.227818e-05   2.253249e-05   6.113431e-07   1.548145e-06   1.647695e-09 
## attr(,"coefficients")
##    (Intercept)      log(conc)           lot2 log(conc):lot2 
##   -0.016554382    0.015343115   -0.007354088    0.008256099 
## attr(,"dispersion")
##  dispersion 
## 0.001632971

Information matrix

Another key quantity in likelihood inference is the expected information.

The auxiliary_functions enrichment option of the enrich method enriches a glm object with a function for the evaluation of the expected information.

info_clottingML <- enriched_clottingML$auxiliary_functions$information

and can also be computed directly using the get_information_function method

info_clottingML <- get_information_function(clottingML)

One of the uses of the expected information is the calculation of standard errors for the model parameters. The stats::summary.glm function already does that for glm objects, estimating the dispersion parameter (if any) using Pearson residuals.

summary_clottingML <- summary(clottingML)

Duly, info_clottingML returns (numerically) the same standard errors as the summary method does.

summary_std_errors <- coef(summary_clottingML)[, "Std. Error"]
einfo <- info_clottingML(dispersion = summary_clottingML$dispersion)
all.equal(sqrt(diag(solve(einfo)))[1:4], summary_std_errors, tolerance = 1e-05)
## [1] TRUE

Another estimate of the standard errors results by the observed information, which is the negative Hessian matrix of the log-likelihood.

At least at the time of writting the current vignette, there appears to be no general implementation of the observed information for glm objects. I guess the reason for that is the dependence of the observed information on higher-order derivatives of the inverse link and variance functions, which are not readily available in base R.

enrichwith provides options for the enrichment of link-glm and family objects with such derivatives (see ?enrich.link-glm and ?enrich.family for details), and, based on those allows the enrichment of glm objects with a function to compute the observed information.

The observed and the expected information for the regression parameters coincide for GLMs with canonical link, like clottingML

oinfo <- info_clottingML(dispersion = summary_clottingML$dispersion, type = "observed")
all.equal(oinfo[1:4, 1:4], einfo[1:4, 1:4])
## [1] TRUE

which is not generally true for a glm with a non-canonical link, as seen below.

clottingML_log <- update(clottingML, family = Gamma("log"))
summary_clottingML_log <- summary(clottingML_log)
info_clottingML_log <- get_information_function(clottingML_log)
einfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "expected")
oinfo_log <- info_clottingML_log(dispersion = summary_clottingML_log$dispersion, type = "observed")
round(einfo_log, 3)
##                (Intercept) log(conc)     lot2 log(conc):lot2 dispersion
## (Intercept)        757.804  2508.115  378.902       1254.058       0.00
## log(conc)         2508.115  8971.520 1254.058       4485.760       0.00
## lot2               378.902  1254.058  378.902       1254.058       0.00
## log(conc):lot2    1254.058  4485.760 1254.058       4485.760       0.00
## dispersion           0.000     0.000    0.000          0.000   16078.16
## attr(,"coefficients")
##    (Intercept)      log(conc)           lot2 log(conc):lot2 
##     5.50322528    -0.60191618    -0.58447142     0.03448168 
## attr(,"dispersion")
## [1] 0.02375284
round(oinfo_log, 3)
##                (Intercept) log(conc)     lot2 log(conc):lot2 dispersion
## (Intercept)        757.804  2508.114  378.902       1254.057      0.000
## log(conc)         2508.114  9056.792 1254.057       4527.583     -0.041
## lot2               378.902  1254.057  378.902       1254.057      0.000
## log(conc):lot2    1254.057  4527.583 1254.057       4527.583     -0.018
## dispersion           0.000    -0.041    0.000         -0.018   7610.128
## attr(,"coefficients")
##    (Intercept)      log(conc)           lot2 log(conc):lot2 
##     5.50322528    -0.60191618    -0.58447142     0.03448168 
## attr(,"dispersion")
## [1] 0.02375284

Score tests

We can now use scores_clottingML and info_clottingML to carry out a score test to compare nested models.

If \(l_\psi(\psi, \lambda)\) is the gradient of the log-likelihood with respect to \(\psi\) evaluated at \(\psi\) and \(\lambda\), \(i^{\psi\psi}(\psi, \lambda)\) is the \((\psi, \psi)\) block of the inverse of the expected information matrix, and \(\hat\lambda_\psi\) is the maximum likelihood estimator of \(\lambda\) for fixed \(\psi\), then, assuming that the model is adequate \[ l_\psi(\psi, \hat\lambda_\psi)^\top i^{\psi\psi}(\psi,\hat\lambda_\psi) l_\psi(\psi, \hat\lambda_\psi) \] has an asymptotic \(\chi^2_{dim(\psi)}\) distribution.

The following code chunk computes the maximum likelihood estimates when the parameters for lot and log(conc):lot are fixed to zero (\(\hat\lambda\psi\) for \(\psi = 0\) above), along with the score and expected information matrix evaluated at them ($l(,_) and \(i(\psi,\hat\lambda_\psi)\), above).

clottingML_nested <- update(clottingML, . ~ log(conc))
enriched_clottingML_nested <- enrich(clottingML_nested, with = "mle of dispersion")
coef_full <- coef(clottingML)
coef_hypothesis <- structure(rep(0, length(coef_full)), names = names(coef_full))
coef_hypothesis_short <- coef(enriched_clottingML_nested, model = "mean")
coef_hypothesis[names(coef_hypothesis_short)] <- coef_hypothesis_short
disp_hypothesis <- coef(enriched_clottingML_nested, model = "dispersion")
scores <- scores_clottingML(coef_hypothesis, disp_hypothesis)
info <- info_clottingML(coef_hypothesis, disp_hypothesis)

The object enriched_clottingML_nested inherits from enriched_glm, glm and lm, and, as illustrated above, enrichwith provides a corresponding coef method to extract the estimates for the mean regression parameters and/or the estimate for the dispersion parameter.

The score statiistic is then

(score_statistic <- drop(scores%*%solve(info)%*%scores))
## [1] 17.15989

which gives a p-value of

pchisq(score_statistic, 2, lower.tail = FALSE)
## [1] 0.000187835

For comparison, the Wald statistic for the same hypothesis is

coef_full[3:4]%*%solve(solve(info)[3:4, 3:4])%*%coef_full[3:4]
##          [,1]
## [1,] 19.18963

and the log-likelihood ratio statistic is

(deviance(clottingML_nested) - deviance(clottingML))/disp_hypothesis
## dispersion 
##    17.6435

which is close to the score statistic

Simulating from glm objects at parameter values

enrichwith also provides the get_simulate_function method for glm or lm objects. The get_simulate_function computes a function to simulate response vectors at /arbitrary/ values of the model parameters, which can be useful when setting up simulation experiments and for various inferential procedures (e.g. indirect inference).

For example, the following code chunk simulates three response vectors at the maximum likelihood estimates of the parameters for clottingML.

simulate_clottingML <- get_simulate_function(clottingML)
simulate_clottingML(nsim = 3, seed = 123)
##        sim_1     sim_2     sim_3
## 1  119.99301 117.71976 124.77001
## 2   55.81195  53.03677  51.33030
## 3   37.29072  37.29519  39.83985
## 4   34.15268  32.39287  33.38394
## 5   30.02089  26.76743  27.99428
## 6   25.41892  24.63002  25.23102
## 7   20.50630  21.33982  20.69055
## 8   18.36304  19.61270  18.50062
## 9   19.39328  19.08656  17.67781
## 10  72.03656  72.99039  71.62121
## 11  33.36883  33.57407  33.34055
## 12  25.09187  24.91749  24.47527
## 13  20.87817  21.60344  23.25175
## 14  18.66966  16.86550  16.78601
## 15  16.35583  15.69062  16.01810
## 16  13.71021  13.65021  13.99123
## 17  12.32866  13.18895  12.59467
## 18  11.68437  11.25791  12.23057

The result is the same to what the simulate method returns

simulate(clottingML, nsim = 3, seed = 123)
##        sim_1     sim_2     sim_3
## 1  119.99301 117.71976 124.77001
## 2   55.81195  53.03677  51.33030
## 3   37.29072  37.29519  39.83985
## 4   34.15268  32.39287  33.38394
## 5   30.02089  26.76743  27.99428
## 6   25.41892  24.63002  25.23102
## 7   20.50630  21.33982  20.69055
## 8   18.36304  19.61270  18.50062
## 9   19.39328  19.08656  17.67781
## 10  72.03656  72.99039  71.62121
## 11  33.36883  33.57407  33.34055
## 12  25.09187  24.91749  24.47527
## 13  20.87817  21.60344  23.25175
## 14  18.66966  16.86550  16.78601
## 15  16.35583  15.69062  16.01810
## 16  13.71021  13.65021  13.99123
## 17  12.32866  13.18895  12.59467
## 18  11.68437  11.25791  12.23057

but simulate_clottingML can also be used to simulate at any given parameter value

coefficients <- c(0, 0.01, 0, 0.01)
dispersion <- 0.001
samples <- simulate_clottingML(coefficients = coefficients, dispersion = dispersion, nsim = 500000, seed = 123)

The empirical means and variances based on samples agree with the exact means and variances at coefficients and dispersion

means <- 1/(model.matrix(clottingML) %*% coefficients)
variances <- dispersion * means^2
max(abs(rowMeans(samples) - means))
## [1] 0.004616079
max(abs(apply(samples, 1, var) - variances))
## [1] 0.00792772

pmodel, dmodel, qmodel

enrichwith also provides the get_dmodel_function, get_pmodel_function and get_qmodel_function methods, which can be used to evaluate densities or probability mass functions, distribution functions, and quantile functions, respectively at arbitrary parameter values and data settings.

For example, the following code chunk computes the density at the observations in the clotting data set under then maximum likelihood fit

cML_dmodel <- get_dmodel_function(clottingML)
cML_dmodel()
##          1          2          3          4          5          6 
## 0.05114789 0.01729857 0.11264667 0.21780853 0.23240296 0.39469095 
##          7          8          9         10         11         12 
## 0.36529389 0.33731787 0.44320377 0.11009363 0.08136892 0.23566642 
##         13         14         15         16         17         18 
## 0.42776887 0.51483860 0.59720662 0.29329145 0.42214857 0.75181329 
## attr(,"coefficients")
##    (Intercept)      log(conc)           lot2 log(conc):lot2 
##   -0.016554382    0.015343115   -0.007354088    0.008256099 
## attr(,"dispersion")
##  dispersion 
## 0.001632971

We can also compute the densities at user-supplied parameter values

cML_dmodel(coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
##            1            2            3            4            5 
## 1.504831e-05 6.298754e-04 2.784455e-03 5.394485e-03 1.427986e-02 
##            6            7            8            9           10 
## 1.392561e-02 2.241253e-02 2.775377e-02 2.904231e-02 1.573806e-02 
##           11           12           13           14           15 
## 3.429823e-02 4.497386e-02 5.143328e-02 6.281910e-02 7.022318e-02 
##           16           17           18 
## 7.721225e-02 8.508033e-02 9.434798e-02 
## attr(,"coefficients")
## [1] -0.01  0.02 -0.01  0.00
## attr(,"dispersion")
## [1] 0.1

or even at different data points

new_data <- data.frame(conc = 5:10, time = 50:45, lot = factor(c(1, 1, 1, 2, 2, 2)))
cML_dmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
##          1          2          3          4          5          6 
## 0.02366298 0.01889204 0.01428215 0.02659079 0.02591742 0.02433105 
## attr(,"coefficients")
## [1] -0.01  0.02 -0.01  0.00
## attr(,"dispersion")
## [1] 0.1

We can also compute cumulative probabilities and quantile functions. So

cML_qmodel <- get_qmodel_function(clottingML)
cML_pmodel <- get_pmodel_function(clottingML)
probs <- cML_pmodel(data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
cML_qmodel(probs, data = new_data, coefficients = c(-0.01, 0.02, -0.01, 0), dispersion = 0.1)
##  1  2  3  4  5  6 
## 50 49 48 47 46 45 
## attr(,"coefficients")
## [1] -0.01  0.02 -0.01  0.00
## attr(,"dispersion")
## [1] 0.1

returns new_data$time.

For example the fitted densities when (conc, lot) is c(15, 1), c(15, 2), c(40, 1) or c(40, 2) are

new_data <- expand.grid(conc = c(15, 40),
                        lot = factor(1:2),
                        time = seq(0, 50, length = 100))
new_data$density <- cML_dmodel(new_data)
ggplot(data = new_data) +
    geom_line(aes(time, density)) + facet_grid(conc ~ lot) +
    theme_bw() +
    geom_vline(data = clotting[c(3, 6, 12, 15), ], aes(xintercept = time), lty = 2)

The dashed vertical lines are the observed values of time for the (conc, lot) combinations in the figure.

enriched_glm

enrichwith provides a wrapper to the glm interface that results directly in enriched_glm objects that carry all the components described above. For example,

enriched_clottingML <- enriched_glm(time ~ log(conc) * lot, family = Gamma, data = clotting)
names(enriched_clottingML$auxiliary_functions)
## [1] "score"       "information" "bias"        "simulate"    "dmodel"     
## [6] "pmodel"      "qmodel"
enriched_clottingML$score_mle
##    (Intercept)      log(conc)           lot2 log(conc):lot2     dispersion 
##   1.227818e-05   2.253249e-05   6.113431e-07   1.548145e-06   1.647695e-09 
## attr(,"coefficients")
##    (Intercept)      log(conc)           lot2 log(conc):lot2 
##   -0.016554382    0.015343115   -0.007354088    0.008256099 
## attr(,"dispersion")
##  dispersion 
## 0.001632971
enriched_clottingML$expected_information_mle
##                (Intercept) log(conc)     lot2 log(conc):lot2 dispersion
## (Intercept)       19327145  40768701  5060148       10887183          0
## log(conc)         40768701  98046714 10887183       26762283          0
## lot2               5060148  10887183  5060148       10887183          0
## log(conc):lot2    10887183  26762283 10887183       26762283          0
## dispersion               0         0        0              0    3376930
## attr(,"coefficients")
##    (Intercept)      log(conc)           lot2 log(conc):lot2 
##   -0.016554382    0.015343115   -0.007354088    0.008256099 
## attr(,"dispersion")
##  dispersion 
## 0.001632971
enriched_clottingML$observed_information_mle
##                 (Intercept)    log(conc)         lot2 log(conc):lot2
## (Intercept)    1.932714e+07 4.076870e+07 5.060148e+06   1.088718e+07
## log(conc)      4.076870e+07 9.804671e+07 1.088718e+07   2.676228e+07
## lot2           5.060148e+06 1.088718e+07 5.060148e+06   1.088718e+07
## log(conc):lot2 1.088718e+07 2.676228e+07 1.088718e+07   2.676228e+07
## dispersion     7.518920e-03 1.379847e-02 3.743748e-04   9.480544e-04
##                  dispersion
## (Intercept)    7.518920e-03
## log(conc)      1.379847e-02
## lot2           3.743748e-04
## log(conc):lot2 9.480544e-04
## dispersion     3.376930e+06
## attr(,"coefficients")
##    (Intercept)      log(conc)           lot2 log(conc):lot2 
##   -0.016554382    0.015343115   -0.007354088    0.008256099 
## attr(,"dispersion")
##  dispersion 
## 0.001632971
enriched_clottingML$bias_mle
##    (Intercept)      log(conc)           lot2 log(conc):lot2     dispersion 
##   1.335333e-05  -1.616492e-06   9.431631e-06  -1.339775e-06  -3.627343e-04 
## attr(,"coefficients")
##    (Intercept)      log(conc)           lot2 log(conc):lot2 
##   -0.016554382    0.015343115   -0.007354088    0.008256099 
## attr(,"dispersion")
##  dispersion 
## 0.001632971
enriched_clottingML$dispersion_mle
##  dispersion 
## 0.001632971

References

McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. London: Chapman; Hall.