stgam
The stgam package (Comber, Harris, and Brunsdon 2024) provides a wrapper around mgcv functionality (S. Wood 2021) to support space-time analysis, with a focus on prediction but, perhaps more importantly, on inference (process understanding). The latter is commonly the objective of geographical analysis, which is often concerned with how processes and statistical relationships vary over space and/or time.
This vignette illustrates an analysis workflow using GAM smooths that quantifies and models variations over space and time through a house price case study. It highlights the importance of investigating spatial and/or temporal variability before constructing space-time Generalized Additive Models (GAMs).
You should load the stgam package and the data. The data are a sample of terraced house sales for 2018 to 2024, extracted from Bin et al. (2024) and located via the ONS lookup tables that link UK postcodes to location (actually the geometric centroids of the postcode areas).
library(cols4all)
library(dplyr)
library(ggplot2)
library(tidyr)
library(sf)
library(cowplot)
# load the package and data (mgcv is attached explicitly, as gam() is called directly below)
library(mgcv)
library(stgam)
data("hp_data")
data("lb")
You should examine the help for the datasets, and especially the variables that hp_data contains:
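For example, from the R console:
# view the dataset documentation
?hp_data
?lb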
The first stage is simply to examine the data, particularly the continuous variables over space and time. The code below examines the data before undertaking some initial investigations.
# examine what is loaded
hp_data
#> # A tibble: 1,888 × 13
#> price priceper tfa dot yot beds type cef pef ageb lad X Y
#> <dbl> <dbl> <dbl> <date> <dbl> <int> <chr> <int> <int> <chr> <chr> <int> <int>
#> 1 1000 7463. 134 2023-02-22 2023 9 T 60 82 1930-1949 E09000014 531674 188707
#> 2 520 7879. 66 2022-05-27 2022 3 T 57 74 1930-1949 E09000031 536888 189269
#> 3 325 5078. 64 2021-01-15 2021 4 T 50 82 1950-1966 E09000016 549763 190717
#> 4 300 2400 125 2020-01-17 2020 5 T 64 85 1900-1929 E09000031 538646 186513
#> 5 585 7222. 81 2020-11-13 2020 4 T 75 88 1983-1990 E09000027 514710 172535
#> 6 618. 6786. 91 2023-05-18 2023 4 T 61 84 1950-1966 E09000008 533233 170230
#> 7 1183. 12117. 97.6 2022-04-22 2022 5 T 63 66 1900-1929 E09000027 516471 174772
#> 8 800 6154. 130 2023-09-28 2023 7 T 60 80 1900-1929 E09000010 532032 194831
#> 9 560 6022. 93 2022-12-21 2022 5 T 47 79 1930-1949 E09000029 526674 164298
#> 10 300 3750 80 2019-09-16 2019 4 T 71 86 1950-1966 E09000004 548369 179349
#> # ℹ 1,878 more rows
The analyses will model price per unit area (the priceper variable) in units of pounds per square metre. GAM smooth models will be constructed that regress this against other variables in the data, including location and time. The print out of the first 10 records above shows that the dataset contains a time variable (dot) in date format, as well as location in metres (X and Y from the OSGB projection). It is often more useful to represent time as continuous scalar values and to have location in kilometres rather than metres. The code below uses the earliest date in the dataset to create a variable called days to represent time (here 100s of days since the earliest date) and rescales the locational data, while retaining the original coordinates for mapping:
hp_data <-
hp_data |>
# create continuous time variable
mutate(days = as.numeric(dot - min(dot))/100) |>
relocate(days, .after = dot) |>
# scale location and retain original coordinates
mutate(Xo = X, Yo = Y) |>
mutate(X = X/1000, Y = Y/1000)
Again this can be examined:
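For instance, a quick look at the new days variable and the rescaled coordinates:
# inspect the new and rescaled variables
hp_data |> select(dot, days, X, Y, Xo, Yo) |> head()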
The data and the target variable can be mapped with the London borough spatial layer (lb), as in the code snippet below. Note the use of the Xo and Yo coordinates in the mapping with ggplot. The resulting map shows the spatial distribution of priceper and also indicates that there were no terraced house sales in the centre of London (in the City of London local authority district).
# map the data layers
lb |>
ggplot() + geom_sf() +
geom_point(data = hp_data, aes(x = Xo, y = Yo, col = priceper)) +
scale_color_viridis_c(option = "magma") +
theme_bw() + xlab("") + ylab("")
Figure: the spatial distribution of the priceper variable (house price per square metre in £s).

In the analyses described below, location is measured in kilometres, and the lb spatial dataset of London boroughs will be used to provide spatial context to some of the results. To do this the coordinates of the lb spatial data layer need to be rescaled in the same way as the X and Y variables were in hp_data, to aid model construction. The code below creates a rescaled version of lb projected in kilometres (remember the original data can always be reloaded using data(lb)).
# transform to km
lb <- st_transform(lb, pipeline = "+proj=pipeline +step +proj=unitconvert +xy_out=km")
# remove the projection to avoid confusing ggplot
st_crs(lb) <- NA
The correlations and distributions of the numeric variables can be examined. It is important to establish that any regression of the proposed target variable against the predictor variables may yield something interesting and sensible, and just as importantly to identify any variables that might be a problem when trying to make a model. For example, in this case it is good that the proposed target variable (priceper) has a healthy tail.
The code below generates boxplots and then density histograms.
# boxplots
hp_data |>
select(lad, price, priceper, tfa, days, beds, cef, pef, X, Y) |>
pivot_longer(-lad) |>
ggplot(aes(x = value)) +
geom_boxplot(fill="dodgerblue") +
facet_wrap(~name, scales = "free") +
theme_bw()
# histograms
hp_data |>
select(lad, price, priceper, tfa, days, beds, cef, pef, X, Y) |>
pivot_longer(-lad) |>
ggplot(aes(x = value)) +
geom_histogram(aes(y=after_stat(density)),bins = 30,
fill="tomato", col="white") +
geom_density(alpha=.5, fill="#FF6666") +
facet_wrap(~name, scales = "free") +
theme_bw()
Correlations are examined for similar reasons: to check for issues that may be a problem later in the analysis workflow. Here the aim is to establish that there are reasonable correlations with the target variable and no collinearity amongst the predictor variables, when they are examined without considering time or space (i.e. examined globally).
# correlations
hp_data |>
select(priceper, price, tfa, beds, cef, pef) |>
cor() |> round(3)
#> priceper price tfa beds cef pef
#> priceper 1.000 0.723 0.308 0.179 -0.118 -0.220
#> price 0.723 1.000 0.791 0.549 -0.082 -0.216
#> tfa 0.308 0.791 1.000 0.780 -0.034 -0.202
#> beds 0.179 0.549 0.780 1.000 -0.079 -0.170
#> cef -0.118 -0.082 -0.034 -0.079 1.000 0.388
#> pef -0.220 -0.216 -0.202 -0.170 0.388 1.000
Finally, thinking about space-time analysis of your dataset: as a general heuristic, any dataset used for space-time analysis should have a minimum of about 100 locations and a minimum of 50 observation time periods. You should consider this when assembling your data.
The investigations in the previous section were all concerned with establishing the viability of the proposed analysis, using standard exploratory data analysis approaches to examine distributions and correlations. In this section these are extended to test for the presence of variability in the data, but now over time, over space and in space-time. This is done by constructing a series of regression models of different forms and with different parameters. Tools for constructing Generalized Additive Models (GAMs) using the mgcv package (S. Wood 2021) are introduced, with a particular focus on using GAM smooths or splines to explore variations.
As an initial baseline model, the code below creates a standard OLS regression model of priceper using the lm function. The model fit is weak, but the model summary indicates that some of the variables (pef and beds) are significant predictors of the target variable. The analysis of variance (ANOVA), calculated using the anova function, shows how much variance in priceper is explained by each predictor and whether that contribution is statistically significant. Here, higher values of the F value test statistic provide stronger evidence that the predictor is useful.

In this case there is evidence that two of the predictors (pef and beds) are statistically significant, and the F-values and associated p-values (Pr(>F)) show strong evidence that each explains some of the variation in priceper.
# an OLS model
m_ols <- lm(priceper ~ cef + pef + beds, data = hp_data)
summary(m_ols)
#>
#> Call:
#> lm(formula = priceper ~ cef + pef + beds, data = hp_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7057.6 -1752.2 -554.9 1220.4 23271.9
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10751.489 690.879 15.562 < 2e-16 ***
#> cef -9.633 6.325 -1.523 0.128
#> pef -59.626 8.047 -7.410 1.90e-13 ***
#> beds 298.329 46.327 6.440 1.52e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2740 on 1884 degrees of freedom
#> Multiple R-squared: 0.06997, Adjusted R-squared: 0.06849
#> F-statistic: 47.25 on 3 and 1884 DF, p-value: < 2.2e-16
anova(m_ols)
#> Analysis of Variance Table
#>
#> Response: priceper
#> Df Sum Sq Mean Sq F value Pr(>F)
#> cef 1 2.1265e+08 212648515 28.323 1.149e-07 ***
#> pef 1 5.4018e+08 540183718 71.948 < 2.2e-16 ***
#> beds 1 3.1134e+08 311344258 41.468 1.516e-10 ***
#> Residuals 1884 1.4145e+10 7508013
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A final and quick trick to confirm the presence of space-time effects is to use space or time as a dummy variable interacting with one of the variables. Here this is done with tfa (total floor area), which unsurprisingly is highly correlated with the target variable (priceper):
# Dummy with LADs
m_dummy1 <- lm(priceper ~ tfa:lad, data = hp_data)
summary(m_dummy1)
anova(m_dummy1)
This shows that there are important and significant differences between the first London borough (which has a lad value equal to “E09000001”), which is not shown in the model summary, and many of the rest that are, suggesting some important spatial differences in the interaction of floor area and price in different boroughs. This is confirmed by the F values in the ANOVA.
Similarly, an even larger effect with time is demonstrated when days is used as the dummy variable:
# Dummy with Time
m_dummy2 <- lm(priceper ~ tfa:days, data = hp_data)
summary(m_dummy2)
anova(m_dummy2)
Thus, despite the models being globally weak with low \(R^2\) values, there is evidence of spatial and temporal interactions with the target variable that warrant further, more formal exploration with respect to potential space-time trends. GAMs with smooths offer a route to investigate these.
GAMs with splines (or smooths; the terms are used interchangeably) can be used to explore and quantify spatial and temporal variations. However, before considering these aspects, it is important to outline what splines do.
Splines help to fit a (smooth) curve through a cloud of messy data points. They generate flexible fits (curves) able to adapt to the shape of the data. Because of this, they are able to describe the relationships between variables better than a straight line. For example, consider the scatterplot of some simulated data below: it would be difficult to fit a straight line through it.
# create simulated data
set.seed(12)
x <- runif(500)
mu <- sin(2 * (4 * x - 2)) + 2 * exp(-(16 ^ 2) * ((x - .5) ^ 2))
y <- rnorm(500, mu, .3)
# plot x and y
ggplot() +
geom_point(aes(x,y)) +
theme_bw()
Figure: scatterplot of the simulated x and y data.

The code below uses a GAM spline to determine the relationship between x and y without having to pre-specify any particular form (e.g. quadratic, exponential, etc.). The wiggliness of the fit is determined automatically by the s function. This is done in different ways depending on the type of smooth that is specified (thin plate regression smooths are the default in s from the mgcv package). Effectively, the smooth splits the data into chunks and fits piecewise relationships, rather than fitting a single straight line as in a standard linear regression.
# a GAM illustration with a spline using s
gam_s_example <- gam(y~s(x))
# extract the smooth fit
y.s <- gam_s_example$fitted.values
# plot
ggplot() +
geom_point(aes(x,y), col = "grey") +
geom_line(aes(x, y = y.s), lwd = 1) +
theme_bw()
Figure: the simulated x and y data with a GAM spline fitted.

The purpose of this diversion into splines and smooths is to illustrate how they model different relationships (slopes) with y at different locations within the variable feature space (the x axis in the above example). In this way spline smooth curves can be used to capture non-linear relationships in attribute space (here the relationship of x with y). Importantly, in the context of space-time varying coefficient modelling with stgam, the spline can be used to model how the relationship between x and y varies with respect to time or location if the smooth is also parameterised with those.
It would be useful to examine how the target variable, priceper, varies over time. One way of doing this is to extend the use of GAM smooths illustrated above. The code snippet below models the variation in priceper, but this time using time (the days variable).
# the first GAM
gam.1 <- gam(priceper~s(days), data = hp_data)
summary(gam.1)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ s(days)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6832.48 64.11 106.6 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(days) 3.798 4.708 15.76 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.0374 Deviance explained = 3.93%
#> GCV = 7.7786e+06 Scale est. = 7.7588e+06 n = 1888
The smooth can be investigated by plotting it against time. Note the use of the actual date variable dot rather than days in the code below, to give a friendly x-axis in the plot, and the use of the predict function to extract the standard errors:
# create a data frame with x, predicted y, standard error
x <- hp_data$dot
y <- gam.1$fitted.values
se <- predict(gam.1, se = TRUE, hp_data)$se.fit
u <- y+se
l <- y-se
df <- data.frame(x, y, u, l)
# plot!
ggplot(df, aes(x, y, ymin = l, ymax = u)) +
geom_ribbon(fill = "lightblue") +
geom_line() +
theme_bw() +
xlab("Date") + ylab("priceper")
This shows variation in the modelled relationship of priceper with time. The code above extracts the predicted values (\(\hat{y}\)) of priceper from the model and plots them against time, but using the actual date rather than the days continuous variable. The plot shows how the relationship of the target variable varies with time, potentially reflecting effects of the pandemic, with increases in the value of homes with their own gardens and outdoor spaces (even small ones!) over apartments or flats. This provides confirmation of the potential suitability of a temporal analysis of priceper.
The spatially varying trends can be examined in a similar way, again using a standard s spline with location.
# the second GAM
gam.2 <- gam(priceper~s(X,Y), data = hp_data)
summary(gam.2)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ s(X, Y)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6832.48 43.62 156.6 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y) 27.47 28.85 81.52 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.554 Deviance explained = 56.1%
#> GCV = 3.6469e+06 Scale est. = 3.5919e+06 n = 1888
plot(gam.2, asp = 1)
Figure: the mgcv plot of the spatial smooth.

Notice that the model fit (\(R^2\)) has increased substantially from the previous models. This suggests some important spatial effects, and again the smooth can be plotted, but this time it is a 2-dimensional surface rather than a line. However, it would be useful to visualise this as a surface rather than just over the hp_data point locations. The code below uses the predict function to model the relationship over a hexagonal grid (a surface) covering the extent of the London boroughs created from lb, which contains X and Y attributes giving the location of each hexagonal cell centroid.
# 1. create a grid object over the study area from the lb data
l_grid <-
st_make_grid(lb, square=FALSE,n=50) |>
st_sf() |>
st_join(lb) |>
filter(!is.na(name))
# rename the geometry, sort row names
st_geometry(l_grid) = "geometry"
rownames(l_grid) <- 1:nrow(l_grid)
# create and add coordinates X and Y
coords <- l_grid |> st_centroid() |> st_coordinates()
#> Warning: st_centroid assumes attributes are constant over geometries
l_grid <- l_grid |> bind_cols(coords)
You may wish to inspect this object before continuing with the mapping procedure:
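A quick check of the grid is simply:
# inspect the first few rows of the grid
head(l_grid)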
# 2. predict over the grid
yhat <- predict(gam.2, newdata = l_grid)
l_grid |> mutate(yhat = yhat) |>
# 3. and plot
ggplot() +
geom_sf(aes(fill = yhat), col = NA) +
# adjust default shading
scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "priceper") +
# add context
geom_sf(data = lb, fill = NA) +
# apply and modify plot theme
theme_bw() +
theme(legend.position = "bottom",
legend.key.width = unit(1, "cm"))
This models the spatial trends (variations over space) of the target variable. The map indicates high values in the centre of the study area, as in the initial map of the data earlier, and again provides confirmatory evidence that some spatial modelling of priceper is appropriate.
The spline approach can be extended to model the target variable in space-time by including the location and time variables in the smooth. However, a slightly different set of considerations is required because of the mixing of space and time in the mgcv smooths. Consider the code snippet below for the space-time smooth: it specifies a smooth using the t2 function rather than s, and contains other parameters in the d argument.
# the third GAM
gam.3 <- gam(priceper~t2(X,Y,days, d = c(2,1)), data = hp_data)
summary(gam.3)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ t2(X, Y, days, d = c(2, 1))
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6832.48 42.57 160.5 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> t2(X,Y,days) 47.87 52.07 40.12 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.576 Deviance explained = 58.6%
#> GCV = 3.5118e+06 Scale est. = 3.4209e+06 n = 1888
It is important to consider the difference between s(X,Y,days) (which would be a continuation of the format above, but has not been run) and t2(X,Y,days,d=c(2,1)) before examining the resultant GAM (gam.3). The spline functions s() and t2() can both model smooth interactions between multiple variables, but they handle dimension scaling, marginal smoothness and basis construction in different ways.
s() is an isotropic smooth. It is used when all variables are on the same scale and have similar smoothness behaviours (like X and Y). By default it uses a thin plate regression spline (bs = "tp") and a single basis to model the attribute space. The formulation s(X,Y,days) would therefore use a single basis to model a smooth in 3D space, under the assumption that all the variables (X, Y and days) contribute in the same way and at the same scale. This is fine if the variables are numeric, continuous and in comparable units, such as percentages, but not for space and time.
t2() is a tensor product (TP) smooth. These are used when variables are on different scales or represent different kinds of effects (like X, Y and days). It builds a smooth using separate basis functions for each variable and then combines them. The formulation t2(X,Y,days,d=c(2,1)) includes the d parameter that specifies the dimensions of each basis. Here a 2D basis is specified for X and Y (location) and a 1D basis for days (time). The tensor product smooth is constructed by building (marginal) smooths for each and taking the tensor product of those to form the joint smooth. It is useful when the variables to be included in the smooth have different scales or units (e.g., X and Y are coordinates, days is time) and different degrees of smoothness are expected along each variable.
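To make the contrast concrete, the two alternative specifications can be written as below; the s() form is shown for comparison only and, as noted above, has not been run here:
# isotropic: a single basis treating X, Y and days as if on a common scale (not run)
# gam(priceper ~ s(X, Y, days), data = hp_data)
# tensor product: a 2D spatial basis combined with a 1D temporal basis (as in gam.3)
# gam(priceper ~ t2(X, Y, days, d = c(2, 1)), data = hp_data)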
Thus, in this case a TP smooth is specified because space and time are combined. The results can be examined using the mgcv plot function, which shows the spatial distributions over 9 discretised time chunks (~310 days).
However, it is also possible to slice and dice the time variable in other ways. The code snippet below illustrates how this can be done for 365-day periods using the calculate_vcs function in the stgam package. If predictor variables are included in the terms parameter then the function returns coefficient estimates (“b_”) and standard errors (“se_”) for each of the terms specified. In this case, however, we are just interested in the predicted value of the target variable, yhat. The code below uses the grid object created above (l_grid) as a spatial framework to hold the predictions of the priceper variable over these discrete time periods.
# 1. create time intervals (see the creation of days variable above)
pred_days = seq(365, 2555, 365)/100
# 2. create coefficient estimates for each time period (n = 7)
res_out <- NULL
for (i in pred_days){
res.i <- calculate_vcs(input_data = l_grid |> mutate(days = i),
mgcv_model = gam.3,
terms = NULL)
res_out <- cbind(res_out, res.i$yhat)
}
# 3. name with years and join to the grid
colnames(res_out) <- paste0("Y_", 2018:2024)
l_grid |> cbind(res_out) |>
# select the variables and pivot longer
select(-name, -lad, -X, -Y) |>
pivot_longer(-geometry) |>
# make the new days object a factor (to enforce plotting order)
mutate(name = factor(name, levels = paste0("Y_", 2018:2024))) |>
# 4. and plot
ggplot() +
geom_sf(aes(fill = value), col = NA) +
# adjust default shading
scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "Predicted \n'priceper'") +
# facet
facet_wrap(~name, ncol = 3) +
# apply and modify plot theme
theme_bw() +
theme(
legend.position = "inside",
legend.direction = "horizontal",
legend.position.inside = c(0.7, 0.15),
legend.text=element_text(size=10),
legend.title=element_text(size=12),
strip.background = element_rect(fill="white", colour = "white"),
strip.text = element_text(size = 8, margin = margin(b=4)),
legend.key.width = unit(1.5, "cm"),
axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
This time series of maps shows that, without any explanatory variables, the trend in priceper varies spatially and changes in intensity over time (the increase and decrease in values over space), with limited variation in spatial pattern over time.
The formula used to construct the gam.3 model in the previous subsection constructed a smooth that implicitly combined space and time, under an assumption that the variations in space and time of priceper interact with each other. However, this may not be the case. More generally, spatial and temporal dependencies might not be expected to interact in data for a relatively small region, whose observations might be expected to be subject to the same socio-economic or environmental pressures. In this case, any space and time effects might be expected to be independent of each other, whereas in a national study any space and time effects might be expected to interact more strongly.
It is possible to use a different construction involving separate smooths for space and time, with the s() splines, to avoid the assumption of interacting space and time effects:
# the fourth GAM
gam.4 <- gam(priceper~s(X,Y) + s(days), data = hp_data)
summary(gam.4)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ s(X, Y) + s(days)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6832.48 42.45 161 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y) 27.509 28.861 83.76 <2e-16 ***
#> s(days) 3.536 4.392 23.88 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.578 Deviance explained = 58.5%
#> GCV = 3.4606e+06 Scale est. = 3.4019e+06 n = 1888
Again there is evidence of spatial and temporal trends in the data, and these can be visualised using the plot.gam function in mgcv (called with just plot).
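For example (here pages = 1 simply places both smooths on a single page):
# plot the spatial and temporal smooths from gam.4
plot(gam.4, pages = 1)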
Figure: the mgcv plot of the spatial and temporal smooths.

Note that the space-time predictions could be generated in the same way as for the gam.3 model above, using the location (X and Y) values in l_grid and the time slices in pred_days.
Up until now, only the response (target) variable, priceper, has been examined for variations in time and space, using GAM smooths specified in different ways. This was to explore the space-time variability of the variable, but also to introduce GAM smooths. The code below includes the pef predictor variable in a GAM model as a fixed parametric term (i.e. in the same way as in the OLS model created above).
# the fifth GAM
gam.5 <- gam(priceper~s(X,Y) + s(days) + pef, data = hp_data)
summary(gam.5)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ s(X, Y) + s(days) + pef
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8826.648 408.776 21.593 < 2e-16 ***
#> pef -24.802 5.057 -4.905 1.02e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y) 27.507 28.861 77.85 <2e-16 ***
#> s(days) 3.473 4.316 24.89 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.583 Deviance explained = 59%
#> GCV = 3.4216e+06 Scale est. = 3.3618e+06 n = 1888
The model summary indicates that pef is a significant predictor of priceper and improves the model fit. However, this raises a key question: should predictor variables be included in the model in parametric form or in a smooth? This is addressed in the next section.

As before, the smooths can be plotted:
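For example:
# plot the spatial and temporal smooths from gam.5
plot(gam.5, pages = 1)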
This section has introduced GAMs with smooths (splines) as a way of investigating any space-time effects present in the data. The mechanics of smooths were described and illustrated, and a brief introduction to different smooth forms was provided. Variations in the response variable (priceper) in time, space and space-time were explored through elementary model fitting and by plotting the smooth graphical trends. In this case, effects in space and time are present in the priceper response variable, and it looks like there is a universal spatial trend at all times and a universal temporal trend everywhere. The final GAM model included an explanatory variable (pef) which added some explanatory power, but this was not examined with respect to space or time (i.e. in smooths). This is done in the next section.
Such investigations are an important initial step. They provide evidence of space-time variations in the response variable - i.e. whether the effects in space and time are present in the data - and can help determine which predictor variables may be of further interest. They guide subsequent analysis and avoid making assumptions about the presence of space-time interactions, for example by plugging every predictor variable into a space-time smooth of some kind.
The previous section examined variability of the response variable in time, space and space-time (the latter in 2 different ways). In this section these investigations are extended to consider the variation of the response variable (priceper) with the pef predictor variable, but this time with the Intercept as an addressable term in the model.

Consider the gam.5 model created above:
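Its formula can be inspected directly:
# show the model formula
formula(gam.5)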
This includes the terms s(X,Y) + s(days). These model the spatially varying and temporally varying Intercept, plus error. Note that if this was a spatial model (without time), then just s(X,Y) would be included for the Intercept, and similarly just s(days) for a purely temporal model.
The gam.5 model can be reformulated as follows with an addressable Intercept term:
gam.5.new <- gam(priceper ~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) + pef,
data = hp_data |> mutate(Intercept = 1))
summary(gam.5.new)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days,
#> by = Intercept) + pef
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 8826.648 408.776 21.593 < 2e-16 ***
#> pef -24.802 5.057 -4.905 1.02e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 27.507 28.861 77.85 <2e-16 ***
#> s(days):Intercept 3.473 4.316 24.89 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.583 Deviance explained = 94%
#> GCV = 3.4216e+06 Scale est. = 3.3618e+06 n = 1888
The model summary now includes Intercept as a parametric term (rather than (Intercept)), with s(X,Y):Intercept and s(days):Intercept as smooth terms rather than s(X,Y) and s(days), but the values in both summaries are all the same.
The formulas of subsequent models in this vignette include the Intercept as an addressable term. The code below creates this variable in the input data:
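Following the pattern used for gam.5.new above, this is simply:
# add an addressable Intercept term to the data
hp_data <- hp_data |> mutate(Intercept = 1)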
The code below specifies a GAM regression model with smooths for space and time as in the previous section, but now with the pef predictor variable included in parametric form and in a temporal smooth with days. The model suggests, in this case, that it is important to model this relationship over time.
gam.t <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) +
pef + s(days, by = pef),
data = hp_data)
summary(gam.t)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days,
#> by = Intercept) + pef + s(days, by = pef)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 8738.828 410.218 21.303 < 2e-16 ***
#> pef -11.846 2.538 -4.668 3.27e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 27.526 28.864 78.064 < 2e-16 ***
#> s(days):Intercept 3.507 4.358 4.389 0.00119 **
#> s(days):pef 1.500 1.500 6.255 0.00294 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 49/50
#> R-sq.(adj) = 0.584 Deviance explained = 94%
#> GCV = 3.4159e+06 Scale est. = 3.3544e+06 n = 1888
The summary of this model indicates that the relationship of pef with priceper changes over time and has a strong linear negative trend (the smooth s(days):pef is significant, as is the pef parametric term). The nature of the temporally varying coefficients can be examined using the calculate_vcs function in the stgam package. This returns a tibble with coefficient estimates (“b_”) and standard errors (“se_”) for each of the terms specified.
vcs <- calculate_vcs(input_data = hp_data, mgcv_model = gam.t, terms = c("Intercept", "pef"))
head(vcs)
#> # A tibble: 6 × 22
#> price priceper tfa dot days yot beds type cef pef ageb lad X Y Xo Yo Intercept
#> <dbl> <dbl> <dbl> <date> <dbl> <dbl> <int> <chr> <int> <int> <chr> <chr> <dbl> <dbl> <int> <int> <dbl>
#> 1 1000 7463. 134 2023-02-22 18.8 2023 9 T 60 82 1930-19… E090… 532. 189. 531674 188707 1
#> 2 520 7879. 66 2022-05-27 16.0 2022 3 T 57 74 1930-19… E090… 537. 189. 536888 189269 1
#> 3 325 5078. 64 2021-01-15 11.1 2021 4 T 50 82 1950-19… E090… 550. 191. 549763 190717 1
#> 4 300 2400 125 2020-01-17 7.44 2020 5 T 64 85 1900-19… E090… 539. 187. 538646 186513 1
#> 5 585 7222. 81 2020-11-13 10.4 2020 4 T 75 88 1983-19… E090… 515. 173. 514710 172535 1
#> 6 618. 6786. 91 2023-05-18 19.6 2023 4 T 61 84 1950-19… E090… 533. 170. 533233 170230 1
#> # ℹ 5 more variables: b_Intercept <dbl>, se_Intercept <dbl>, b_pef <dbl>, se_pef <dbl>, yhat <dbl[1d]>
The temporal variation of the relationship of pef with the target variable can be plotted, and in this case the linear trend is confirmed: the negative relationship of pef with the target variable strengthens (the coefficient decreases) linearly over time.
vcs |>
mutate(u = b_pef + se_pef,
l = b_pef - se_pef) |>
ggplot(aes(x = dot, y = b_pef, ymin = l, ymax = u)) +
geom_ribbon(fill = "lightblue") +
geom_line() +
theme_bw() +
xlab("Date") + ylab("pef")
Figure: the temporally varying coefficient estimates for pef.

The spatial variation in the predictor variable relationships with the target variable can also be examined. The code below specifies pef within a spatial smooth, but again with a spatially and temporally varying intercept. Notice how pef is included both in the smooth and as a parametric (global) term.
gam.s <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) +
pef + s(X,Y, by = pef),
data = hp_data)
summary(gam.s)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days,
#> by = Intercept) + pef + s(X, Y, by = pef)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 8229.6 442.6 18.593 <2e-16 ***
#> pef -111.8 168.1 -0.665 0.506
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 29.000 29.000 5.742 < 2e-16 ***
#> s(days):Intercept 3.536 4.391 24.374 < 2e-16 ***
#> s(X,Y):pef 21.034 25.628 2.067 0.00122 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 69/70
#> R-sq.(adj) = 0.594 Deviance explained = 94.2%
#> GCV = 3.3672e+06 Scale est. = 3.269e+06 n = 1888
Here it is evident that pef varies significantly over space (see s(X,Y):pef) but is not significant as a global fixed term. That is, the global slope coefficient estimate is not significantly different from zero, but it is significant when allowed to vary over space, with the Intercept varying over space and time. This finding could be related to the age of the houses, which were built in clusters in different locations.
Again the varying coefficients can be extracted from the model for the predictor variable, and mapped both at observation locations and over the hexagonal grid if a time period for the Intercept smooth in gam.s is specified:
# 1.over observation locations
vcs <- calculate_vcs(input_data = hp_data,
mgcv_model = gam.s,
terms = c("Intercept", "pef"))
tit <-expression(paste(""*beta[`pef`]*""))
p1 <-
ggplot() + geom_sf(data = lb, col = "lightgrey") +
geom_point(data = vcs, aes(x = X, y = Y, colour = b_pef), alpha = 1) +
scale_colour_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
theme_bw() +
theme(legend.position = "bottom",
legend.key.width = unit(1, "cm")) +
xlab("") + ylab("")
# 2. over grid - recall it needs an intercept term and a days value!
vcs <- calculate_vcs(input_data = l_grid |> mutate(Intercept = 1, days = mean(hp_data$days)),
mgcv_model = gam.s,
terms = c("Intercept", "pef"))
p2 <-
ggplot() +
geom_sf(data = vcs, aes(fill = b_pef), col = NA) +
scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
theme_bw()+
theme(legend.position = "bottom",
legend.key.width = unit(1, "cm")) +
xlab("") + ylab("")
plot_grid(p1, p2)
Figure: the spatially varying coefficient estimates for pef over the original observations and the hexagonal grid.

Here the relationship of pef, the Potential energy efficiency rating, varies over space, with a distinct inflection from negative values in the centre of the study area to positive ones in outer regions.
It is also possible to examine the space-time dependencies between the target variable and the predictor variables. The code below specifies a GAM with a space-time smooth for the pef variable and a spatially and temporally varying intercept. As before, a TP smooth is used and specified with the dimensions of each basis (a 2D basis for X and Y; a 1D basis for days), as different degrees of smoothness are expected across space and time. In this case, the pef variable is significant globally and its relationship also varies significantly over space and time (t2(X,Y,days):pef).
gam.st1 <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) +
pef + t2(X,Y,days, d= c(2,1), by = pef),
data = hp_data)
summary(gam.st1)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days,
#> by = Intercept) + pef + t2(X, Y, days, d = c(2, 1), by = pef)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 8686.866 411.590 21.106 < 2e-16 ***
#> pef -10.827 2.262 -4.786 1.83e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 27.514 28.862 65.183 < 2e-16 ***
#> s(days):Intercept 3.519 4.372 4.383 0.00119 **
#> t2(X,Y,days):pef 5.555 5.556 5.862 1.63e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 164/165
#> R-sq.(adj) = 0.584 Deviance explained = 94%
#> GCV = 3.4246e+06 Scale est. = 3.3556e+06 n = 1888
When the space-time varying coefficients are examined (see below), the increasingly negative relationship of pef with the target variable over time is evident, and the spatial distributions generally indicate a weaker relationship with the target variable in the east of the study area and an increasingly negative one to the west.
# calculate the varying coefficient estimates
vcs <- calculate_vcs(input_data = hp_data, mgcv_model = gam.st1, terms = c("Intercept", "pef"))
# temporal trends
p_time <-
vcs |>
select(dot, b_Intercept, b_pef) |>
pivot_longer(-dot) |>
mutate(name = recode(name,
"b_Intercept" = '""*beta[Intercept]',
"b_pef" = '""*beta[pef]')) |>
ggplot(aes(x = dot, y = value)) +
geom_point(alpha = 0.1) +
geom_smooth() +
facet_wrap(~name, labeller = label_parsed, scale = "free", ncol = 1) +
theme_bw() + xlab("Year") + ylab("")
# spatial trends
tit <-expression(paste(""*beta[`Intercept`]*""))
p_sp1 <-
ggplot() + geom_sf(data = lb, col = "lightgrey") +
geom_point(data = vcs, aes(x = X, y = Y, colour = b_Intercept), alpha = 1) +
scale_colour_continuous_c4a_seq("brewer.yl_gn_bu", name = tit) +
theme_bw() +
xlab("") + ylab("")
tit <-expression(paste(""*beta[`pef`]*""))
p_sp2 <-
ggplot() + geom_sf(data = lb, col = "lightgrey") +
geom_point(data = vcs, aes(x = X, y = Y, colour = b_pef), alpha = 1) +
scale_colour_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
theme_bw() +
xlab("") + ylab("")
plot_grid(p_time, plot_grid(p_sp1, p_sp2, ncol = 1), nrow = 1, rel_widths = c(3.5,6))
Figure: the space-time varying coefficient estimates for pef.

Of course, it is also possible to examine discrete time slices of the spatial distribution of the relationship of pef with priceper. This can be done over grid surfaces, as was done before. Notice in the code below how the terms for the Intercept and pef are extracted for each time period in the looped call to calculate_vcs. The east-west trend and its changes over time are confirmed.
# 1. create time intervals (as above)
pred_days = seq(365, 2555, 365)/100
# 2. create coefficient estimates for each time period (n = 7)
res_out <- matrix(nrow = nrow(l_grid), ncol = 0)
for (i in pred_days){
res.i <- calculate_vcs(input_data = l_grid |> mutate(days = i),
mgcv_model = gam.st1,
terms = c("Intercept", "pef"))
# select just the coefficient estimates of interest
res.i <- res.i |> st_drop_geometry() |> select(starts_with("b_pef"))
res_out <- cbind(res_out, res.i)
}
# 3. name with years and join to the grid
colnames(res_out) <- paste0("Y", "_", 2018:2024)
# define a title
tit <-expression(paste(""*beta[`pef`]*""))
l_grid |> cbind(res_out) |>
# select the variables and pivot longer
select(starts_with("Y_")) |>
# rename
rename(`2018` = "Y_2018", `2019` = "Y_2019", `2020` = "Y_2020",
`2021` = "Y_2021", `2022` = "Y_2022", `2023` = "Y_2023", `2024` = "Y_2024") |>
pivot_longer(-geometry) |>
# make the new days object a factor (to enforce plotting order)
mutate(name = factor(name, levels = 2018:2024)) |>
# 4. and plot
ggplot() +
geom_sf(aes(fill = value), col = NA) +
# adjust default shading
scale_fill_continuous_c4a_div(name = "Potential Energy\nEfficiency") +
# facet
facet_wrap(~name, ncol = 3) +
# apply and modify plot theme
theme_bw() +
theme(
legend.position = "inside",
legend.direction = "horizontal",
legend.position.inside = c(0.7, 0.15),
legend.text=element_text(size=10),
legend.title=element_text(size=12),
strip.background = element_rect(fill="white", colour = "white"),
strip.text = element_text(size = 8, margin = margin(b=4)),
legend.key.width = unit(1.5, "cm"),
axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
Figure: time series of maps of the pef coefficient estimate.

The previous section created a GAM model with a single combined space-time smooth for the pef predictor variable. It is also possible to include space and time within separate smooths, as was done to investigate the variability of the target variable. The code snippet below does this, and again includes a space-time varying intercept.
gam.st2 <- gam(priceper~0 + Intercept +
s(X,Y,by=Intercept) + s(days, by=Intercept) +
pef + s(X,Y, by = pef) + s(days, by = pef),
data = hp_data)
summary(gam.st2)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days,
#> by = Intercept) + pef + s(X, Y, by = pef) + s(days, by = pef)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 8204.971 442.323 18.550 <2e-16 ***
#> pef -5.215 13.861 -0.376 0.707
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 28.000 28.000 5.870 < 2e-16 ***
#> s(days):Intercept 3.579 4.443 3.896 0.00277 **
#> s(X,Y):pef 21.654 26.022 2.039 0.00156 **
#> s(days):pef 1.333 1.333 1.432 0.32924
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 77/80
#> R-sq.(adj) = 0.595 Deviance explained = 94.2%
#> GCV = 3.3618e+06 Scale est. = 3.2623e+06 n = 1888
Here the fixed global term is again not significant and, interestingly, neither is the separate temporal smooth, as confirmed by the plots of the smooths. Taken together, this confirms what was found in the previous subsection with a combined TP smooth: the spatial trend in the relationship of pef with priceper, changing in intensity over time, is confirmed, but the interaction over space does not change over time. This suggests a super-imposition of spatial and temporal trends: the temporal smooth with days is not significant.
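The smooths can be plotted as before:
# plot the separate space and time smooths in gam.st2
plot(gam.st2, pages = 1)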
Figure: the mgcv plots of the separate space and time smooths for the pef predictor variable.

This section has applied smooths to a single predictor variable. These have explored time, space and space-time, by constructing smooths parameterised with time and location. These allow changes in the relationship between the predictor variable and the target variable to be examined over space and time. Importantly, a space-time GAM was constructed in 2 different ways: by including space and time in a single combined smooth, and within separate smooths. These reflect different assumptions about the nature of the space-time dependencies and interactions in the data, and can result in different inferences (understanding) about the process when the significant components of the model are considered. This may not be important if the objective of model construction is prediction, but it is if the objective is process understanding, as is commonly the case in geographical analyses of space-time data.
There are 2 important implications that follow from this, as illustrated by the gam.st2 model above.

stgam: model selection

The models constructed in Section 3 used a variety of smooths.
For example, the Intercept was modelled as a parametric term with separate space and time smooths (in the gam.st2 model), and the pef predictor variable was included in parametric form (gam.5), in parametric form with a single space-time smooth (gam.st1), and in parametric form with separate space and time smooths (gam.st2). This poses the question of which form to use and how to specify the model: which is best? A key focus of the stgam package is to seek to answer this question. It does this by creating and evaluating multiple models.
One way to evaluate models is to compare them using some metric. Commonly AIC (Akaike 1998), corrected AIC (AICc) (Hurvich and Tsai 1989) or BIC (Schwarz 1978) are used to compare models, as they provide parsimonious measures of model fit (balancing model complexity and prediction accuracy). However, there are some uncertainties over using these to compare mgcv GAM smooth models, due to variations in the effective degrees of freedom (EDF) introduced by the smooths. The result is that model EDFs vary depending on how the smoothing parameters are selected, potentially leading to over-penalization or under-penalization in BIC calculations. As a result, the model Generalized Cross-Validation (GCV) score is recommended for evaluating and comparing mgcv GAMs (S. N. Wood 2017; Marra and Wood 2011). GCV provides an un-biased risk estimator of model fit. It is similar to BIC in that it balances model fit with complexity, but it does not suffer from some of the problems of using BIC with GAMs (e.g. bias in non-parametric model comparison, over-penalising complex smooths, etc.). The best model is the one that minimises the GCV score.
The code below extracts the GCV from each of the models in the last section and prints them out in order. Here it can be seen that the GAM model with pef specified in separate space and time smooths, gam.st2, generated the best model.
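A minimal version of this, assuming the models fitted above are still in the workspace and using the gcv.ubre component that mgcv stores in each fitted gam object, might be:
# collect and rank the GCV scores of the candidate models
data.frame(model = c("gam.5", "gam.t", "gam.s", "gam.st1", "gam.st2"),
           GCV = c(gam.5$gcv.ubre, gam.t$gcv.ubre, gam.s$gcv.ubre,
                   gam.st1$gcv.ubre, gam.st2$gcv.ubre)) |>
  arrange(GCV)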
In a space-time model there are 6 options for each predictor variable:
i. absent from the model;
ii. included as a parametric (fixed) term;
iii. included in a spatial smooth;
iv. included in a temporal smooth;
v. included in separate spatial and temporal smooths;
vi. included in a single space-time smooth.
The intercept can be treated similarly, but without it being absent (i.e. 5 options). Obviously there are 3 options for spatial models (i., ii. and iii.) and for temporal models (i., ii. and iv.), with each having 2 options for the intercept. Thus for a purely spatial or purely temporal regression with \(k\) predictor variables there are \(2 \times 3^k\) potential models, and for a space-time regression there are \(5 \times 6^k\) potential models to evaluate.
Recall that hp_data contains a number of numeric variables that could be of use in explaining the space-time variations in the priceper target variable:
head(hp_data)
#> # A tibble: 6 × 17
#> price priceper tfa dot days yot beds type cef pef ageb lad X Y Xo Yo Intercept
#> <dbl> <dbl> <dbl> <date> <dbl> <dbl> <int> <chr> <int> <int> <chr> <chr> <dbl> <dbl> <int> <int> <dbl>
#> 1 1000 7463. 134 2023-02-22 18.8 2023 9 T 60 82 1930-19… E090… 532. 189. 531674 188707 1
#> 2 520 7879. 66 2022-05-27 16.0 2022 3 T 57 74 1930-19… E090… 537. 189. 536888 189269 1
#> 3 325 5078. 64 2021-01-15 11.1 2021 4 T 50 82 1950-19… E090… 550. 191. 549763 190717 1
#> 4 300 2400 125 2020-01-17 7.44 2020 5 T 64 85 1900-19… E090… 539. 187. 538646 186513 1
#> 5 585 7222. 81 2020-11-13 10.4 2020 4 T 75 88 1983-19… E090… 515. 173. 514710 172535 1
#> 6 618. 6786. 91 2023-05-18 19.6 2023 4 T 61 84 1950-19… E090… 533. 170. 533233 170230 1
These include cef (Current energy efficiency rating), pef (Potential energy efficiency rating) and beds (Number of bedrooms), as well as location (X and Y) and time (days). Thus with \(k = 3\) variables there are 1080 potential models to evaluate in the space-time case study and 54 potential models in a spatial case study. Here 2 of the variables are used to illustrate an evaluation of 180 space-time models: pef as before, and beds.
The evaluate_models() function in the stgam package constructs and compares the full set of potential models. It uses the GCV value of each model to evaluate them. Note that in the code below, ncores is set to 2 to pass CRAN diagnostic checks; you may want to specify more using detectCores()-1 from the parallel package. Remember that the input data needs to have a defined Intercept term, which can be created with input_data |> mutate(Intercept = 1) (this was done at the start of Section 3 for hp_data).
library(doParallel)
t1 <- Sys.time()
stvc_mods <- evaluate_models(
input_data = hp_data,
target_var = "priceper",
vars = c("pef", "beds"),
coords_x = "X",
coords_y = "Y",
VC_type = "STVC",
time_var = "days",
ncores = 2)
Sys.time() - t1 # about 14 minutes (6 minutes with 15 cores!)
The best \(n\) models can be extracted (i.e. those with the lowest GCV score) using the gam_model_rank() function in the stgam package introduced in Section 3. Interestingly, the top 5 ranked models all have space and time specified for each predictor variable, but in different combinations of single and separate space-time smooths. The top 4 models all have the beds predictor variable specified in separate space and time smooths, suggesting that while there are spatial and temporal dependencies with the target variable, there are no interacting space-time ones.
mod_comp <- gam_model_rank(stvc_mods, n= 10)
# have a look
mod_comp |> select(-f)
#> Rank Intercept pef beds GCV
#> 1 1 t2_ST s_T + s_S s_T + s_S 16813.82
#> 2 2 s_T + s_S s_T + s_S s_T + s_S 16815.01
#> 3 3 t2_ST t2_ST s_T + s_S 16815.67
#> 4 4 s_T + s_S t2_ST s_T + s_S 16822.48
#> 5 5 s_T + s_S s_T + s_S t2_ST 16823.02
#> 6 6 s_S s_T + s_S s_T + s_S 16824.47
#> 7 7 t2_ST s_S s_T + s_S 16826.02
#> 8 8 t2_ST s_T + s_S t2_ST 16826.23
#> 9 9 t2_ST s_T + s_S s_S 16827.10
#> 10 10 s_T + s_S s_S s_T + s_S 16827.90
The best ranked model can be specified for use in analysis. This includes the Intercept in a single space-time TP smooth and separate space and time smooths for pef and beds. First the formula is extracted and can be inspected:
f <- as.formula(mod_comp$f[1])
f
#> priceper ~ Intercept - 1 + t2(X, Y, days, d = c(2, 1), by = Intercept) +
#> s(X, Y, by = pef) + s(days, by = pef) + s(X, Y, by = beds) +
#> s(days, by = beds)
This formula is used to specify an mgcv GAM model using a REML approach. The choice of REML is described in Section 5 below. The resulting model is checked for over-fitting using the k.check function in the mgcv package. This underpins the gam.check function but does not display the diagnostic plots. Here the k-index values are near to 1, the k' and edf parameters are not close, and importantly the edf values are all much less than the k' values, so this model is reasonably well tuned.

NB: if the k' and edf parameters are close for some smooths, then you may want to increase the k parameter in the relevant smooth. These are automatically determined by the mgcv package but can be specified manually; see the mgcv help for k.check and choose.k.
# specify the model
gam.m <- gam(f, data = hp_data, method = "REML")
# check k
k.check(gam.m)
#> k' edf k-index p-value
#> t2(X,Y,days):Intercept 124 30.729788 0.8165558 0.0000
#> s(X,Y):pef 30 2.055193 0.6432838 0.0000
#> s(days):pef 10 3.183809 1.0422023 0.9675
#> s(X,Y):beds 30 24.222744 0.6432838 0.0000
#> s(days):beds 10 2.936562 1.0422023 0.9550
A summary of the model can be examined. This shows that nearly all of the terms are significant, except the spatial smooth for pef and the temporal smooth for beds.
summary(gam.m)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> priceper ~ Intercept - 1 + t2(X, Y, days, d = c(2, 1), by = Intercept) +
#> s(X, Y, by = pef) + s(days, by = pef) + s(X, Y, by = beds) +
#> s(days, by = beds)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 10040.1 454.6 22.09 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> t2(X,Y,days):Intercept 30.730 37.032 2.598 1.36e-06 ***
#> s(X,Y):pef 2.055 2.082 1.247 0.28327
#> s(days):pef 3.184 3.673 4.121 0.00347 **
#> s(X,Y):beds 24.223 26.353 5.647 < 2e-16 ***
#> s(days):beds 2.937 3.349 1.062 0.31011
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 203/205
#> R-sq.(adj) = 0.601 Deviance explained = 61.5%
#> -REML = 16814 Scale est. = 3.2129e+06 n = 1888
The spatially and temporally varying coefficient estimates generated by the model can be extracted in a number of different ways using the calculate_vcs function in the stgam package, as indicated in earlier sections. There are basically 2 options for generating the varying coefficient estimates: i) over the original data points, or ii) over spatial surfaces for specific time slices.

For the first option the original data, the GAM model and a vector of the model terms (i.e. predictor variable names) are simply passed to the calculate_vcs function. The second option requires slices of the space-time data to be passed to the calculate_vcs function, and a bit more work to process the results. In both cases the results can be summarised and mapped. Considering first the original observations:
vcs <- calculate_vcs(input_data = hp_data,
mgcv_model = gam.m,
terms = c("Intercept", "pef", "beds"))
A summary over space and time of the coefficient estimates shows that the intercept is large and positive, and that pef and beds are generally negative, but with some positive values in some places or at some times.
vcs |> select(starts_with("b_")) |>
apply(2, summary) |> round(1)
#> b_Intercept b_pef b_beds
#> Min. 4236.0 -60.0 -552.4
#> 1st Qu. 8303.6 -34.5 -369.7
#> Median 10187.8 -26.3 -260.3
#> Mean 10040.1 -26.1 -240.8
#> 3rd Qu. 11703.2 -17.6 -163.4
#> Max. 15778.7 7.7 388.0
These can be examined over time:
vcs |>
select(dot, starts_with("b_")) |>
rename(`Intercept` = b_Intercept,
`Potential Energy Efficiency` = b_pef,
`Bedrooms` = b_beds) |>
pivot_longer(-dot) |>
mutate(name = factor(name,
levels=c("Intercept","Potential Energy Efficiency", "Bedrooms"))) |>
group_by(dot, name) |>
summarise(
lower = quantile(value, 0.25),
median = median(value),
upper = quantile(value, 0.75)
) |>
ggplot(aes(x = dot, y = median)) +
geom_point(col = "blue", alpha = 0.2) +
geom_smooth() +
facet_wrap(~name, scale = "free_y") +
theme_bw() + xlab("") + ylab("") +
theme(strip.background = element_rect(fill="white"))
#> `summarise()` has grouped output by 'dot'. You can override using the `.groups` argument.
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Standard dplyr and ggplot approaches can be used to join and map the coefficient estimates, as in the figure below, which summarises the coefficient estimates for pef (b_pef) by year.
# make spatial data
vcs_sf <-
vcs |>
st_as_sf(coords = c("X", "Y"), remove = F)
# plot
ggplot() +
geom_sf(data = lb) +
geom_sf(data = vcs_sf, aes(col = b_pef)) +
scale_colour_continuous_c4a_div(palette="brewer.rd_bu",
name = "Potential\nEnergy Efficiency") +
facet_wrap(~yot) +
theme_bw() +
theme(
legend.position = "inside",
legend.direction = "horizontal",
legend.position.inside = c(0.7, 0.15),
legend.text=element_text(size=10),
legend.title=element_text(size=12),
strip.background = element_rect(fill="white", colour = "white"),
strip.text = element_text(size = 8, margin = margin(b=4)),
legend.key.width = unit(1.5, "cm"),
axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
Figure: the pef (Potential Energy Efficiency) coefficient estimates over space and time.

Finally, the l_grid can be used to summarise the coefficient estimates over space and time, here for each year, in a slightly different way to the use of pred_days above. Notice how the for loop below combines all the coefficient estimates this time. The maps reflect the relatively small changes in the relationship with priceper over time plotted above.
# create time slices
years <- 2018:2024
# calculate over the grid for each time slice
res_out <- matrix(nrow = nrow(l_grid), ncol = 0)
for (i in 1:length(years)){
# convert years to days
day.val = (years[i]-2018) * 365 / 100
res.i <- calculate_vcs(input_data = l_grid |> mutate(days = day.val),
mgcv_model = gam.m,
terms = c("Intercept", "pef", "beds"))
# select all the coefficient estimates
res.i <-
res.i |>
st_drop_geometry() |>
select(starts_with("b_"),
starts_with("se_"))
# rename them
names(res.i) <- paste0(names(res.i), "_", years[i])
# bind to the result
res_out <- cbind(res_out, res.i)
cat(years[i], "\t")
}
#> 2018 2019 2020 2021 2022 2023 2024
# title
tit <-expression(paste(""*beta[`beds`]*""))
# join to the grid
l_grid |> cbind(res_out) |>
# select the variables and pivot longer
select(starts_with("b_beds")) |>
# rename
rename(`2018` = "b_beds_2018", `2019` = "b_beds_2019",
`2020` = "b_beds_2020", `2021` = "b_beds_2021",
`2022` = "b_beds_2022", `2023` = "b_beds_2023",
`2024` = "b_beds_2024") |>
pivot_longer(-geometry) |>
# make the new days object a factor (to enforce plotting order)
mutate(name = factor(name, levels = 2018:2024)) |>
# 4. and plot
ggplot() +
geom_sf(aes(fill = value), col = NA) +
# adjust default shading
scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
# facet
facet_wrap(~name, ncol = 3) +
# apply and modify plot theme
theme_bw() +
theme(
legend.position = "inside",
legend.direction = "horizontal",
legend.position.inside = c(0.7, 0.15),
legend.text=element_text(size=10),
legend.title=element_text(size=12),
strip.background = element_rect(fill="white", colour = "white"),
strip.text = element_text(size = 8, margin = margin(b=4)),
legend.key.width = unit(1.5, "cm"),
axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
Figure: the beds (Bedrooms) coefficient estimates over time and the grid surface.

This section has illustrated the use of functions in the stgam package. It suggests the following workflow, starting by formatting the data.frame, tibble or sf object containing the data to have single location and time variables for each observation (in the above examples these were X, Y and days) and an Intercept as an addressable term.

This workflow evaluates and ranks multiple models using the model GCV value. This was done algorithmically using the evaluate_models function. However, this was not undertaken in isolation. Rather, it built on the investigations in Section 3 to determine whether space and time effects were present in the data. This set of model investigations was undertaken to both develop and confirm knowledge of processes related to house prices in London. That is, the analysis was both considered and contextual. For example, in Section 3 it was determined that a varying intercept term was appropriate, and with a different dataset there may be a need to explore different avenues. Similarly, more of the model space could have been examined, for example to include cef in the models. The stgam package allows these choices to be validated through an automated approach, providing an exploration of the full set of potential choices. The GCV, as an unbiased risk estimator, was useful in helping to evaluate models.
All of the space-time GAM models in Section 4 of this vignette were specified with method = "REML". Model estimation in GAMs can be conducted in two ways: (a) predictive (i.e., out-of-sample minimization) via Generalized Cross-Validation (GCV), and (b) Bayesian (i.e., attaching priors to basis coefficients) via restricted maximum likelihood (REML). REML tends to provide more stable estimates of the smoothing parameters (i.e., reduced over-fitting) and tends to provide more robust estimates of the variance. GCV is more computationally efficient but can over-fit, especially when errors are correlated. GCV is set as the default in mgcv.
In Section 4.3, the GAM model was checked for under- and over-fitting using the k.check function in the mgcv package; the advice was to increase k in the smooths if the k' and edf parameters were close. This can be done by specifying it manually in the smooth, as in the hypothetical example below.
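A sketch of this, in which the k values are purely illustrative rather than recommended settings:
# hypothetical example: manually setting the basis dimension k for each smooth
gam.k <- gam(priceper ~ 0 + Intercept +
               s(X, Y, by = Intercept, k = 60) +  # k = 60 is illustrative
               s(days, by = Intercept, k = 20),   # k = 20 is illustrative
             data = hp_data, method = "REML")
k.check(gam.k)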
The number of knots in the smooth (\(k\)) determines the complexity of the response-to-predictor variable smooths in the GAM. However, determining k is a balance: setting it too large can result in over-fitting, and setting it too low in under-fitting. Computational issues (speed and model stability) can arise if k is set too large, especially in the space-time case for small datasets.
We think these talks by Noam Ross provide really useful tips and pointers for working with GAMs and mgcv:
The rationale for using GAMs with TP smooths for spatially varying coefficient or temporally varying coefficient models is as follows:
The workflow suggested in this vignette and in the stgam package is to determine the most appropriate model form, with each step evaluated by minimising GCV, an un-biased risk estimator recommended for evaluating mgcv GAMs (S. N. Wood 2017; Marra and Wood 2011). This avoids making potentially unreasonable assumptions about how space and time interact in space-time varying coefficient models. The best model can then be extracted and examined for the nature of the space-time interactions it suggests, before generating the varying coefficient estimates and mapping or plotting the results.
Future developments will seek to examine the scales of spatial dependencies in the data using kriging-based approaches, and will extend the stgam package to work with large data (i.e. to draw on the functionality of the gamm function in mgcv).