Introduction to flevr

Brian D. Williamson

library(flevr)

Introduction

flevr is an R package for doing variable selection based on flexible ensembles. The package provides functions for extrinsic variable selection using the Super Learner and for intrinsic variable selection using the Shapley Population Variable Importance Measure (SPVIM).

The author and maintainer of the flevr package is Brian Williamson. For details on the method, check out our preprint.

Installation

You can install a development release of flevr from GitHub via devtools by running the following code:

# install devtools if you haven't already
# install.packages("devtools", repos = "https://cloud.r-project.org")
devtools::install_github(repo = "bdwilliamson/flevr")

Quick start

This section is a quick guide to using the flevr package. We cover the main functions for extrinsic and intrinsic variable selection using a simulated data example. More details are given in the specific vignettes for extrinsic selection and intrinsic selection.

First, we create some data:

# generate the data -- note that this is a simple setting, for speed
set.seed(4747)
p <- 2
n <- 500
# generate features
x <- replicate(p, stats::rnorm(n, 0, 1))
x_df <- as.data.frame(x)
x_names <- names(x_df)
# generate outcomes
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)

This creates a matrix of covariates x with 2 columns and a vector y of normally distributed outcome values for a sample of n = 500 study participants.
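As an optional sanity check on the simulated data above, the following base R sketch regenerates the data with the same seed, confirms the dimensions, and fits the true linear model. The names check_fit and coef_est are introduced here for illustration only and are not part of flevr.

```r
# Regenerate the simulated data exactly as above
set.seed(4747)
p <- 2
n <- 500
x <- replicate(p, stats::rnorm(n, 0, 1))
x_df <- as.data.frame(x)  # columns are named V1 and V2 by default
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)

# Confirm the shapes: n rows and p columns in x, n outcomes in y
dim(x)
length(y)

# Fit the data-generating (linear) model; check_fit and coef_est are
# illustrative names, not flevr objects
check_fit <- stats::lm(y ~ V1 + V2, data = cbind(x_df, y = y))
coef_est <- stats::coef(check_fit)
coef_est
```

With n = 500, the estimated coefficients should land near the true values of 1, 0.5, and 0.75, up to sampling noise.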

There are two main types of variable selection available in flevr: extrinsic and intrinsic. Extrinsic selection is the most common type of variable selection: in this approach, a given algorithm (and perhaps its associated algorithm-specific variable importance) is used for variable selection. The lasso is a widely-used example of extrinsic selection. Intrinsic selection, on the other hand, uses estimated intrinsic variable importance (a population quantity) to perform variable selection. This intrinsic importance is both defined and estimated in a model-agnostic manner.

Extrinsic variable selection

We recommend using the Super Learner (van der Laan et al., 2007) to do extrinsic variable selection to protect against model misspecification; more details on this procedure are available in the vignette on extrinsic selection. This requires specifying a library of candidate learners (e.g., lasso, random forests). We can do this in flevr using the following code:

set.seed(1234)
# fit a Super Learner ensemble; note its simplicity, for speed
library("SuperLearner")
learners <- c("SL.glm", "SL.mean")
V <- 2
fit <- SuperLearner::SuperLearner(Y = y, X = x_df,
                                  SL.library = learners,
                                  cvControl = list(V = V))
# extract importance based on the whole Super Learner
sl_importance_all <- extract_importance_SL(
  fit = fit, feature_names = x_names, import_type = "all"
)
sl_importance_all
#> # A tibble: 2 × 2
#>   feature  rank
#>   <chr>   <dbl>
#> 1 V2       1.01
#> 2 V1       1.99

These results suggest that feature 2 is more important than feature 1 within the Super Learner ensemble (since a lower rank is better). If we want to scrutinize the importance of features within the best-fitting algorithm in the Super Learner ensemble, we can do the following:

sl_importance_best <- extract_importance_SL(
  fit = fit, feature_names = x_names, import_type = "best"
)
sl_importance_best
#> # A tibble: 2 × 2
#>   feature  rank
#>   <chr>   <int>
#> 1 V2          1
#> 2 V1          2

Finally, to do variable selection, we need to select a threshold (ideally before looking at the data). In this case, since there are only two variables, we choose a threshold of 1.5, which means we will select only one variable:

extrinsic_selected <- extrinsic_selection(
  fit = fit, feature_names = x_names, threshold = 1.5, import_type = "all"
)
extrinsic_selected
#> # A tibble: 2 × 3
#>   feature  rank selected
#>   <chr>   <dbl> <lgl>   
#> 1 V2       1.01 TRUE    
#> 2 V1       1.99 FALSE

In this case, we select only variable 2.
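The thresholding step can be mimicked in base R. This is a minimal sketch that assumes a feature is selected when its rank falls strictly below the threshold, a rule inferred from the printed output above rather than taken from the flevr source. The ranks are copied by hand from that output.

```r
# Ranks copied from the printed output above (lower is better)
importance <- data.frame(
  feature = c("V2", "V1"),
  rank = c(1.01, 1.99)
)
threshold <- 1.5

# Assumed rule: keep features whose rank falls below the threshold
importance$selected <- importance$rank < threshold
importance
```

Under this assumed rule, a threshold of 1.5 keeps V2 and drops V1, which matches the selection reported above.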

Intrinsic variable selection

Intrinsic variable selection is based on population variable importance (Williamson and Feng, 2020); more details on this procedure are available in the vignette on intrinsic selection. Intrinsic selection also uses the Super Learner under the hood, and requires specifying a useful measure of predictiveness (e.g., R-squared or classification accuracy). The first step in doing intrinsic selection is estimating the variable importance:

set.seed(1234)

# set up a library for SuperLearner
learners <- "SL.glm"
univariate_learners <- "SL.glm"
V <- 2

# estimate the SPVIMs
library("vimp")
est <- suppressWarnings(
  sp_vim(Y = y, X = x, V = V, type = "r_squared",
         SL.library = learners, gamma = .1, alpha = 0.05, delta = 0,
         cvControl = list(V = V), env = environment())
)
est
#> Variable importance estimates:
#>       Estimate  SE         95% CI                  VIMP > 0 p-value     
#> s = 1 0.1515809 0.06090463 [0.03221005, 0.2709518] TRUE     1.330062e-03
#> s = 2 0.2990449 0.06565597 [0.17036157, 0.4277282] TRUE     6.863052e-09

This procedure again shows (correctly) that variable 2 is more important than variable 1 in this population.
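To build intuition for Shapley-based importance with R-squared as the predictiveness measure, the base R sketch below computes exact Shapley values for the two features using in-sample R-squared from linear models. It is a simplified illustration, not the sp_vim estimator: there is no Super Learner, no cross-fitting, and no subset sampling, so the numbers will differ from the estimates above. The helper r2 and the objects dat and shapley are hypothetical names introduced here.

```r
# Regenerate the simulated data used throughout
set.seed(4747)
n <- 500
x <- replicate(2, stats::rnorm(n, 0, 1))
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
dat <- data.frame(y = y, V1 = x[, 1], V2 = x[, 2])

# Hypothetical helper: in-sample R-squared of a linear model that uses
# only the given feature subset
r2 <- function(features) {
  if (length(features) == 0) return(0)  # intercept-only model explains nothing
  fit <- stats::lm(stats::reformulate(features, response = "y"), data = dat)
  summary(fit)$r.squared
}

# Predictiveness of every feature subset
v_empty <- r2(character(0))
v_1 <- r2("V1")
v_2 <- r2("V2")
v_12 <- r2(c("V1", "V2"))

# With two features, each Shapley value is the average of two marginal
# gains: adding the feature to the empty set and adding it to the other
# feature
shapley <- c(
  V1 = 0.5 * (v_1 - v_empty) + 0.5 * (v_12 - v_2),
  V2 = 0.5 * (v_2 - v_empty) + 0.5 * (v_12 - v_1)
)
shapley
```

By construction, the two Shapley values sum to the R-squared of the full model, so they split the total predictiveness between the features.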

The next step is to choose an error rate to control and a base method for controlling the family-wise error rate. Here, we control the generalized family-wise error rate (gFWER) overall and use Holm-adjusted p-values as the base method:

intrinsic_set <- intrinsic_selection(
  spvim_ests = est, sample_size = n, alpha = 0.2, feature_names = x_names,
  control = list(quantity = "gFWER", base_method = "Holm", k = 1)
)
intrinsic_set
#> # A tibble: 2 × 6
#>   feature   est       p_value adjusted_p_value  rank selected
#>   <chr>   <dbl>         <dbl>            <dbl> <dbl> <lgl>   
#> 1 V1      0.152 0.00133           0.00133          2 TRUE    
#> 2 V2      0.299 0.00000000686     0.0000000137     1 TRUE

In this case, we select both variables.
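The Holm adjustment in the table above can be reproduced with stats::p.adjust. This sketch copies the unadjusted p-values by hand from the sp_vim output and compares the adjusted values to alpha = 0.2. It illustrates only the Holm step; any further gFWER-specific processing inside intrinsic_selection is not reproduced here.

```r
# Unadjusted p-values copied from the sp_vim output above
p_values <- c(V1 = 1.330062e-03, V2 = 6.863052e-09)

# Holm step-down adjustment: the smallest p-value is multiplied by 2,
# the next by 1, and monotonicity is enforced
adjusted <- stats::p.adjust(p_values, method = "holm")
adjusted

# Compare to the chosen error level
alpha <- 0.2
adjusted < alpha
```

Both adjusted p-values fall below 0.2, consistent with both variables being selected.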