Multi Class vtreat

John Mount

2018-09-10

Problem: try to prepare data to model multi-class y as a function of x using vtreat. vtreat does not directly do this, but can be used to do this. In this note we share functions to make this adaption.

library("vtreat")

The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are (as of version 1.2.4) part of vtreat.

Our specific example: try to model multi-class y as a function of x1 and x2.

# create example data
set.seed(326346)
sym_bonuses <- rnorm(3)
names(sym_bonuses) <- c("a", "b", "c")
sym_bonuses3 <- rnorm(3)
names(sym_bonuses3) <- as.character(seq_len(length(sym_bonuses3)))
n_row <- 1000
d <- data.frame(x1 = rnorm(n_row),
                x2 = sample(names(sym_bonuses), n_row, replace = TRUE),
                x3 = sample(names(sym_bonuses3), n_row, replace = TRUE),
                y = "NoInfo",
                stringsAsFactors = FALSE)
d$y[sym_bonuses[d$x2] > pmax(d$x1, sym_bonuses3[d$x3], runif(n_row))] <- "Large1"
d$y[sym_bonuses3[d$x3] > pmax(sym_bonuses[d$x2], d$x1, runif(n_row))] <- "Large2"

knitr::kable(head(d))
x1 x2 x3 y
0.8178292 b 3 Large2
0.5867139 b 3 Large2
-0.6711920 a 3 Large2
0.1033166 c 2 NoInfo
-0.3182176 b 1 NoInfo
-0.5914308 c 2 NoInfo

We define the problem controls and use mkCrossFrameMExperiment() to build both a cross-frame and a treatment plan.

# define problem
vars <- c("x1", "x2", "x3")
y_name <- "y"
y_levels <- sort(unique(d[[y_name]]))

# build the multi-class cross frame and treatments
cfe_m <- mkCrossFrameMExperiment(d, vars, y_name)

The cross-frame is the entity safest for training on (unless you have made separate data split for the treatment design step). It uses cross-validation to reduce nested model bias. Some notes on this issue are available here, and here.

# look at the data we would train models on
str(cfe_m$cross_frame)
## 'data.frame':    1000 obs. of  16 variables:
##  $ x1_clean      : num  0.818 0.587 -0.671 0.103 -0.318 ...
##  $ x2_catP       : num  0.313 0.313 0.325 0.362 0.313 0.362 0.362 0.325 0.313 0.325 ...
##  $ x3_catP       : num  0.333 0.333 0.333 0.347 0.32 0.347 0.333 0.347 0.333 0.347 ...
##  $ x2_lev_x_a    : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ x2_lev_x_b    : num  1 1 0 0 1 0 0 0 1 0 ...
##  $ x2_lev_x_c    : num  0 0 0 1 0 1 1 0 0 0 ...
##  $ x3_lev_x_1    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 1 0 1 ...
##  $ x3_lev_x_3    : num  1 1 1 0 0 0 1 0 1 0 ...
##  $ Large1_x2_catB: num  -11.23 -11.2 1.25 -11.41 -11.27 ...
##  $ Large1_x3_catB: num  -11.356 -11.239 -11.239 0.379 0.431 ...
##  $ Large2_x2_catB: num  0.0862 0.1446 -0.0243 -0.1268 0.0862 ...
##  $ Large2_x3_catB: num  4.98 6.09 4.69 -3.11 -13.86 ...
##  $ NoInfo_x2_catB: num  -0.0537 0.1084 -0.2827 0.2859 0.1084 ...
##  $ NoInfo_x3_catB: num  -4.82 -5.24 -4.83 2.13 2.53 ...
##  $ y             : chr  "Large2" "Large2" "Large2" "NoInfo" ...

prepare() can apply the designed treatments to new data. Here we are simulating new data by re-using our design data.

# pretend original data is new data to be treated
# NA out top row to show processing
for(vi in vars) {
  d[[vi]][[1]] <- NA
}
str(prepare(cfe_m$treat_m, d))
## 'data.frame':    1000 obs. of  16 variables:
##  $ x1_clean      : num  0.0205 0.5867 -0.6712 0.1033 -0.3182 ...
##  $ x2_catP       : num  0.0005 0.313 0.325 0.362 0.313 0.362 0.362 0.325 0.313 0.325 ...
##  $ x3_catP       : num  0.0005 0.333 0.333 0.347 0.32 0.347 0.333 0.347 0.333 0.347 ...
##  $ x2_lev_x_a    : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ x2_lev_x_b    : num  0 1 0 0 1 0 0 0 1 0 ...
##  $ x2_lev_x_c    : num  0 0 0 1 0 1 1 0 0 0 ...
##  $ x3_lev_x_1    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 1 0 1 ...
##  $ x3_lev_x_3    : num  0 1 1 0 0 0 1 0 1 0 ...
##  $ Large1_x2_catB: num  0 -11.6 1.2 -11.8 -11.6 ...
##  $ Large1_x3_catB: num  0 -11.702 -11.702 0.411 0.436 ...
##  $ Large2_x2_catB: num  0 0.133 -0.0215 -0.0999 0.133 ...
##  $ Large2_x3_catB: num  0 5.1 5.1 -3.54 -14.29 ...
##  $ NoInfo_x2_catB: num  0 0.0206 -0.2829 0.2536 0.0206 ...
##  $ NoInfo_x3_catB: num  0 -4.95 -4.95 2.11 2.34 ...
##  $ y             : chr  "Large2" "Large2" "Large2" "NoInfo" ...

Obvious issues include: computing variable importance, and blow up and co-dependency of produced columns. These we leave for the next modeling step to deal with (this is our philosophy with most issues that involve joint distributions of variables).

We to have a start on variable importance.

knitr::kable(cfe_m$score_frame)
varName varMoves rsq sig outcome_level needsSplit extraModelDegrees origName code
x1_clean TRUE 0.0558908 0.0000382 Large1 FALSE 0 x1 clean
x2_catP TRUE 0.0275238 0.0038536 Large1 TRUE 2 x2 catP
x2_lev_x_a TRUE 0.2680953 0.0000000 Large1 FALSE 0 x2 lev
x2_lev_x_b TRUE 0.0885021 0.0000002 Large1 FALSE 0 x2 lev
x2_lev_x_c TRUE 0.1060407 0.0000000 Large1 FALSE 0 x2 lev
x3_catP TRUE 0.0000346 0.9183445 Large1 TRUE 2 x3 catP
x3_lev_x_1 TRUE 0.0141504 0.0382554 Large1 FALSE 0 x3 lev
x3_lev_x_2 TRUE 0.0140364 0.0390420 Large1 FALSE 0 x3 lev
x3_lev_x_3 TRUE 0.0955004 0.0000001 Large1 FALSE 0 x3 lev
x1_clean TRUE 0.0015382 0.1615618 Large2 FALSE 0 x1 clean
x2_catP TRUE 0.0013055 0.1971725 Large2 TRUE 2 x2 catP
x2_lev_x_a TRUE 0.0000387 0.8242956 Large2 FALSE 0 x2 lev
x2_lev_x_b TRUE 0.0014571 0.1730603 Large2 FALSE 0 x2 lev
x2_lev_x_c TRUE 0.0009604 0.2686774 Large2 FALSE 0 x2 lev
x3_catP TRUE 0.0007725 0.3211959 Large2 TRUE 2 x3 catP
x3_lev_x_1 TRUE 0.2602002 0.0000000 Large2 FALSE 0 x3 lev
x3_lev_x_2 TRUE 0.2483708 0.0000000 Large2 FALSE 0 x3 lev
x3_lev_x_3 TRUE 0.9197595 0.0000000 Large2 FALSE 0 x3 lev
x1_clean TRUE 0.0064771 0.0034947 NoInfo FALSE 0 x1 clean
x2_catP TRUE 0.0040540 0.0208595 NoInfo TRUE 2 x2 catP
x2_lev_x_a TRUE 0.0071709 0.0021196 NoInfo FALSE 0 x2 lev
x2_lev_x_b TRUE 0.0000340 0.8323647 NoInfo FALSE 0 x2 lev
x2_lev_x_c TRUE 0.0060493 0.0047665 NoInfo FALSE 0 x2 lev
x3_catP TRUE 0.0006576 0.3520950 NoInfo TRUE 2 x3 catP
x3_lev_x_1 TRUE 0.1838759 0.0000000 NoInfo FALSE 0 x3 lev
x3_lev_x_2 TRUE 0.1857824 0.0000000 NoInfo FALSE 0 x3 lev
x3_lev_x_3 TRUE 0.7372570 0.0000000 NoInfo FALSE 0 x3 lev
Large1_x2_catB TRUE 0.2675964 0.0000000 Large1 TRUE 2 x2 catB
Large1_x3_catB TRUE 0.0946910 0.0000001 Large1 TRUE 2 x3 catB
Large2_x2_catB TRUE 0.0000291 0.8472707 Large2 TRUE 2 x2 catB
Large2_x3_catB TRUE 0.9239860 0.0000000 Large2 TRUE 2 x3 catB
NoInfo_x2_catB TRUE 0.0068238 0.0027207 NoInfo TRUE 2 x2 catB
NoInfo_x3_catB TRUE 0.7326682 0.0000000 NoInfo TRUE 2 x3 catB

One can relate these per-target and per-treatment performances back to original columns b aggregating.

tapply(cfe_m$score_frame$rsq, cfe_m$score_frame$origName, max)
##         x1         x2         x3 
## 0.05589076 0.26809534 0.92398602
tapply(cfe_m$score_frame$sig, cfe_m$score_frame$origName, min)
##            x1            x2            x3 
##  3.819834e-05  1.892838e-19 5.746904e-258