The Patient Rule Induction Method (PRIM) was introduced by Friedman and Fisher (1999). It is a technique from data mining for finding `interesting’ regions in high-dimensional data. We start with regression-type data (X1, Y1), …, (Xn, Yn) where Xi is d-dimensional and Yi is a scalar response variable. We are interested in the conditional expectation function
m(x) = E (Y | X = x).
In the case where we have a single sample then PRIM finds the bumps of m(x).
We use a subset of the MASS::Boston data set. It
contains housing data measurements for 506 towns in the Boston, USA
area. For the explanatory variables, we take the nitrogen oxides
concentration in parts per 10 million nox and the average
number of rooms per dwelling rm. The response is the per
capita crime rate crim. We are interested in characterising
those areas with higher crime rates in order to provide better support
infrastructure.
library(prim)
data(Boston, package="MASS")
x <- Boston[,5:6]
y <- Boston[,1]
boston.prim <- prim.box(x=x, y=y, threshold.type=1)The default settings for prim.box are
peel.alpha=0.05pasting=TRUEpaste.alpha=0.01mass.min=0.05threshold is the overall mean of the response variable
ythreshold.type=0We use the default settings except that we wish to only find high
crime areas {m(x) \(\ge\) threshold} so we set
threshold.type=1.
We view the output using a summary method. This displays
three columns: the box mean, the box mass, and the threshold type. Each
line is a summary for each box, as well as an overall summary. The box
which is asterisked indicates that it is the box which contains the rest
of the data not processed by PRIM. There is one box which contains
42.89% of the towns and where the average crime rate is 7.62. This is
our HDR estimate. This regions comprises the bulk of the high crime
areas, and is described in terms of nitrogen oxides levels in [0.53,
0.74] and average number of rooms in [3.04, 7.07]. The other 57.11% of
the towns have an average crime rate of 0.6035.
summary(boston.prim, print.box=TRUE)
#> box-fun box-mass threshold.type
#> box1 7.6222290 0.4288538 1
#> box2* 0.6035267 0.5711462 NA
#> overall 3.6135236 1.0000000 NA
#>
#> Box limits for box1
#> nox rm
#> min 0.5332 3.0391
#> max 0.7400 7.0718
#>
#> Box limits for box2
#> nox rm
#> min 0.3364 3.0391
#> max 0.9196 9.3019We plot the PRIM boxes, including all those towns whose crime rate exceeds 3.5. Thus verifying that the majority of high crime towns fall inside thebump.
There are many options for the graphical display. See the help guide
for more details ?plot.prim.
The default function applied to the response variable y
is mean. We can input another function,
e.g. median, using
We compare the results: the box for the mean is in black, for the median in red:
plot(boston.prim, col="transparent")
plot(boston.prim.med, col="transparent", border="red", add=TRUE)
legend("topleft", legend=c("mean", "median"), col=1:2, lty=1, bty="n")The covariate x can also include categorical variables:
we replace the average number of rooms per dwelling rm with
the index of accessibility to radial highways rad which
takes integral values from 1 to 24 inclusive.
x2 <- Boston[,c(5,9)]
y <- Boston[,1]
boston.cat.prim <- prim.box(x=x2, y=y, threshold.type=1)
summary(boston.cat.prim, print.box=TRUE)
#> box-fun box-mass threshold.type
#> box1 7.2629703 0.4822134 1
#> box2* 0.2148022 0.5177866 NA
#> overall 3.6135236 1.0000000 NA
#>
#> Box limits for box1
#> nox rad
#> min 0.5380 4.0
#> max 0.9196 26.3
#>
#> Box limits for box2
#> nox rad
#> min 0.3364 -1.3
#> max 0.9196 26.3
plot(boston.cat.prim, col="transparent")
points(x2[y>3.5,])In the case where we have 2 samples, we can label the response as a binary variable with
Yi = 1 if Xi is in sample 1
or
Yi = -1 if Xi is in sample 2.
Then PRIM finds the regions where the samples are most different. Here we have a positive HDR (where sample 1 points dominate) and a negative HDR (where sample 2 points dominate).
We look at a 3-dimensional data set quasiflow included
in the prim library. It is a randomly generated data set
from two normal mixture distributions whose structure mimics some light
scattering data, taken from a machine known as a flow cytometer.
library(prim)
data(quasiflow)
yflow <- quasiflow[,4]
xflow <- quasiflow[,1:3]
xflowp <- quasiflow[yflow==1,1:3]
xflown <- quasiflow[yflow==-1,1:3]We can think of xflowp as flow cytometric measurements
from an HIV+ patient, and xflown from an HIV– patient.
There are two ways of using prim.box to estimate where
the two samples are most different (or equivalently to estimate the HDRs
of the difference of the density functions). In the first way, we assume
that we have suitable values for the thresholds. Then we can use
{r}= qflow.thr <- c(0.38, -0.23) qflow.prim <- prim.box(x=xflow, y=yflow, threshold=qflow.thr, threshold.type=0)
An alternative is compute PRIM box sequences which cover the range of
the response variable y, and then use prim.hdr
to experiment with different threshold values. This two-step process is
more efficient and faster than calling prim.box for each
different threshold. We are happy with the positive HDR threshold so we
can compute the positive HDR directly:
qflow.hdr.pos <- prim.box(x=xflow, y=yflow, threshold=0.38, threshold.type=1)
summary(qflow.hdr.pos)
#> box-fun box-mass threshold.type
#> box1 0.551879699 0.05808875 1
#> box2* -0.003431327 0.94191125 NA
#> overall 0.028825996 1.00000000 NAOn the other hand, we are not sure about the negative HDR thresholds.
So we try several different values for threshold.
qflow.neg <- prim.box(x=xflow, y=yflow, threshold.type=-1)
qflow.hdr.neg1 <- prim.hdr(qflow.neg, threshold=-0.23, threshold.type=-1)
qflow.hdr.neg2 <- prim.hdr(qflow.neg, threshold=-0.43, threshold.type=-1)
qflow.hdr.neg3 <- prim.hdr(qflow.neg, threshold=-0.63, threshold.type=-1)After examining the summaries and plots,
summary(qflow.hdr.neg1)
#> box-fun box-mass threshold.type
#> box1 -0.6767677 0.05188679 -1
#> box2 -0.3109244 0.05197414 -1
#> box3 -0.2778316 0.09023410 -1
#> box4 -0.2815199 0.05057652 -1
#> box5* 0.1580895 0.75532844 NA
#> overall 0.0288260 1.00000000 NAwe choose qflow.hdr.neg1 to combine with
qflow.hdr.pos.
qflow.prim2 <- prim.combine(qflow.hdr.pos, qflow.hdr.neg1)
summary(qflow.prim2)
#> box-fun box-mass threshold.type
#> box1 0.5518797 0.05808875 1
#> box2 -0.6767677 0.05188679 -1
#> box3 -0.3109244 0.05197414 -1
#> box4 -0.2778316 0.09023410 -1
#> box5 -0.2815199 0.05057652 -1
#> box6* 0.1252819 0.69723969 NA
#> overall 0.0288260 1.00000000 NAIn the plot below, the positive HDR is coloured orange (where the HIV+ sample is more prevalent), and the negative HDR is coloured blue (where the HIV- sample is more prevalent)
or equivalently as a 3D scatter plot.
plot(qflow.prim2, x.pt=xflow, pch=16, cex=0.5, alpha=0.1, splom=FALSE, colkey=FALSE, ticktype="detailed")Friedman, J. H. and Fisher, N. I. (1999) Bump-hunting for high dimensional data. Statistics and Computing 9, 123–143.
– Generated 12 May 2026.