Type: Package
Title: Solving Imbalanced Regression Tasks
Version: 0.1.4
Description: Imbalanced domain learning has almost exclusively focused on solving classification tasks, where the objective is to predict cases labelled with a rare class accurately. Such a well-defined approach for regression tasks lacked due to two main factors. First, standard regression tasks assume that each value is equally important to the user. Second, standard evaluation metrics focus on assessing the performance of the model on the most common cases. This package contains methods to tackle imbalanced domain learning problems in regression tasks, where the objective is to predict extreme (rare) values. The methods contained in this package are: 1) an automatic and non-parametric method to obtain such relevance functions; 2) visualisation tools; 3) suite of evaluation measures for optimisation/validation processes; 4) the squared-error relevance area measure, an evaluation metric tailored for imbalanced regression tasks. More information can be found in Ribeiro and Moniz (2020) <doi:10.1007/s10994-020-05900-9>.
URL: https://github.com/nunompmoniz/IRon
BugReports: https://github.com/nunompmoniz/IRon/issues
License: CC0
Encoding: UTF-8
LazyData: true
LinkingTo: Rcpp
Depends: R (≥ 2.10)
Imports: Rcpp, stats, ggpubr, gridExtra, ggplot2, robustbase, scam
Suggests: rpart, e1071, earth, randomForest, mgcv, reshape
RoxygenNote: 7.2.1
NeedsCompilation: yes
Packaged: 2023-01-19 16:31:27 UTC; admin
Author: Nuno Moniz [cre, aut], Rita P. Ribeiro [aut], Miguel Margarido [ctb]
Maintainer: Nuno Moniz <nmoniz2@nd.edu>
Repository: CRAN
Date/Publication: 2023-01-20 07:20:06 UTC

NO2Emissions

Description

The data are a subsample of 500 observations from a data set that originate in a study where air pollution at a road is related to traffic volume and meteorological variables, collected by the Norwegian Public Roads Administration. The response variable (column 1) consist of hourly values of the logarithm of the concentration of NO2 (particles), measured at Alnabru in Oslo, Norway, between October 2001 and August 2003. The predictor variables (columns 2 to 8) are the logarithm of the number of cars per hour, temperature $2$ meter above ground (degree C), wind speed (meters/second), the temperature difference between $25$ and $2$ meters above ground (degree C), wind direction (degrees between 0 and 360), hour of day and day number from October 1. 2001.

Usage

data(NO2Emissions)

Format

A "data.frame" structure with 500 observations, 8 numerical variables

Source

StatLib

Examples

data(NO2Emissions)
head(NO2Emissions)

Acceleration

Description

Dataset with acceleration target value w.r.t. 14 nominal and numerical variables

Usage

data(accel)

Format

A "data.frame" structure with 1732 observations, 3 nominal and 11 numerical predictor variables

Source

UCI Archive

References

Hadi Fanaee-T. and João Gama. Event labeling combining ensemble detectors and background knowledge. Prog. in Art. Int., pages 1-15, 2013. ISSN 2192-6352. (Springer)

Examples

data(accel)
head(accel)

Model Bias

Description

Model Bias

Usage

bias(trues, preds)

Arguments

trues

Target values from a test set of a given data set. Should be a vector and have the same size as the variable preds

preds

Predicted values given a certain test set of a given data set. Should be a vector and have the same size as the variable trues

Value

Value for model bias


Pearson's Correlation

Description

Pearson's Correlation

Usage

corr(trues, preds)

Arguments

trues

Target values from a test set of a given data set. Should be a vector and have the same size as the variable preds

preds

Predicted values given a certain test set of a given data set. Should be a vector and have the same size as the variable trues

Value

Value for the Pearson's correlation


Predictive Modelling Evaluation Statistics

Description

Evaluation statistics including standard and non-standard evaluation metrics. Returns a structure of data containing the results of several evaluation metrics (both standard and some focused on the imbalanced regression problem).

Usage

eval.stats(formula, train, test, y_pred, phi.parms = NULL, cf = 1.5)

Arguments

formula

A model formula

train

A data.frame object with the training data

test

A data.frame object with the test set

y_pred

A vector with the predictions of a given model

phi.parms

The relevance function providing the data points where the pairs of values-relevance are known (use ?phi.control() for more information). If this parameter is not defined, this method will create a relevance function based on the data.frame variable in parameter train. Default is NULL

cf

The coefficient used to calculate the boxplot whiskers in the event that a relevance function is not provided (parameter phi.parms)

Value

A list with four slots for the results of standard and relevance-based evaluation metrics

overall

Results for standard metrics MAE, MSE and RMSE, along with Pearson's Correlation, bias, variance and the Squared Error Relevance Area metric.

Examples

library(IRon)

if(requireNamespace("earth")) {

   data(accel)

   form <- acceleration ~ .

   ind <- sample(1:nrow(accel),0.75*nrow(accel))

   train <- accel[ind,]
   test <- accel[-ind,]

   ph <- phi.control(accel$acceleration)

   m <- earth::earth(form, train)
   preds <- as.vector(predict(m,test))

   eval.stats(form, train, test, preds)
   eval.stats(form, train, test, preds, ph)
   eval.stats(form, train, test, preds, ph, cf=3) # Focusing on extreme outliers

}



Standard Evaluation Metrics

Description

Mean Average Error

Usage

mae(trues, preds)

Arguments

trues

Target values from a test set of a given data set. Should be a vector and have the same size as the variable preds

preds

Predicted values given a certain test set of a given data set. Should be a vector and have the same size as the variable trues

Value

Value for the mean average error


Mean Squared Error

Description

Mean Squared Error

Usage

mse(trues, preds)

Arguments

trues

Target values from a test set of a given data set. Should be a vector and have the same size as the variable preds

preds

Predicted values given a certain test set of a given data set. Should be a vector and have the same size as the variable trues

Value

Value for the mean squared error


Obtain the relevance of data points

Description

The phi function retrieves the relevance value of the values in a target variable. It does so by resorting to the Piecewise Cubic Hermite Interpolation Polynomial method for interpolating over a set of maximum and minimum relevance points. The notion of relevance is associated with rarity.Nonetheless, this notion may depend on the domain experts knowledge

Usage

phi(y, phi.parms = NULL)

Arguments

y

The target variable of a given data set

phi.parms

The relevance function providing the data points where the pairs of values-relevance are known

Value

A vector with the relevance values of a given target variable

Examples

library(IRon)
data(accel)

ind <- sample(1:nrow(accel),0.75*nrow(accel))

train <- accel[ind,]
test <- accel[-ind,]

ph <- phi.control(train$acceleration)
phis <- phi(test$acceleration,phi.parms=ph)

plot(test$acceleration,phis,xlab="Y",ylab="Relevance")

Generation of relevance function

Description

This procedure enables the generation of a relevance function that performs a mapping between the values in a given target variable and a relevance value that is bounded by 0 (minimum relevance) and 1 (maximum relevance). This may be obtained automatically (based on the distribution of the target variable) or by the user defining the relevance values of a given set of target values - the remaining values will be interpolated.

Usage

phi.control(
  y,
  phi.parms,
  method = phiMethods,
  extr.type = NULL,
  control.pts = NULL,
  asym = TRUE,
  ...
)

Arguments

y

The target variable of a given data set

phi.parms

The relevance function providing the data points where the pairs of values-relevance are known

method

The method used to generate the relevance function (extremes or range)

extr.type

Type of extremes to be considered: low, high or both (default)

control.pts

Parameter required when using 'range' method, representing a 3-column matrix of y-value, corresponding relevance value (between 0 and 1), and the derivative of such relevance value

asym

Boolean for assymetric interpolation. Default TRUE, uses adjusted boxplot. When FALSE, uses standard boxplot.

...

Misc data to be added to the relevance function

Value

A list with three slots with information concerning the relevance function

method

The method used to generate the relevance function (extremes or range)

npts

?

control.pts

Three sets of values identifying the target value-relevance-derivate for the first low extreme value, the median, and first high extreme value

Examples

library(IRon)

data(accel)

ind <- sample(1:nrow(accel),0.75*nrow(accel))

train <- accel[ind,]
test <- accel[-ind,]

ph <- phi.control(train$acceleration); phiPlot(test$acceleration, ph)
ph <- phi.control(train$acceleration, extr.type="high"); phiPlot(test$acceleration, ph)
ph <- phi.control(train$acceleration, method="range",
  control.pts=matrix(c(10,0,0,15,1,0),byrow=TRUE,ncol=3)); phiPlot(test$acceleration, ph)


Relevance function for extreme target values

Description

Automatic approach to obtain a relevance function for a given target variable when the option of extremes is chosen, i.e. users are more interested in accurately predicting extreme target values

Usage

phi.extremes(
  y,
  extr.type = c("both", "high", "low"),
  coef = 1.5,
  asym = TRUE,
  ...
)

Arguments

y

The target variable of a given data set

extr.type

Type of extremes to be considered: low, high or both (default)

coef

Boxplot coefficient (default 1.5)

asym

Boolean for assymetric interpolation. Default TRUE, uses adjusted boxplot. When FALSE, uses standard boxplot.

...

Additional parameters

Value

A list with three slots with information concerning the relevance function

method

The method used to generate the relevance function (extremes or range)

npts

?

control.pts

Three sets of values identifying the target value-relevance-derivate for the first low extreme value, the median, and first high extreme value


Custom Relevance Function

Description

User-guided approach to obtain a relevance function for certain intervals of the target variable when the option of range is chosen in function phi.control, i.e. users define the relevance of values for which it is known

Usage

phi.range(y, control.pts, ...)

Arguments

y

The target variable of a given data set

control.pts

Parameter representing a 3-column matrix of y-value, corresponding relevance value (between 0 and 1), and the derivative of such relevance value, allowing users to specify the known relevance at given target values

...

Additional parameters

Value

A list with three slots with information concerning the relevance function

method

The method used to generate the relevance function (extremes or range)

npts

?

control.pts

Three sets of values identifying the target value-relevance-derivate for the first low extreme value, the median, and first high extreme value


Plot of phi versus y and boxplot of y

Description

The phiPlot function uses a dataset ds containing many y values to produce a line plot of phi versus y and a boxplot of y, and aligns them, one above the other. The first extreme value on either side of the boxplot should correspond to the point where phi becomes exactly 1 on the line plot. This function is dependent on the robustbase, ggplot2 and ggpubr packages, and will not work without them.

Usage

phiPlot(ds, phi.parms = NULL, limits = NULL, xlab = "y", ...)

Arguments

ds

Dataset of y values

phi.parms

The relevance function providing the data points where the pairs of values-relevance are known. Default is NULL

limits

Vector with values to draw limits. Default is NULL

xlab

Label of the x axis. Default is y

...

Extra parameters when deriving the relevance function

Value

A line plot of phi versus y, as well as a boxplot of y

Examples

ds <- rnorm(1000, 30, 10); phi.parms <- phi.control(ds); phiPlot(ds,phi.parms)
ds <- rpois(100,3); phiPlot(ds)


Root Mean Squared Error

Description

Root Mean Squared Error

Usage

rmse(trues, preds)

Arguments

trues

Target values from a test set of a given data set. Should be a vector and have the same size as the variable preds

preds

Predicted values given a certain test set of a given data set. Should be a vector and have the same size as the variable trues

Value

Value for the relevance-weighted root mean squared error


Non-Standard Evaluation Metrics

Description

Obtains the squared error of predictions for a given subset of relevance

Usage

ser(trues, preds, phi.trues = NULL, ph = NULL, t = 0)

Arguments

trues

Target values from a test set of a given data set. Should be a vector and have the same size as the variable preds

preds

Predicted values given a certain test set of a given data set. Should be a vector and have the same size as the variable preds

phi.trues

Relevance of the values in the parameter trues. Use ??phi() for more information. Defaults to NULL

ph

The relevance function providing the data points where the pairs of values-relevance are known. Default is NULL

t

Relevance cut-off. Default is 0.

Details

Squared Error-Relevance Metric (SER)

Value

Squared error for for cases where the relevance of the true value is greater than t (SERA)

Examples

library(IRon)
library(rpart)

if(requireNamespace("rpart")) {

   data(accel)

   form <- acceleration ~ .

   ind <- sample(1:nrow(accel),0.75*nrow(accel))

   train <- accel[ind,]
   test <- accel[-ind,]

   ph <- phi.control(accel$acceleration)

   m <- rpart::rpart(form, train)
   preds <- as.vector(predict(m,test))

   trues <- test$acceleration
   phi.trues <- phi(test$acceleration,ph)

   ser(trues,preds,phi.trues)

}


Squared Error-Relevance Area (SERA)

Description

Computes an approximation of the area under the curve described by squared error of predictions for a sequence of subsets with increasing relevance

Usage

sera(
  trues,
  preds,
  phi.trues = NULL,
  ph = NULL,
  pl = FALSE,
  m.name = "Model",
  step = 0.001,
  return.err = FALSE,
  norm = FALSE
)

Arguments

trues

Target values from a test set of a given data set. Should be a vector and have the same size as the variable preds

preds

Predicted values given a certain test set of a given data set. Should be a vector and have the same size as the variable preds

phi.trues

Relevance of the values in the parameter trues. Use ??phi() for more information. Defaults to NULL

ph

The relevance function providing the data points where the pairs of values-relevance are known. Default is NULL

pl

Boolean to indicate if an illustration of the curve should be provided. Default is FALSE

m.name

Name of the model to be appended in the plot title

step

Relevance intervals between 0 (min) and 1 (max). Default 0.001

return.err

Boolean to indicate if the errors at each subset of increasing relevance should be returned. Default is FALSE

norm

Normalize the SERA values for internal optimisation only (TRUE/FALSE)

Value

Value for the area under the relevance-squared error curve (SERA)

Examples

library(IRon)
library(rpart)

if(requireNamespace("rpart")) {

   #' data(accel)

   form <- acceleration ~ .

   ind <- sample(1:nrow(accel),0.75*nrow(accel))

   train <- accel[ind,]
   test <- accel[-ind,]

   ph <- phi.control(accel$acceleration)

   m <- rpart::rpart(form, train)
   preds <- as.vector(predict(m,test))

   trues <- test$acceleration
   phi.trues <- phi(test$acceleration,ph)

   sera(trues,preds,phi.trues)
   sera(trues,preds,phi.trues,pl=TRUE, m.name="Regression Trees")
   sera(trues,preds,phi.trues,pl=TRUE, return.err=TRUE)

}


Model Variance

Description

Model Variance

Usage

variance(preds)

Arguments

preds

Predicted values given a certain test set of a given data set. Should be a vector and have the same size as the variable trues

Value

Value for model variance