Rprofet

Introduction

This paper demonstrates how each function in the Rprofet package works while presenting a credit scoring workflow built on the package. The workflow starts with data splitting and then steps through each function using the Lending Club data. The process generally consists of binning the variables, applying the Weight of Evidence (WOE) transformation and calculating the information value (IV) for each variable, selecting variables, visualizing the selected variables, and fine-tuning them if necessary. Further variable selection can then be performed, and a scorecard is built after fitting a logistic regression model with the selected variables. After completing these steps on the training data, we apply the scorecard to the validation data.

The functions in the package are listed below:

  1. Variable Binning
    • BinProfet
  2. Weight of Evidence Transformation and Information Value
    • WOEProfet
  3. Variable Selection
    • WOEclust_hclust
    • WOEclust_kmeans
    • Var_select
  4. Visualization and Fine-tuning Variables
    • WOEplotter
    • WOE_customNum
    • WOE_customFac
  5. Scorecard and Scoring Validation data
    • ScorecardProfet
    • ScoreDataProfet

The dataset we use to illustrate the workflow of the Rprofet package is the Lending Club data. LendingClub is a lending marketplace: a peer-to-peer lending platform that connects investors with potential borrowers and offers both personal and business loans. Instead of the loan coming from a bank or financial institution, an investor funds a borrower’s loan as the lender. The Lending Club data consists of 20000 observations and 74 variables. Each observation represents a borrower, and the characteristics contain various pieces of information about the borrower. The variable “bad” is our target variable, and the “ID” variable is unique for each observation. A target variable and an ID variable are required in order to use almost all functions in the Rprofet package. In this data, an observation with target variable “bad” equal to 1 represents a loan default. The independent variables can be a mixture of character, factor, continuous, and binary data. It is important to note that this package assumes unique IDs, and all categorical data should be converted to factors before using the functions. The goal of the Rprofet package is to give the user control over each step of the credit scoring process when building a scorecard and applying it to validation data, while still being flexible enough to handle “messy” data. “Messy” data refers to data with multiple variable types, missing values, outliers, and special values; such data is common in the credit industry. Table \(\ref{tab:data-preview}\) shows a snapshot of the first 10 columns of the Lending Club data. For our workflow example, we first split the data into training and validation datasets with an 80:20 ratio. We build a scorecard on the training set and use it to score the validation set. Table \(\ref{tab:data-set-summary}\) shows the dimensions and bad rates of the datasets.

Head of first 10 columns of lending club data
ID bad emp_length home_ownership annual_inc verification_status purpose dti delinq_2yrs inq_last_6mths
54901 0 10+ years MORTGAGE 81600 Verified moving 12.79 0 0
178431 1 4 years RENT 24000 Source Verified debt_consolidation 10.60 0 0
163565 0 5 years MORTGAGE 85000 Not Verified car 11.94 0 0
44299 1 n/a OWN 30000 Verified medical 34.21 1 2
127976 0 5 years RENT 37000 Source Verified debt_consolidation 19.49 0 0
92363 0 3 years MORTGAGE 48800 Not Verified credit_card 20.59 0 0
Dimension of datasets
Dataset Rows Columns TargetRate
Original 20000 74 0.116
Training 16000 74 0.118
Validation 4000 74 0.108
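The 80:20 split itself can be done with a few lines of base R; a minimal sketch is shown below, where the object name lending for the full data frame and the seed value are assumptions for illustration only.

# Illustrative 80:20 split into training and validation sets
# (the object name `lending` and the seed are assumptions, not part of Rprofet)
set.seed(2023)
idx   <- sample(seq_len(nrow(lending)), size = floor(0.8 * nrow(lending)))
train <- lending[idx, ]    # 80% of observations, used to build the scorecard
valid <- lending[-idx, ]   # remaining 20%, held out for validation

# quick check of the bad rates in each set
c(train = mean(train$bad), valid = mean(valid$bad))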

Coarse Binning of the Variables

Binning variables is a process of transforming variables by dividing the values of each variable into fewer groups. The groups should make logical sense or have the same relative risk. This process is also referred to as coarse classing or grouping [\(\ref{Anderson}\)]. Variable binning converts continuous variables to categorical variables, while categorical variables already form their initial bins. Thus, all variables are categorical after the binning process.

An advantage of binning variables is that it reduces inconsistencies within the data such as non-linear relationships, missing values, and outliers. Missing values and outliers are commonly found in financial industry data and arise for various reasons. For example, figure \(\ref{fig:var-bf-bin}\) shows that the distribution of the “Months since recent inquiry” variable in the training data is right skewed. In addition, the variable contains 1793 missing values that are not visible in the figure.

Original Distribution of Months since recent inquiry

Since many statistical models use complete-case approaches, we need to take care of the missing values in our datasets. In most cases, missing values in financial data are not random; they may represent part of some characteristic or trend. There are many ways to handle missing data; a good practice is to group the missing values together and treat them as a separate attribute. Assigning special values to the missing values is another imputation method that may reduce the information lost. In this way, the missing values can be included in logistic regression models and assigned points in the scorecard. Outliers may be typing errors, extreme cases, or even fraud. They usually affect the regression analysis negatively and thus should be investigated first [\(\ref{Siddiqi}\)].

There are a variety of binning techniques available. More powerful predictor variables tend to have a greater number of potential coarse bins, so creating too few bins for a characteristic can lose important information. On the contrary, creating too many bins can increase the chance of overfitting. According to Anderson [\(\ref{Anderson}\)], the goal of variable binning is to keep as few bins as possible while minimizing the information loss in the variables. The number of bins for a variable should not exceed 10 in most cases, and the bins are not necessarily equal sized. However, grouping variables into equal-sized bins is a common practice. In addition, variables can also be grouped into bins with evenly spaced intervals based on the minimum and maximum values of the variable. The binning method in this package uses the greedy binning algorithm from the binr package [\(\ref{binr}\)], which sorts the values of a variable and then fills the target number of bins until each contains about the same number of observations. In this way, we are able to establish a basic relationship between the response and predictor variables and eliminate unimportant information in the data such as outliers. One important consideration is making sure the number of cases in each group is sufficient. The definition of “sufficient” varies and requires some judgment by the user. Anderson refers to two standards for sufficient binning: one states that there should be at least 40 of each outcome per coarse class, while the other states that each bin should contain a minimum of 5% of the population. If there are too few observations within a bin, joining similar groups can help to avoid the issue. The customized binning functions described in a later section are able to handle this issue. After binning, each characteristic should make logical sense, and each continuous characteristic usually has a monotonic relationship with the response function such as the target rate.

When using the BinProfet() function, multiple inputs control the general shape of the coarse bins. For instance, suppose we specify that each of the numeric variables should have \(10\) bins and that each bin should have a minimum of \(200\) observations. Using these parameters, we can then perform the coarse binning with the function.

binData = BinProfet(data = train, id = "ID", target = "bad", num.bins = 10,
                        min.pts.bin = 200)
Head of first 10 columns of the binned data
ID bad emp_length_Bins home_ownership_Bins annual_inc_Bins verification_status_Bins purpose_Bins dti_Bins delinq_2yrs_Bins inq_last_6mths_Bins
54901 0 10+ years MORTGAGE [73061,85250) Verified moving [10.58,13.14) [0,1) [0,1)
178431 1 4 years RENT [6000,32112) Source Verified debt_consolidation [10.58,13.14) [0,1) [0,1)
163565 0 5 years MORTGAGE [73061,85250) Not Verified car [10.58,13.14) [0,1) [0,1)
44299 1 n/a OWN [6000,32112) Verified medical [30.45, Inf) [1,2) [2,3)
127976 0 5 years RENT [32112,40019) Source Verified debt_consolidation [17.83,20.33) [0,1) [0,1)
92363 0 3 years MORTGAGE [48615,55050) Not Verified credit_card [20.33,23.09) [0,1) [0,1)

Table \(\ref{tab:head-of-binned-data}\) is a snapshot of the output of the function. It returns a data frame with binned continuous values while leaving the factor variables unchanged. The missing values of each variable are treated as a separate attribute called “Missing.” After binning, all variables have the factor data type, which resolves the inconsistencies in data types.

Distribution of binned Months since recent inquiry

Figure \(\ref{fig:var-af-bin}\) shows the frequency of each bin for the “Months since recent inquiry” variable. We can see that the function generally creates approximately equal-sized bins, as the numbers of observations in the bins are close. Although the function tends to output the specified number of equal-sized bins, this is not always the case; here the second-to-last bin has the smallest number of observations. In this case, we would consider combining the second-to-last bin with the third-to-last bin. Comparing figure \(\ref{fig:var-af-bin}\) to figure \(\ref{fig:var-bf-bin}\), the binned variable is more evenly distributed and the missing values are binned into their own group. The uneven bin sizes are due to the original distribution of this variable: most observations are concentrated between 0 and 5, and the number of observations decreases as months since recent inquiry increases. Sometimes the function outputs fewer or more bins than the specified number; this happens when the bins contain many outliers or values that would skew the distribution significantly. Even when the distribution of a variable is symmetric, the greedy binning algorithm may not create symmetric bins if there are many tied values. When the binning algorithm cannot allocate all values into the specified number of bins, it tends to create extra bins, which may contain fewer observations than the specified minimum. When the requested number of bins or minimum points per bin is too large for a variable, the binning algorithm tends to create fewer bins than specified or to ignore the minimum-points requirement. Users may want to adjust these arguments if the majority of variables were not binned as expected, or fine-tune specific variables, such as combining or separating certain bins, using the customized binning functions described in a later section.

Another binning algorithm that is quantile based performs similarly to the greedy algorithm. The two binning methods create different bins for the numerical variables, but the two logistic models built in the end have similar predictive power. We decided to use the greedy binning method for BinProfet(), and when time allows we will implement a binning function using a quantile-based method.
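For readers who want to experiment with the quantile-based idea before such a function is available, a rough sketch using base R is given below; it is only an illustration of equal-frequency binning, not the algorithm used by BinProfet().

# Illustrative quantile-based (equal-frequency) binning of one training variable
x      <- train$mths_since_recent_inq
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = 11), na.rm = TRUE))
qbins  <- cut(x, breaks = breaks, include.lowest = TRUE, right = FALSE)
qbins  <- addNA(qbins)                                # keep missing values as a level
levels(qbins)[is.na(levels(qbins))] <- "Missing"
table(qbins)                                          # frequency of each bin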

WOE Transformation of the Bins

After binning the mixed types of raw data into categorical variables, the categories can be converted to a continuous scale by performing the Weight of Evidence (WOE) transformation for each bin. In credit scoring, WOE is commonly expressed as \[WOE_i = \ln \left[ \frac{ p\left(i \mid y=1\right)} { p\left(i \mid y=0\right) } \right],\] where \(p\left(i \mid y=1\right)\) is the probability that an observation is in the \(i^{th}\) bin given that the observation corresponds to a response variable of \(1\), or an event that is true. Similarly, \(p\left(i \mid y=0\right)\) is the probability that an observation is in the \(i^{th}\) bin given that the observation corresponds to a response variable of \(0\), or an event that is false. Therefore, if \(\text{WOE}_i\) is positive, the probability of being in bin \(i\) is higher for observations with a response of \(1\) than for observations with a response of \(0\), and vice versa. Figure \(\ref{fig:WoePlot}\) shows the WOE values with the corresponding bins for the variable “Months since recent inquiry.” The WOE values replace the values of the variable in the original data. For example, if an observation had “Months since recent inquiry” equal to \(16\), the value is replaced with \(-0.367\).

WOE transformed Months since recent inquiry for each bin

The WOE for a bin measures the log of the ratio of the likelihoods of being in bin \(i\) given the two values of the response variable. Let \(\sum{0_i}\) and \(\sum{0_T}\) be the number of 0’s in bin \(i\) and in the entire dataset, respectively. Similarly, let \(\sum{1_i}\) and \(\sum{1_T}\) be the number of 1’s in bin \(i\) and in the entire dataset. Then

\[\begin{align} \label{eq:woe} WOE_i &= \ln \left[ \frac{ p\left(i \mid y=1\right)} { p\left(i \mid y=0\right) } \right] \nonumber\\ &= \ln \left[ \frac{\sum{1_i}/\sum{1_T}}{\sum{0_i}/\sum{0_T}} \right] = \ln \left[ \frac{\sum{1_i}\sum{0_T}}{\sum{0_i}\sum{1_T}} \right] = \ln \left[ \frac{\sum{1_i}}{\sum{0_i}} \bigg/ \frac{\sum{1_T}}{\sum{0_T}} \right] \nonumber\\ &= \ln \left[ \frac{\sum{1_i}}{\sum{0_i}}\right] - \ln \left[ \frac{\sum{1_T}}{\sum{0_T}} \right] \end{align}\]

From equation (\(\ref{eq:woe}\)), we can see that the closer the WOE for a group is to \(0\), the closer the odds of the target for that group are to the overall sample odds. The larger the absolute value of the Weight of Evidence for a bin, the more dissimilar the group is to the overall sample with respect to the odds of the target. If \(\text{WOE}_i\) is positive, then the odds for that group are larger than the overall sample odds. Similarly, if \(\text{WOE}_i\) is negative, then the odds for that group are smaller than the overall sample odds.

Let \(T_i\) and \(T\) be the total number of observations in group \(i\) and the entire dataset respectively. Then equation (\(\ref{eq:woe}\)) can be written as

\[\begin{align} \label{eq:woe2} WOE_i &= \ln \left[ \frac{\sum{1_i}}{\sum{0_i}}\right] - \ln \left[ \frac{\sum{1_T}}{\sum{0_T}} \right] \nonumber \\ &= \ln\left[ \frac{\sum{1_i}/T_i}{\sum{0_i}/T_i}\right] - \ln\left[ \frac{\sum{1_T}/T}{\sum{0_T}/T}\right] \nonumber \\ &= \ln \left[ \frac{p(y=1|i)}{1-p(y=1|i)} \right] - \ln \left[ \frac{p(y=1)}{1-p(y=1)} \right] \end{align}\]

Equation (\(\ref{eq:woe2}\)) shows that the Weight of Evidence and the logit function have a linear relationship. This makes Weight of Evidence transformed variables more appropriate for a logistic regression model. In addition, from equation (\(\ref{eq:woe2}\)), \(\text{WOE}_i\) can be interpreted as the change in the log-odds of an event occurring for a given group; it accounts for both the number of events in a characteristic and the number of events in the overall sample [\(\ref{Ryan}\)]. A group with a higher WOE corresponds to a higher probability of an event occurring. If the overall sample log-odds, \(\ln\left[\sum{1_T}/\sum{0_T}\right]\), is small, the group will be assigned a higher WOE.

Assume we want to calculate the WOE for a binary response variable. Let \(O_i\) denote the number of ones and \(Z_i\) denote the number of zeros in the \(i\text{-th}\) group. Let \(O_T\) and \(Z_T\) denote the total number of ones and zeros, respectively, in the overall sample. By Bayes’ theorem, \(\text{WOE}_i\) can also be written as

\[\begin{align*} WOE_i &= \ln \left[ \frac{ p\left(i \mid y=1\right)} { p\left(i \mid y=0\right) } \right] \\ &= \ln \left[\frac{p(y=1 \mid i) \cdot p(i)}{p(y=1)} \div \frac{p(y=0 \mid i) \cdot p(i)}{p(y=0)} \right] \\ &= \ln \left[\frac{p(y=1 \mid i)}{p(y=1)} \div \frac{p(y=0 \mid i)}{p(y=0)} \right]. \end{align*}\]

Then the \(\text{WOE}_i\) for the binary response variable can be estimated as

\[\begin{align} \label{eq:woe_0} WOE_i &= \ln \left[\frac{O_i/(O_i+Z_i)}{O_T/(O_T+Z_T)} \div \frac{Z_i/(O_i+Z_i)}{Z_T/(O_T+Z_T)} \right] \nonumber \\ &= \ln\left(\frac{O_i/O_T}{Z_i/Z_T}\right). \end{align}\]

The estimation in equation (\(\ref{eq:woe_0}\)) indicates that a group in which the response variable is all ones or all zeros would have an undefined Weight of Evidence. If this happens, one way to fix the problem is to use the bias correction formula [\(\ref{Scallan}\)]

\[\begin{equation} \label{eq:woe_c} WOE_i^c = \ln\left( \frac{(O_i+0.5)/(O_T+0.5)}{(Z_i+0.5)/(Z_T+0.5)} \right). \end{equation}\]
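To make equations (\(\ref{eq:woe_0}\)) and (\(\ref{eq:woe_c}\)) concrete, the sketch below computes the bin-level WOE by hand for one binned variable from the training data; it is an illustration of the formulas, not the Rprofet implementation.

# Hand computation of WOE per bin for one binned variable (illustration only)
bins <- binData$mths_since_recent_inq_Bins
y    <- binData$bad
O_i  <- tapply(y, bins, function(v) sum(v == 1))   # ones in each bin
Z_i  <- tapply(y, bins, function(v) sum(v == 0))   # zeros in each bin
O_T  <- sum(y == 1); Z_T <- sum(y == 0)            # totals in the sample

woe      <- log((O_i / O_T) / (Z_i / Z_T))         # plain estimate, equation (woe_0)
woe_corr <- log(((O_i + 0.5) / (O_T + 0.5)) /
                ((Z_i + 0.5) / (Z_T + 0.5)))       # bias-corrected, equation (woe_c)
round(cbind(WOE = woe, WOE_corrected = woe_corr), 4)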

After the WOE transformation, the values of all variables are continuous, which makes them easier to interpret intuitively. In Rprofet, we use the WOEProfet() function to complete the WOE transformation for the binned variables.

WOEdata = WOEProfet(data = binData, id = "ID", target = "bad")

This function returns four items, which will be discussed later, and each of them is important for the rest of the credit scoring process. With the WOE transformation in place, we can calculate each variable’s information value and determine how much information the variable provides when predicting the target variable.

Information Value and Variable Importance

Entropy

Some basic concepts of information theory help us understand variable importance and information value. Information theory provides answers to questions in communication theory, such as the ultimate data compression and the ultimate transmission rate of communication. Furthermore, it contributes to many other fields such as statistical physics, computer science, statistical inference, and probability and statistics [\(\ref{Cover&Thomas}\)]. An important concept in information theory is entropy, which measures the uncertainty of a single random variable. Since all the predictor variables in our case are transformed into Weight of Evidence, the random variables can be treated as discrete variables. Thus we focus on the discrete case in this section. The entropy \(H(X)\) of a discrete random variable \(X\) is

\[H(X) = - \sum_{x \in X}{p(x)\log_2 p(x)},\] where \(p(x)\) is the probability mass function of \(X\) [\(\ref{Cover&Thomas}\)]. Entropy is measured in bits when the base of the logarithm is 2. Changing the logarithm base only changes the measurement of entropy. Notice that entropy is the expected value of \(\log_2{\frac{1}{p(x)}}.\)

Extending the definition of entropy to a pair of random variables, the joint entropy \(H(X,Y)\) of discrete random variables \(X\) and \(Y\) with a joint distribution \(p(x,y)\) is

\[H(X,Y) = - \sum_{x \in X}\sum_{y \in Y}p(x,y)\log_2p(x,y).\] The conditional entropy of a random variable \(Y\), given that another random variable \(X\) is known, is defined as

\[\begin{align*} H(Y|X) &= \sum_{x \in X}p(x)H(Y|X=x) \\ &=-\sum_{x \in X}p(x)\sum_{y \in Y}p(y|x)\log{p(y|x)}\\ &=-\sum_{x \in X}\sum_{y \in Y}p(x,y)\log{p(y|x)}. \end{align*}\]

It also can be shown that

\[H(Y|X) = H(X,Y) - H(X).\] Interpreted intuitively, the conditional entropy of \(Y\) given \(X\) is the joint entropy of \(Y\) and \(X\) minus the entropy of \(X\) [\(\ref{Cover&Thomas}\)].
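A quick numerical check of this identity for an arbitrary \(2 \times 2\) joint distribution (the probabilities below are made up for illustration):

# Verify H(Y|X) = H(X,Y) - H(X) for a small, arbitrary joint distribution
p_xy <- matrix(c(0.3, 0.2, 0.1, 0.4), nrow = 2)    # rows index x, columns index y
p_x  <- rowSums(p_xy)

H_xy        <- -sum(p_xy * log2(p_xy))             # joint entropy H(X,Y)
H_x         <- -sum(p_x * log2(p_x))               # marginal entropy H(X)
H_y_given_x <- -sum(p_xy * log2(p_xy / p_x))       # uses p(y|x) = p(x,y) / p(x)
all.equal(H_y_given_x, H_xy - H_x)                 # TRUE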

Relative Entropy and Jeffreys’ Divergence

While entropy measures the average amount of information required to describe the random variable, the relative entropy measures the distance or the similarity between two distributions [\(\ref{Cover&Thomas}\)]. The relative entropy \(D(p||q)\) (also known as Kullback-Leibler (KL) distance) between two probability mass functions \(p(x)\) and \(q(x)\) is

\[D(p||q) = \sum_{x \in X}p(x)\log{\frac{p(x)}{q(x)}}.\] KL distance ranges from zero to infinity. It equals zero if and only if \(p(x)\) and \(q(x)\) are equal, and it approaches infinity if \(p(x)\) and \(q(x)\) share nothing in common.

KL distance is not a true distance measure between distributions since it is not symmetric and does not satisfy the triangle inequality. Jeffreys’ divergence \(J(p||q)\) is a symmetric variant of KL distance; it is defined as [\(\ref{Ryan}\)]

\[\begin{align} J(p||q) &= D(p||q) + D(q||p) \nonumber \\ &= \sum_{x \in X}p(x)\log{\frac{p(x)}{q(x)}} + \sum_{x \in X}q(x)\log{\frac{q(x)}{p(x)}} \nonumber \\ &= \sum_{x \in X}p(x)\log{\frac{p(x)}{q(x)}} - \sum_{x \in X}q(x)\log{\frac{p(x)}{q(x)}} \nonumber \\ &= \sum_{x \in X}(p(x)-q(x))\log{\frac{p(x)}{q(x)}}. \label{eq:kl} \end{align}\]

Jeffreys’ divergence is commonly called the information value (IV) in credit scoring [\(\ref{Anderson}\)]. IV compares the distribution of the target being 1 with that of the target being 0; in equation (\(\ref{eq:kl}\)), \(p(x)=p(i \mid y=1)\) and \(q(x)=p(i \mid y=0)\) for a binary response. IV measures the predictive power of a characteristic by calculating the information about the dependent variable explained by the independent variable.

Information Value

Information value (IV) can be calculated directly from the Weight of Evidence. The IV for each bin is calculated and summed to obtain the overall IV for the variable given its bins. This implies that IV depends on how the variable is binned and on the number of bins. Assume variable \(X\) has \(n\) distinct bins. Referring to equation (\(\ref{eq:woe}\)), the IV of \(X\) in equation (\(\ref{eq:kl}\)) can be written as

\[\begin{align*} IV_X &= \sum_{x \in X}(p(x)-q(x))\log{\frac{p(x)}{q(x)}} \\ &= \sum_{x \in X}\left[p(i \mid y=1)- p(i \mid y=0)\right] \cdot \ln{\frac{p(i\mid y=1)}{p(i\mid y=0)}} \\ &= \sum_{i=1}^{n}\left(\frac{\sum{1_i}}{\sum{1_T}} - \frac{\sum{0_i}}{\sum{0_T}}\right) \cdot WOE_i. \end{align*}\]
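Continuing the hand computation of WOE shown earlier, the IV of a single binned variable follows directly from the bin-level quantities. The sketch below assumes every bin contains both outcomes and is, again, only an illustration rather than the package code.

# Hand computation of IV for one binned variable (illustration only)
bins <- binData$mths_since_recent_inq_Bins
y    <- binData$bad
p_i  <- tapply(y, bins, function(v) sum(v == 1)) / sum(y == 1)   # p(i | y = 1)
q_i  <- tapply(y, bins, function(v) sum(v == 0)) / sum(y == 0)   # p(i | y = 0)
IV   <- sum((p_i - q_i) * log(p_i / q_i))
IV   # should be close to the value reported for this variable by WOEProfet()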

A characteristic with a higher IV indicates that the predictor variable separates the target distribution better and shares more information with the response variable. The IV is zero if the target distribution is the same as the non-target distribution. IV is always non-negative; predictor variables with IV less than \(0.10\) have weak predictive power, while variables with IV greater than \(0.30\) are viewed as strong characteristics that should be included in scoring models [\(\ref{Anderson}\)].

Therefore, information value is commonly used to rank the importance of predictor variables. The advantage of IV is that it can be used for all variable types once they are binned and have gone through the Weight of Evidence transformation. However, IV does not consider the association between predictor variables. It is important to examine multicollinearity when selecting multiple predictor variables. The variable clustering methods discussed in a later section can help reduce this issue.

The WOEProfet() function provides four items in the output:

  1. Bin: A data frame with ID, target, and binned predictor variables.
  2. WOE: A data frame with ID, target, and WOE transformed predictor variables.
  3. IV: A data frame of predictor variables and their information values.
  4. vars: A summary of WOE, target rate, and frequency of each bin for each of the predictor variables.

Sorting the information values of the predictor variables in decreasing order, the variables shown in table \(\ref{tab:WOE-ex}\) are the top \(10\) variables in the IV item, which are arguably the most important variables to include in our predictive model. For a large dataset with hundreds of potential predictor variables, we usually filter the variables by their information values at the beginning to remove insignificant variables.

head(WOEdata$IV[order(-WOEdata$IV$IV),],10)
Top 10 variables with the highest IV
Variable IV
36 acc_open_past_24mths_Bins 0.1237193
64 num_tl_op_past_12m_Bins 0.1131275
49 mths_since_recent_inq_Bins 0.0911000
71 total_bc_limit_Bins 0.0885029
38 bc_open_to_buy_Bins 0.0845925
69 tot_hi_cred_lim_Bins 0.0766005
47 mths_since_recent_bc_Bins 0.0750128
6 dti_Bins 0.0747749
3 annual_inc_Bins 0.0723737
8 inq_last_6mths_Bins 0.0696226

The vars item provides a summary for all binned variables. For example, the summary of the bins for the categorical variable “Verification Status” is shown in table \(\ref{tab:vars-in-WOEProfet}\).

vars output for Verification Status variable
verification_status_Bins verification_status_WOE TargetRate Freq
Not Verified -0.3175578 0.0789209 4967
Source Verified -0.0275140 0.1027481 6550
Verified 0.3100625 0.1383002 4483

The IVfilter argument of the Var_select() function allows us to choose a cutoff point for information values. For instance, we choose to keep the variables with information values greater than or equal to \(0.02\), which reduces the number of predictor variables from \(72\) to \(29\). Table \(\ref{tab:IVfilter-tb}\) shows the variables kept after filtering.

subWOEdata = Var_select(WOEdata, "ID", "bad", IVfilter = 0.02)
Filtered variables with IV greater than 0.02
Variable IV
acc_open_past_24mths_Bins 0.1237193
num_tl_op_past_12m_Bins 0.1131275
mths_since_recent_inq_Bins 0.0911000
total_bc_limit_Bins 0.0885029
bc_open_to_buy_Bins 0.0845925
tot_hi_cred_lim_Bins 0.0766005
mths_since_recent_bc_Bins 0.0750128
dti_Bins 0.0747749
annual_inc_Bins 0.0723737
inq_last_6mths_Bins 0.0696226
total_rev_hi_lim_Bins 0.0655617
mo_sin_rcnt_tl_Bins 0.0596471
tot_cur_bal_Bins 0.0587468
verification_status_Bins 0.0583352
avg_cur_bal_Bins 0.0583251
mo_sin_rcnt_rev_tl_op_Bins 0.0557787
mort_acc_Bins 0.0497058
mo_sin_old_rev_tl_op_Bins 0.0425561
purpose_Bins 0.0389740
home_ownership_Bins 0.0373923
mo_sin_old_il_acct_Bins 0.0344106
bc_util_Bins 0.0320390
percent_bc_gt_75_Bins 0.0315360
emp_length_Bins 0.0313749
revol_bal_Bins 0.0250993
open_il_12m_Bins 0.0219061
revol_util_Bins 0.0211043
mths_since_rcnt_il_Bins 0.0208983
open_rv_24m_Bins 0.0202870

Choosing a cutoff point for IV is an intuitive way to perform variable selection when building a credit scorecard. One drawback of this method is that IV does not take into account the association between predictor variables. The Rprofet package also includes functions that perform variable selection with variable clustering techniques.

Variable Clustering

Clustering analysis aims to group similar objects together. Measuring similarity and dissimilarity between two observations, and later between two clusters of observations, is needed to perform clustering analysis [\(\ref{Johnson}\)]. Many methods exist to measure dissimilarity; one simple and common method is the Euclidean distance. The Euclidean distance \(D_{a,b}\) between two observations \((a,b)\) is the square root of the sum of the squared differences between the two observations in the \(p\)-dimensional sample space, calculated using equation (\(\ref{eq:ED}\)).

\[\begin{equation} \label{eq:ED} D_{a,b}=\sqrt{(x_{a,1}-x_{b,1})^2+(x_{a,2}-x_{b,2})^2+...+(x_{a,p}-x_{b,p})^2} \end{equation}\]

A disadvantage is that if one variable has much larger values than another, the Euclidean distance will be dominated by the variable with larger values. To avoid this problem, the variables are standardized to Z-scores before calculating the Euclidean distance between them. This is known as the standardized Euclidean distance, and it is a good choice for measuring dissimilarities since all the values are on the same scale [\(\ref{Johnson}\)]. Clustering analysis can be a useful tool for determining the strength of the relationship between observations [\(\ref{Ed}\)]. Similar observations are grouped into the same cluster, and dissimilar observations fall into different clusters.

To perform variable clustering, the distance is calculated between variables instead of between observations. A dataset with \(N\) observations and \(M\) variables can be thought of as an \(N \times M\) matrix. The matrix is then transposed to an \(M \times N\) matrix so that each row vector represents one variable and each column vector represents one observation. In this way, equation (\(\ref{eq:ED}\)) calculates the distance between two rows of the matrix, that is, between variables [\(\ref{Ed}\)]. The drawback of clustering analysis is the restriction on variable type. Continuous and categorical variables need different dissimilarity measures, and missing values need to be discarded before clustering analysis. In the package, clustering analyses are applied to the Weight of Evidence transformed variables. Figure \(\ref{fig:data-ex}\) shows an example of a transposed dataset with Weight of Evidence transformed variables, where \(W_{1,i_g}\) is the Weight of Evidence value for the \(i\)-th variable given that the original value for the first observation is in the \(g\)-th group [\(\ref{Ed}\)].

WOE transformed dataset A transpose

It follows that the Euclidean distance for two variables \((i,j)\) in the transposed data is shown in equation (\(\ref{eq:EDwoe}\)). Since the values of Weight of Evidence transformed variables are continuous and on the same scale, it is not necessary to standardize the variables. Missing values are handled by the transformation since they are grouped together and have their own Weight of Evidence values.

\[\begin{equation} \label{eq:EDwoe} D_{i,j}=\sqrt{(W_{1,i_g}-W_{1,j_h})^2+(W_{2,i_g}-W_{2,j_h})^2+...+(W_{N,i_g}-W_{N,j_h})^2} \end{equation}\]

Recalling equation (\(\ref{eq:woe}\)) from section \(\ref{woe-sec}\), the difference in the first parenthesis of equation (\(\ref{eq:EDwoe}\)) can be written as equation (\(\ref{eq:edwoeD}\)):

\[\begin{align} W_{1,i_g}-W_{1,j_h} &= \ln\left[\frac{\sum{1_g}}{\sum{0_g}}\right] - \ln\left[\frac{\sum{1_T}}{\sum{0_T}}\right] - \left(\ln\left[\frac{\sum{1_h}}{\sum{0_h}}\right] - \ln\left[\frac{\sum{1_T}}{\sum{0_T}}\right]\right) \nonumber \\ &=\ln\left[\frac{\sum{1_g}}{\sum{0_g}}\right] - \ln\left[\frac{\sum{1_h}}{\sum{0_h}}\right], \label{eq:edwoeD} \end{align}\]

where \(\sum{1_g}\) is the number of responses equal to \(1\) in the \(g\)-th group of the \(i\)-th variable and \(\sum{1_T}\) is the total number of responses equal to \(1\) in the overall sample. From equation (\(\ref{eq:edwoeD}\)), \(W_{1,i_g}-W_{1,j_h}\) measures the difference in the log-odds of the \(i\)-th and \(j\)-th variables given the bins into which the first observation’s original values fall. When clustering the Weight of Evidence transformed variables, the dissimilarity between the variables is still accounted for while the variables are simultaneously grouped according to the information they contain about the response variable [\(\ref{Ed}\)].
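In R, the pairwise distances of equation (\(\ref{eq:EDwoe}\)) can be obtained by transposing the WOE columns and calling dist(); the sketch below assumes the WOE item of the WOEProfet object holds the ID, target, and WOE columns as described earlier.

# Illustrative Euclidean distances between WOE-transformed variables
woe_cols <- WOEdata$WOE[ , !(names(WOEdata$WOE) %in% c("ID", "bad"))]
d        <- dist(t(woe_cols), method = "euclidean")   # rows of t() are the variables
as.matrix(d)[1:3, 1:3]                                # a few pairwise distances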

After calculating the distance between observations, we can use different clustering methods to group observations with like information together. Two basic approaches to clustering are hierarchical and non-hierarchical, both of which are unsupervised statistical learning methods [\(\ref{Ed}\)]. The package provides two functions for two different variable clustering algorithms: WOEclust_hclust() performs variable clustering using the standard hierarchical clustering technique, while WOEclust_kmeans() implements k-means clustering, a non-hierarchical method. Hierarchical clustering and k-means clustering are common data mining methods used on large datasets.

K-means Clustering

K-means clustering starts by setting seeds, which is a process of selecting a set of points from the data equal to the number of clusters we wish to have in the end. Each of these seed points is known as a centroid, which represents the mean or center of a cluster, and the points are usually chosen at random [\(\ref{Ed}\)]. Next, all of the remaining points in the data are assigned to the closest centroid based on the dissimilarity measure used, forming the clusters. Thus, the final clusters depend on the choice of seed points, and the result of k-means clustering varies since the seeds are chosen at random. Users may want to run the clustering multiple times in order to compare the results and obtain a reasonable one.
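The idea can be illustrated with base R's kmeans() applied to the transposed WOE matrix, as sketched below; this is only a conceptual stand-in, since WOEclust_kmeans() (described next) relies on the ClustOfVar machinery rather than ordinary k-means.

# Conceptual illustration of k-means clustering of variables (not the package method)
set.seed(1)                                            # centroids are seeded at random
woe_mat <- t(as.matrix(subWOEdata$WOE[ , -c(1, 2)]))   # drop ID and target; rows = variables
km      <- kmeans(woe_mat, centers = 10, nstart = 25)
split(rownames(woe_mat), km$cluster)                   # variables grouped by cluster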

The WOEclust_kmeans() function in Rprofet references the kmeansvar() function from the ClustOfVar package [\(\ref{clustov}\)]. The clustering methods in the ClustOfVar package can be used for a mixture of numeric and categorical variables. The k-means type partitioning algorithm in the kmeansvar() function is based on a principal component method for a mixture of qualitative and quantitative variables (PCAMIX). PCAMIX includes ordinary PCA and MCA as special cases. A cluster of variables is defined as homogeneous when the variables in the cluster are strongly linked to a central quantitative synthetic variable [\(\ref{clustovpaper}\)]. The link is measured by the squared Pearson correlation for numeric variables and by the correlation ratio for categorical variables. The synthetic variable is the first principal component of PCAMIX for all the variables in the cluster. The clustering algorithm aims to maximize this homogeneity criterion.

Returning to our example, we arbitrarily choose to form 10 clusters. The number of clusters can be chosen based on the user’s judgment or an optimization technique.

set.seed(4172018)
km_cluster = WOEclust_kmeans(subWOEdata, "ID", "bad", num_clusts = 10)
Head of first 10 rows of clustered variables
Variable Group IV
1 acc_open_past_24mths_WOE 1 0.1237193
6 dti_WOE 1 0.0747749
18 num_tl_op_past_12m_WOE 2 0.1131275
9 inq_last_6mths_WOE 2 0.0696226
27 total_bc_limit_WOE 3 0.0885029
4 bc_open_to_buy_WOE 3 0.0845925
28 total_rev_hi_lim_WOE 3 0.0655617
11 mo_sin_old_rev_tl_op_WOE 3 0.0425561
22 purpose_WOE 3 0.0389740
23 revol_bal_WOE 3 0.0250993

Table \(\ref{tab:Kmeans}\) shows a portion of the data frame returned from the function. Each group represents one cluster. Variables that share more similarities are grouped together and ranked by information value from high to low. In this way, we can go through and select the top \(n\) variables from each cluster to put into our predictive model. There are various measures for selecting variables from clusters. One simple and effective way is to select the top variables from each cluster based on their information values, so that the variables with high predictive power are kept while avoiding multicollinearity issues. Variable clustering is able to reduce the effect of multicollinearity among predictor variables since similar variables are grouped together and dissimilar variables fall into different groups. For our example, we have selected the top 2 variables (or the top 1 variable if the cluster has only one variable) from each cluster, shown in table \(\ref{tab:selectedVars}\). Notice that we have now gone from 72 possible predictors to only 17 predictors for model building.

top2_km <- km_cluster %>% 
  group_by(Group) %>%
  top_n(n = 2)
Variables selected by K-means clustering
Variable Group IV
acc_open_past_24mths_WOE 1 0.1237193
dti_WOE 1 0.0747749
num_tl_op_past_12m_WOE 2 0.1131275
inq_last_6mths_WOE 2 0.0696226
total_bc_limit_WOE 3 0.0885029
bc_open_to_buy_WOE 3 0.0845925
mths_since_recent_bc_WOE 4 0.0750128
mths_since_recent_inq_WOE 5 0.0911000
mo_sin_rcnt_tl_WOE 5 0.0596471
tot_hi_cred_lim_WOE 6 0.0766005
annual_inc_WOE 6 0.0723737
verification_status_WOE 7 0.0583352
emp_length_WOE 8 0.0313749
bc_util_WOE 9 0.0320390
percent_bc_gt_75_WOE 9 0.0315360
open_il_12m_WOE 10 0.0219061
mths_since_rcnt_il_WOE 10 0.0208983

Hierarchical Clustering

Hierarchical clustering procedures group data points into a nested sequence of clusters. There are various hierarchical clustering methods available; a commonly used one is referred to as single-link clustering, and a common single-link method is the nearest neighbor method [\(\ref{Johnson}\)]. For a dataset with \(N\) observations, the clustering method starts with \(N\) clusters each containing a single observation. The two points that have the smallest distance are then joined into one cluster. Next, the method defines the distance between the newly formed cluster and each of the remaining points as the minimum distance between the two points in the cluster and that point. Clusters keep being combined one at a time based on the selected distance measure, with the dissimilarity between two clusters defined as the distance between their two closest points. This process continues until it reaches a threshold, forming anywhere from \(1\) to \(N\) clusters. The number of clusters can be determined by plotting the hierarchical tree diagram or by statistical methods such as Beale’s F-Statistic. Oftentimes, simply choosing the number of clusters at the outset is also effective when the user has a basic idea of how many clusters are appropriate [\(\ref{Johnson}\)].
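A minimal sketch of single-link clustering applied to the transposed WOE matrix is given below for intuition; WOEclust_hclust(), used next, wraps the analogous steps for a WOEProfet object and may use a different linkage.

# Conceptual single-link hierarchical clustering of WOE variables
woe_mat <- t(as.matrix(subWOEdata$WOE[ , -c(1, 2)]))   # drop ID and target; rows = variables
hc      <- hclust(dist(woe_mat), method = "single")    # nearest neighbor linkage
groups  <- cutree(hc, k = 10)                          # cut the tree into 10 clusters
split(names(groups), groups)                           # variables grouped by cluster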

For hierarchical clustering, we use the WOEclust_hclust() function, and the process is similar to the k-means clustering. For instance, we again choose to create 10 clusters. The variables with the top 2 largest information values in each cluster are chosen, shown in table \(\ref{tab:selectedVars2}\). The hierarchical method has chosen 18 out of the 29 variables, and its result is similar to that of the k-means method.

h_cluster <- WOEclust_hclust(subWOEdata, 'ID', 'bad', num_clusts = 10) 

top2_h <- h_cluster %>% 
  group_by(Group) %>%
  top_n(n = 2)
Variables selected by hierarchical clustering
Variable Group IV
acc_open_past_24mths_WOE 1 0.1237
num_tl_op_past_12m_WOE 1 0.1131
mths_since_recent_inq_WOE 2 0.0911
inq_last_6mths_WOE 2 0.0696
total_bc_limit_WOE 3 0.0885
bc_open_to_buy_WOE 3 0.0846
tot_hi_cred_lim_WOE 4 0.0766
tot_cur_bal_WOE 4 0.0587
mths_since_recent_bc_WOE 5 0.0750
mo_sin_rcnt_tl_WOE 5 0.0596
dti_WOE 6 0.0748
annual_inc_WOE 7 0.0724
mo_sin_old_rev_tl_op_WOE 7 0.0426
verification_status_WOE 8 0.0583
bc_util_WOE 9 0.0320
percent_bc_gt_75_WOE 9 0.0315
open_il_12m_WOE 10 0.0219
mths_since_rcnt_il_WOE 10 0.0209

Manual Fine Tuning of the Bins

Recall that the first step in the credit scoring process is applying coarse binning to the independent variables, but coarse binning does not always group variables as expected. Fine-tuning the bins for certain variables may be one of the most important steps in the credit scoring process. After the variable selection process, we focus only on an important subset of independent variables, so fine-tuning the bins for the selected variables is not too time consuming.

It is important to fine-tune the bins so that they follow a logical order. For example, when a customer opens more accounts in the past 2 years, their likelihood of defaulting would logically increase. Visualizing the initial bins and their corresponding WOE values helps us fine-tune the bins more efficiently. The WOEplotter() function allows us to visualize each of the bins for a specific variable.

WOEplotter(binData, 'bad', 'acc_open_past_24mths_Bins')
WOEplotter Output
result <- WOE_customNum(train, "acc_open_past_24mths", "ID", 
                             "bad", breaks = c(0,2,3,4,5,6,7,9,Inf), plot = T)
Customized Binning for Numeric Variable

Figure \(\ref{fig:WOEPlotter}\) shows the result of the WOEplotter() function. Ideally, each bin should contain roughly the same number of observations, but in reality some bins contain more and some contain fewer, as shown in the third plot of figure \(\ref{fig:WOEPlotter}\). The trend indicates that as the number of accounts increases, so does the likelihood of defaulting, which makes logical sense. Sometimes we would like to rearrange, split, or combine specific bins so that the bins of the variable and the target have a monotonic relationship. Note that rearranging the bins does not affect anything since all variables were transformed to categorical.

In our case, we want to combine the first two bins since they behave similarly and contain fewer observations. The WOE_customNum() function takes the original dataset with user-specified breaks to create customized bins for a variable. Using the function for a continuous variable, we can adjust the bins toward a more logical trend and a more even distribution based on our judgment. When the plot argument equals TRUE, the function generates a visualization of the customized binned variable. The fine-tuned bins are saved in R to replace the initial bins.

In figure \(\ref{fig:custom-num}\), we observe that the bins are more evenly distributed while maintaining the logical trend in the WOE and the target rate. Using these bins in our predictive model may produce better results than the original bins. Oftentimes, changing to larger bins leads to a lower information value due to some information loss, which makes intuitive sense. In this example, the information value of the variable “Accounts open in past 24 months” decreased slightly from \(0.1237\) to \(0.1233\).

The Rprofet package also has the ability to fine-tune the bins for categorical variables. Originally, the BinProfet() function creates a bin for each unique category of a categorical variable. For a variable with dozens of categories, this can create many bins with only a few observations in each, in which case we would want to combine categories that behave similarly. For instance, we look at the “Home ownership” variable. Figure \(\ref{fig:WOEPlotter2}\) is an output from the WOEplotter() function, and it shows that there are three categories: Mortgage, Own, and Rent. Notice that the Own bin has far fewer observations compared to the other two bins, and it behaves similarly to the Rent bin. It is not necessary to combine the bins, but for this example we combine Own and Rent using the WOE_customFac() function. Like the customized binning function for numeric variables, this function also bins the original dataset based on user-defined buckets (levels). Figure \(\ref{fig:custom-char}\) shows the output of the function. There is a slight decrease in the information value of the home ownership variable, but the bins perform similarly in a simpler format.

WOEplotter(binData, "bad", "home_ownership_Bins")
WOEplotter Output
WOE = WOE_customFac(train, "home_ownership", "ID", "bad", new_levels = c(1,2,2)
                    ,plot = T)
Customized Binning for Factor Variable

In our workflow example, we go through each of the selected variables using the WOEplotter() function and fine-tune them if necessary. Going through each of the selected predictor variables is a good practice after variable selection, and oftentimes most of the variables need some fine-tuning.

#acc_open_past_24mths
#WOEplotter(binData, "bad", 'acc_open_past_24mths_Bins')
acc_open_past_24mths_new <- WOE_customNum(train, "acc_open_past_24mths", 
                                               "ID", "bad", 
                                               breaks = c(0,2,3,4,5,6,7,9,Inf)
                                          )$NewBin
#dti
#WOEplotter(binData, 'bad', "dti_Bins")
dti_new <- WOE_customNum(train, "dti", "ID", 'bad',
                              breaks = c(0,11,16,20,25,30,Inf))$NewBin
#num_tl_op_past_12m
#WOEplotter(binData, "bad", 'num_tl_op_past_12m_Bins')
num_tl_op_past_12m_new <- WOE_customNum(train, "num_tl_op_past_12m", "ID", 
                                             "bad", 
                                             breaks = c(0,1,2,3,4,5,8,Inf)
                                             )$NewBin

The code chunk above shows the fine-tuning of the first three predictor variables. After the full fine-tuning process, we want to replace the corresponding original binned variables with the new bins. This process can be tedious; to avoid repetitive coding, we recommend naming the fine-tuned variables in a consistent way, as in the code above. We then start with a data frame containing the id and target variables and left join each of the customized binned variables on the id column. The binned data has now been updated.

#get the id and target variables from the original binned data
binData_id_tar <- data.frame(binData[,c(1,2)])
#extract the selected variable names
vars_name = top2_km$Variable
#rename the selected variables to match the fine-tuned objects (we recommend 
#naming the variables in a consistent way, as in the chunk above)
vars = gsub("_WOE","_new",vars_name)
#helper function that retrieves each fine-tuned data frame by its name
Fun2 <- function(vars) {
  get(vars)
}
#obtain a list of lists, each sub-list contains a dataframe with fine-tuned 
#variables and ID variable
var_list = lapply(vars, Fun2)
#left join each sub-list from the list above on ID variable
binData_new <- binData_id_tar %>%
  left_join(var_list %>% purrr::reduce(left_join, by='ID'), by="ID")

After fine-tuning, we need to run the WOEProfet() function again on the new bins in order to recalculate the WOE values and information values for the customized variables. We now obtain a new WOEProfet object with only 17 independent variables.

WOEdata_new = WOEProfet(binData_new, "ID", "bad")

Logistic Regression Model and Stepwise Variable Selection

Logistic Regression Model

Logistic regression is commonly used to develop scorecards when the response is a binary (good/bad) qualitative variable. For example, the response variable “bad” in the Lending Club data falls into one of two categories, yes or no, coded as \(1\) and \(0\). Logistic regression models the probability that the response variable belongs to a category using the logistic function. The output of the logistic function is bounded between 0 and 1, which is well suited for predicting the probability of an outcome. The logistic regression model is [\(\ref{ISLR2}\)] \[p(X) = \frac{e^{\beta_0+\beta_1X_1+...+ \beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+...+ \beta_pX_p}},\] where \(p(X) = P(Y=1)\) at values \(X = (X_1, ..., X_p)\) of \(p\) predictors. Thus, logistic regression models the relationship between \(p(X) = P(Y=1|X)\) and \(X\). Equivalently, the logit or log-odds of the probability \(p(X)\) is a linear combination of the predictor variables: \[ logit(p(X)) = \ln\frac{p(X)}{1-p(X)} = \beta_0 + \beta_1X_1 +...+ \beta_pX_p.\] When interpreting a logistic regression model, increasing \(X_1\) by one unit changes the log-odds by \(\beta_1\) with all other predictor variables held fixed.
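As a small numerical illustration of the link function with made-up coefficients \(\beta_0 = -2\) and \(\beta_1 = 0.5\):

# Log-odds and probabilities under hypothetical coefficients (illustration only)
beta0 <- -2; beta1 <- 0.5
x1    <- c(1, 2)                  # a one-unit increase in X1
eta   <- beta0 + beta1 * x1       # linear predictor, i.e., the log-odds
p     <- plogis(eta)              # logistic function: exp(eta) / (1 + exp(eta))
diff(eta)                         # change in log-odds equals beta1 = 0.5
p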

WOE Transformed Variables in Logistic Regression Models

When a logistic regression model is fitted with a single Weight of Evidence transformed variable and a binary target variable, the model’s intercept is always the sample log-odds of the target rate and the slope is always equal to \(1\). For instance, we fit a simple logistic regression model with the WOE transformed variable “dti,” denoted model 1. The model formula obtained is \[\text{model 1:} \hspace{5mm} \ln{\frac{\hat{p(X)}}{1-\hat{p(X)}}}=-2.140 + 1 \cdot \text{(dti\_WOE)}.\] Figure \(\ref{fig:dti}\) shows the WOE and target rate for each bin of the variable “dti.” Recall that a logistic regression model estimates the log-odds of a probability; in our example, it estimates the log-odds of the target rate. The round points in figure \(\ref{fig:WOE-glm}\) are the log-odds of the target rates corresponding to each of the WOE values of “dti.” The red straight line goes through the intercept with a slope equal to the coefficient estimate for the predictor in model 1, and this line also goes through each of the round points.

WOE and Target Rate for each attribute of dti

Then another simple logistic regression model is fitted with WOE transformed variable “bc_util,” denoted model 2. The model formula obtained is \[\text{model 2:} \hspace{5mm} \ln{\frac{\hat{p(X)}}{1-\hat{p(X)}}}=-2.140 + 1 \cdot \text{(bc\_util\_WOE)},\]

which is exactly the same as model 1. Figure \(\ref{fig:bc-util}\) shows the WOE and target rate for each bin of the variable “bc_util.” The triangles in figure \(\ref{fig:WOE-glm}\) are the log-odds of the target rates corresponding to each of the WOE values of “bc_util.” The red line also goes through each of the triangles since the two models have the same intercept and coefficient estimate. This showcases an interesting relationship between WOE transformed variables and logistic regression.

WOE and Target Rate for each attribute of bc_util

Note that the intercept in the two models is the log-odds of the target rate for the overall sample: \(\ln{\frac{0.1053}{1-0.1053}}=-2.140.\) When we fit a logistic regression model with more than one Weight of Evidence transformed variable, the intercept of the model remains the sample log-odds of the target rate, but the coefficient estimates for the independent variables are always less than \(1.\) Let’s fit a third model with both “dti_WOE” and “bc_util_WOE” as independent variables, denoted model 3: \[\text{model 3:} \hspace{5mm} \ln{\frac{\hat{p(X)}}{1-\hat{p(X)}}}=-2.140+0.933\cdot\text{(dti\_WOE)} + 0.828 \cdot \text{(bc\_util\_WOE)}.\] The intercept of model 3 stays the same, and the estimated coefficients of the two variables are less than \(1.\) The orange line in the right plot of figure \(\ref{fig:WOE-glm}\) goes through the intercept with a slope equal to the coefficient estimate of “dti_WOE” in model 3, and the green line goes through the intercept with a slope equal to the coefficient estimate of “bc_util_WOE.” Both coefficient estimates are close to \(1,\) which indicates that the two predictor variables are not highly correlated.

Generalized linear models with WOE variables
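These relationships can be verified directly from the WOE-transformed training data; the sketch below assumes the WOE item of the WOEProfet object contains the columns bad, dti_WOE, and bc_util_WOE.

# Check the intercept/slope behavior of logistic models on WOE variables
m1 <- glm(bad ~ dti_WOE, data = WOEdata$WOE, family = "binomial")
m2 <- glm(bad ~ bc_util_WOE, data = WOEdata$WOE, family = "binomial")
m3 <- glm(bad ~ dti_WOE + bc_util_WOE, data = WOEdata$WOE, family = "binomial")

coef(m1); coef(m2)    # identical intercepts, slopes of (approximately) 1
log(mean(WOEdata$WOE$bad) / (1 - mean(WOEdata$WOE$bad)))   # sample log-odds of the target
coef(m3)              # same intercept, both slopes now below 1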

Stepwise Variable Selection

We have discussed using clustering analysis to group variables that contain like information in order to select significant variables. Another method, commonly known as stepwise regression, runs the regression multiple times to find the best possible subset regression model from all available options. This technique is computationally intensive, especially when there is a large number of predictor variables in the data [\(\ref{Siddiqi}\)]. Three common stepwise regression techniques are forward selection, backward elimination, and stepwise selection. Stepwise selection, which combines the forward selection and backward elimination procedures, is a commonly used technique that adds and removes variables from the model in each step until the best subset is reached. Johnson [\(\ref{Johnson}\)] states that the backward elimination procedure selects the best model when the number of variables in the full model does not exceed \(15\); otherwise the stepwise selection procedure is recommended.

There are various measures developed for assessing model adequacy. Akaike’s information criterion (\(AIC\)) is a popular model selection criterion that penalizes adding predictor variables to a model. The \(AIC_p\) criterion can be calculated by [\(\ref{Kutner}\)] \[AIC_p = n\ln SSE_p - n\ln n + 2p,\] where \(n\) is the number of observations in the data and \(p\) is the number of predictors in the model. Notice that the first term \(n\ln SSE_p\) decreases as \(p\) increases since \(SSE\) decreases as the number of predictors in the model increases. The second term is fixed and the third term increases as \(p\) increases. Therefore, models with a small \(SSE_p\) and not too large \(p\) are preferred by the \(AIC\) criterion. A model with a small \(AIC_p\) is considered to have a good fit.

The step() function in the stats package [\(\ref{stats}\)] performs stepwise model selection, relying on AIC to quantify the amount of information lost due to model simplification. If the data has many variables, the model selection process can be slow; thus, cutting variables with small IV before using this function is recommended. For our workflow example, a full logistic regression model was fitted using all variables in the updated WOEProfet object created after fine-tuning the variables selected by k-means clustering. The stepwise selection procedure then selected \(14\) of the \(17\) predictor variables to be included in the reduced logistic model. We may want to check the summary of the reduced model to determine whether all of these predictors are significant.

fullModel = glm(bad~., data = WOEdata_new$WOE[,-1], family = "binomial")
stepModel = step(fullModel, direction = "both", trace = F)
summary(stepModel)
## 
## Call:
## glm(formula = bad ~ acc_open_past_24mths_WOE + dti_WOE + num_tl_op_past_12m_WOE + 
##     inq_last_6mths_WOE + bc_open_to_buy_WOE + mths_since_recent_bc_WOE + 
##     mths_since_recent_inq_WOE + mo_sin_rcnt_tl_WOE + tot_hi_cred_lim_WOE + 
##     annual_inc_WOE + verification_status_WOE + emp_length_WOE + 
##     bc_util_WOE + mths_since_rcnt_il_WOE, family = "binomial", 
##     data = WOEdata_new$WOE[, -1])
## 
## Coefficients:
##                           Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)               -2.14196    0.02733 -78.367 < 0.0000000000000002 ***
## acc_open_past_24mths_WOE   0.50099    0.11380   4.403      0.0000106983388 ***
## dti_WOE                    0.69053    0.10215   6.760      0.0000000000138 ***
## num_tl_op_past_12m_WOE     0.41095    0.13284   3.094             0.001978 ** 
## inq_last_6mths_WOE         0.29757    0.11899   2.501             0.012395 *  
## bc_open_to_buy_WOE         0.53912    0.12389   4.352      0.0000135188410 ***
## mths_since_recent_bc_WOE   0.56895    0.11434   4.976      0.0000006496098 ***
## mths_since_recent_inq_WOE  0.54960    0.11848   4.639      0.0000035029202 ***
## mo_sin_rcnt_tl_WOE        -0.29039    0.16458  -1.764             0.077670 .  
## tot_hi_cred_lim_WOE        0.79068    0.12016   6.580      0.0000000000469 ***
## annual_inc_WOE             0.38462    0.12088   3.182             0.001464 ** 
## verification_status_WOE    0.45885    0.11300   4.061      0.0000489097099 ***
## emp_length_WOE             0.56914    0.15463   3.681             0.000233 ***
## bc_util_WOE                0.78245    0.18521   4.225      0.0000239227108 ***
## mths_since_rcnt_il_WOE     0.91046    0.23597   3.858             0.000114 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10771  on 15999  degrees of freedom
## Residual deviance: 10086  on 15985  degrees of freedom
## AIC: 10116
## 
## Number of Fisher Scoring iterations: 5

Based on the model summary, all the independent variables are significant at the \(0.05\) level except “mo_sin_rcnt_tl_WOE”, which is also the only variable with a negative estimated coefficient. Since all the predictor variables were transformed to \(WOE\), all of them should have positive estimated coefficients in the logistic model. The negative sign is likely caused by multicollinearity; in other words, some other variables in the model are highly correlated with the variable that has the negative coefficient. Therefore, variables with negative coefficient estimates should be removed from the model. Furthermore, the larger the coefficient estimate a variable has, the less correlated the variable is with the other variables. Figure \(\ref{fig:corrplot}\) shows the correlation plot for predictor variables in the model with correlation coefficients greater than \(0.3.\) From the plot we observe that “mo_sin_rcnt_tl_WOE” is highly correlated with five other variables in the model. Conversely, the variable “mths_since_rcnt_il_WOE” has a high coefficient estimate of \(0.910\) in the summary, which indicates that it has low correlation with the other independent variables, and thus it does not appear in figure \(\ref{fig:corrplot}\).

Correlation plot for highly correlated variables
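The correlation matrix behind figure \(\ref{fig:corrplot}\) can be reproduced along the following lines; the corrplot package and the handling of the \(0.3\) threshold are assumptions made for this sketch.

# Correlations among the WOE variables retained by the stepwise model (illustration)
vars_in_model <- names(coef(stepModel))[-1]            # drop the intercept
cor_mat  <- cor(WOEdata_new$WOE[ , vars_in_model])
off_diag <- cor_mat; diag(off_diag) <- 0               # ignore self-correlations
keep     <- apply(abs(off_diag) > 0.3, 1, any)         # variables with some |r| > 0.3
corrplot::corrplot(cor_mat[keep, keep], method = "number")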

The Var_select() function is able to select or remove specified variables from a WOEProfet object by inputting the variable names or the variables’ indices based on the IV item of the object. We now select all the variables chosen by the stepwise model selection except the variable “mo_sin_rcnt_tl_WOE”, and then fit a new logistic regression model.

selected_var = Var_select(WOEdata_new, "ID", "bad", 
                          varcol = c("acc_open_past_24mths_Bins","num_tl_op_past_12m_Bins",
                                     "mths_since_recent_inq_Bins","bc_open_to_buy_Bins",
                                     "mths_since_recent_bc_Bins","dti_Bins", 
                                     "tot_hi_cred_lim_Bins", "annual_inc_Bins", 
                                     "inq_last_6mths_Bins","verification_status_Bins",
                                     "bc_util_Bins","emp_length_Bins",
                                     "mths_since_rcnt_il_Bins"))

new_Model = glm(bad~., data = selected_var$WOE[,-1], family = "binomial")
summary(new_Model)
## 
## Call:
## glm(formula = bad ~ ., family = "binomial", data = selected_var$WOE[, 
##     -1])
## 
## Coefficients:
##                           Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)               -2.14224    0.02733 -78.373 < 0.0000000000000002 ***
## acc_open_past_24mths_WOE   0.49194    0.11350   4.334      0.0000146181789 ***
## num_tl_op_past_12m_WOE     0.30220    0.11757   2.570             0.010158 *  
## mths_since_recent_inq_WOE  0.52192    0.11727   4.451      0.0000085610092 ***
## bc_open_to_buy_WOE         0.53554    0.12381   4.325      0.0000152289930 ***
## mths_since_recent_bc_WOE   0.54307    0.11295   4.808      0.0000015245292 ***
## dti_WOE                    0.68814    0.10213   6.738      0.0000000000161 ***
## tot_hi_cred_lim_WOE        0.79318    0.12012   6.603      0.0000000000402 ***
## annual_inc_WOE             0.38089    0.12086   3.152             0.001624 ** 
## inq_last_6mths_WOE         0.28694    0.11890   2.413             0.015813 *  
## verification_status_WOE    0.46825    0.11285   4.149      0.0000333362987 ***
## bc_util_WOE                0.79317    0.18504   4.286      0.0000181635634 ***
## emp_length_WOE             0.56528    0.15453   3.658             0.000254 ***
## mths_since_rcnt_il_WOE     0.91380    0.23590   3.874             0.000107 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10771  on 15999  degrees of freedom
## Residual deviance: 10090  on 15986  degrees of freedom
## AIC: 10118
## 
## Number of Fisher Scoring iterations: 5

The new model summary shows that all \(13\) predictor variables are significant and that none of the Weight of Evidence transformed variables has a negative coefficient. We have performed coarse binning on all possible predictors, applied the WOE transformation to each of the bins, illustrated different methods of variable selection, fine-tuned the bins for some of the selected predictor variables and recalculated their WOE values, and fitted the final logistic regression model. With the final logistic regression model, we are ready for the final step in the credit scoring process: building a scorecard.

Building a Scorecard

Credit risk scorecards offer a powerful, empirically derived solution to business needs and have been widely used for predicting delinquency and nonpayment. A scorecard is a predictive model that provides statistical odds, or a probability, that an applicant with any given score will be “good” or “bad” [\(\ref{Siddiqi}\)]. Building a scorecard is a manual process: multiple models can be built as candidates for the final scorecard, and the one with the best set of characteristics and risk profile is chosen. With the logistic model built using the selected predictor variables, we can now generate our final scorecard. A scorecard needs to have a particular format. Scaling a scorecard refers to defining the range and format of the scores and the rate of change in odds for increases in score [\(\ref{Siddiqi}\)]. A scorecard can be scaled with a specified base odds at a base point and a rate of change of odds, or points to double the odds (PDO). The scores of a scorecard can be calculated by

\[\begin{align*} \text{Score} &= \text{Offset} + \text{Factor} \cdot \ln(\text{Odds}) \\ \text{Score} + \text{PDO} &= \text{Offset} + \text{Factor} \cdot \ln(2\cdot\text{Odds}). \end{align*}\]

Solving the equations above for the scaling constants, we have

\[\begin{align} &\text{PDO} = \text{Factor} \cdot \ln(2), \ \text{and thus} \nonumber \\ &\text{Factor} = \text{PDO}/\ln(2) \nonumber \\ &\text{Offset} = \text{Score} - [\text{Factor} \cdot \ln(\text{Odds})]. \label{eq:ofs} \end{align}\]

Notice that the Score and Odds in equation (\(\ref{eq:ofs}\)) are the Base Points and Base Odds. Suppose we choose to scale our scorecard with the standards below:

  • Base Points = 1000
  • Base Odds = 10:1
  • Points to Double the Odds (PDO) = 200

In other words, the scorecard is scaled so that a score of 1000 points corresponds to odds of 10:1, with the odds doubling every 200 points. Then we have

\[\begin{align*} \text{Factor} &= \frac{200}{\ln(2)} = 288.539 \\ \text{Offset} &= 1000-288.539 \cdot \ln(10) = 336 \end{align*}\]
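These scaling constants can be verified directly in R; the short sketch below simply re-applies equation (\(\ref{eq:ofs}\)).

PDO <- 200
BaseOdds <- 10
BasePts <- 1000
(Factor <- PDO / log(2))                      # 288.539
(Offset <- BasePts - Factor * log(BaseOdds))  # 335.61, rounded to 336 above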

It follows that each score corresponding to each attribute can be calculated as:

\[\begin{equation} \label{eq:score} \text{Score} = 336 + 288.539 \cdot \ln(\text{Odds}). \end{equation}\]

Since the logistic regression outputs \(\ln(\text{Odds})\), the score can be obtained in a convenient way. The odds in equation (\(\ref{eq:score}\)) represent the odds of the target being 1 versus the target being 0, so a higher score corresponds to a higher probability of the target being 1. Alternatively, if we wish to use the odds of the target being 0 versus the target being 1, then a higher score corresponds to a lower probability of the target being 1. It follows that the equation can be written as:

\[\text{Score} = 336 - 288.539 \cdot \ln(\text{Odds}),\]

since \(\ln\left(\frac{p}{1-p}\right) = -\ln\left(\frac{1-p}{p}\right).\)
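Because the fitted logistic regression already returns \(\ln(\text{Odds})\) on the link scale, the reversed formula above can be applied directly to the model output. The sketch below uses new_Model from the previous section; it only approximates the scorecard scores, since the scorecard rounds the points within each bin.

lp <- predict(new_Model, type = "link")  # ln(odds of bad = 1) for each training observation
score_lp <- 336 - 288.539 * lp           # higher score = lower estimated default risk
head(round(score_lp))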

Similarly, since Weight of Evidence is used for each grouped predictor variable, the scorecard formulation can be modified as

\[\begin{align} \text{Score} = \ln(\text{odds}) \cdot \text{Factor} + \text{Offset} &= -\left( \sum_{j,\,i=1}^{k,\,n}(\text{woe}_j \cdot \beta_i)+\alpha \right)\cdot \text{Factor} + \text{Offset} \nonumber\\ &= -\left( \sum_{j,\,i=1}^{k,\,n}\left(\text{woe}_j \cdot \beta_i+ \frac{\alpha}{n}\right) \right)\cdot \text{Factor} + \text{Offset} \nonumber\\ &= \sum_{j,\,i=1}^{k,\,n} \left(-\left(\text{woe}_j \cdot \beta_i+ \frac{\alpha}{n}\right) \cdot \text{Factor} + \frac{\text{Offset}}{n}\right), \label{eq:scorecd} \end{align}\]

where
\(woe_j\) = Weight of Evidence for each binned predictor variable
\(\beta_i\) = regression coefficient for each predictor variable
\(\alpha\) = intercept from regression model
\(k\) = number of bins of each predictor variable
\(n\) = number of predictor variables

For our example, we want to build a scorecard that assigns higher scores to observations that are not likely to default. In other words, higher scores will correspond to the target being 0. Suppose an observation’s dti falls into the bucket \([11,16)\), which has a WOE value of \(-0.248\). The coefficient for dti is \(0.688\) and the model intercept is \(-2.142\), both of which can be found in the summary of the final logistic regression model we built in section \(\ref{finalGlm}\). Then the points assigned to this dti attribute can be calculated as: \[Score = -(-0.248 \cdot 0.688 + \frac{-2.142}{13}) \cdot 288.539 + \frac{336}{13}=122.62 \approx 123 \text{ points}\]
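The same calculation can be reproduced in R. The sketch below uses the values quoted above together with the unrounded Factor and Offset, so the result differs from the hand calculation only by rounding.

woe_dti  <- -0.248          # WOE of the dti bin [11,16)
beta_dti <- 0.688           # coefficient of dti_WOE in the final model
alpha    <- -2.142          # intercept of the final model
n        <- 13              # number of predictors in the scorecard
Factor   <- 200 / log(2)
Offset   <- 1000 - Factor * log(10)
round(-(woe_dti * beta_dti + alpha / n) * Factor + Offset / n)  # 123 points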

The formula in equation (\(\ref{eq:scorecd}\)) calculates the points assigned to each bin of every predictor variable in the scorecard. Summing the points across all attributes gives the final score for each observation. Equation (\(\ref{eq:scorecd}\)) also shows that the trend of, and the differences between, the Weight of Evidence values within the binned variables affect the points allocation and hence the final score [\(\ref{Siddiqi}\)].

The ScorecardProfet() function takes a WOEProfet object and a logistic regression model fitted with the WOE transformed predictor variables as inputs and generates a scorecard. We now input the filtered object returned by Var_select(), the final logistic model, and the PDO, BaseOdds, and BasePts values that specify how we want to scale our scorecard. Note that “reverse = FALSE” indicates that higher points correspond to a lower probability of being the target, that is, a lower probability of defaulting.

scorecard = ScorecardProfet(selected_var, id = 'ID', target = 'bad', 
                                GLModel = new_Model, PDO = 200, BaseOdds = 10,
                                BasePts = 1000, reverse = FALSE)

Table \(\ref{tab:final-scorecard}\) shows the final scorecard, which consists of an overall summary of the bins for each variable and the points assigned to each attribute. After producing the final scorecard, the points allocated to each attribute should be checked: the points should make logical sense and agree with the trend of the corresponding characteristic. The scores, along with other business considerations such as expected approval rates, profit, churn, and losses, are then used as a basis for decision making [\(\ref{Siddiqi}\)].

Final Scorecard
Attribute Bins WOE Points
acc_open_past_24mths [0,2) -0.6150278 161
acc_open_past_24mths [2,3) -0.3722626 126
acc_open_past_24mths [3,4) -0.1882043 100
acc_open_past_24mths [4,5) -0.0665712 83
acc_open_past_24mths [5,6) 0.0677955 64
acc_open_past_24mths [6,7) 0.1137188 57
acc_open_past_24mths [7,9) 0.4097335 15
acc_open_past_24mths [9,Inf) 0.5111722 1
num_tl_op_past_12m [0,1) -0.5178928 119
num_tl_op_past_12m [1,2) -0.3292723 102
num_tl_op_past_12m [2,3) 0.1128501 64
num_tl_op_past_12m [3,4) 0.1391404 61
num_tl_op_past_12m [4,5) 0.3006259 47
num_tl_op_past_12m [5,8) 0.4140323 37
num_tl_op_past_12m [8,Inf) 0.8792887 -3
mths_since_recent_inq [0,1) 0.3673093 18
mths_since_recent_inq [1,2) 0.2874294 30
mths_since_recent_inq [2,4) 0.1385129 53
mths_since_recent_inq [4,7) 0.1181563 56
mths_since_recent_inq [7,10) 0.0230192 70
mths_since_recent_inq [10,12) -0.0914692 87
mths_since_recent_inq [12,15) -0.1304416 93
mths_since_recent_inq [15,22) -0.3213718 122
mths_since_recent_inq [22,Inf) -0.5251574 152
mths_since_recent_inq Missing -0.7662417 189
bc_open_to_buy [0,330) 0.2717207 31
bc_open_to_buy [330,900) 0.2299999 38
bc_open_to_buy [900,1750) 0.2085680 41
bc_open_to_buy [1750,2810) 0.1864791 45
bc_open_to_buy [2810,4250) 0.0844051 60
bc_open_to_buy [4250,6600) 0.0237787 70
bc_open_to_buy [6600,9500) -0.0125384 75
bc_open_to_buy [9500,14500) -0.2797670 117
bc_open_to_buy [14500,25300) -0.3653031 130
bc_open_to_buy [25300,Inf) -0.7475561 189
bc_open_to_buy Missing 0.3919541 13
mths_since_recent_bc [0,4) 0.2031347 42
mths_since_recent_bc [4,8) 0.1678404 47
mths_since_recent_bc [8,14) 0.1676860 47
mths_since_recent_bc [14,19) 0.0916858 59
mths_since_recent_bc [19,27) -0.1734459 101
mths_since_recent_bc [27,42) -0.1710642 100
mths_since_recent_bc [42,84) -0.5271410 156
mths_since_recent_bc [84,Inf) -0.7418607 190
mths_since_recent_bc Missing 0.3984266 11
dti [0,11) -0.3298119 139
dti [11,16) -0.2475029 123
dti [16,20) -0.0683093 87
dti [20,25) 0.1290668 48
dti [25,30) 0.2252770 29
dti [30,Inf) 0.4900570 -24
tot_hi_cred_lim [2500,28000) 0.2868471 8
tot_hi_cred_lim [28000,42000) 0.2336118 20
tot_hi_cred_lim [42000,61000) 0.1930912 29
tot_hi_cred_lim [61000,93200) 0.1884696 30
tot_hi_cred_lim [93200,132000) -0.0708403 90
tot_hi_cred_lim [132000,220000) -0.1141389 99
tot_hi_cred_lim [220000,330000) -0.4301702 172
tot_hi_cred_lim [330000,Inf) -0.4594614 179
annual_inc [6000,32100) 0.4799147 21
annual_inc [32100,40000) 0.1767463 54
annual_inc [40000,48500) 0.1345241 59
annual_inc [48500,55000) 0.0409893 69
annual_inc [55000,65100) 0.0116813 72
annual_inc [65100,75000) -0.0594305 80
annual_inc [75000,90000) -0.1262146 87
annual_inc [90000,125000) -0.3394268 111
annual_inc [125000,160000) -0.7044173 151
annual_inc [160000,Inf) -0.1484633 90
inq_last_6mths [0,1) -0.2039304 90
inq_last_6mths [1,2) 0.1215340 63
inq_last_6mths [2,3) 0.4804489 34
inq_last_6mths [3,Inf) 0.5836116 25
verification_status Not Verified -0.3175578 116
verification_status Source Verified -0.0275140 77
verification_status Verified 0.3100625 31
bc_util [0,17) 0.0450363 63
bc_util [17,34) -0.3513377 154
bc_util [34,55) -0.1999002 119
bc_util [55,65) -0.0165624 77
bc_util [65,74) 0.0370632 65
bc_util [74,89) 0.1229216 45
bc_util [89,96) 0.1506662 39
bc_util [96,Inf) 0.2330074 20
bc_util Missing 0.3597591 -9
emp_length < 1 year/1 year/2 years 0.0869045 59
emp_length 3 years/4 years/5 years -0.0272300 78
emp_length 6 years/7 years/8 years/9 years -0.0625190 84
emp_length 10+ years -0.1522784 98
emp_length n/a 0.5007768 -8
mths_since_rcnt_il [1,7) -0.0372733 83
mths_since_rcnt_il [7,14) -0.3911738 177
mths_since_rcnt_il [14,30) -0.9469443 323
mths_since_rcnt_il [30,Inf) -1.1501025 377
mths_since_rcnt_il Missing 0.0230874 67

Scoring Validation Data

For our workflow example, we split the data into training and validation datasets and ran the credit scoring process on the training data; we now use the scorecard built on the training data to score the validation data. The ScoreDataProfet() function scores the validation data based on the final scorecard. In order to use this function, the validation data needs to be binned in the same way as the binned training data, that is, according to the bins shown on the final scorecard. This process is not very time consuming since the number of predictors in a final scorecard is usually less than 20. In our case, there are 13 predictors in the scorecard, and we only need to bin these predictors in the validation data.

We first filter the validation data down to only the variables in the final scorecard. The WOE_customNum() function can then cut the numeric variables into the same bins as the corresponding bins in the scorecard, and WOE_customFac() can group the categorical variables similarly. The code below shows an example of binning the first predictor on the scorecard, “acc_open_past_24mths”. The endpoints of the interval for each bin of the variable are passed to the “breaks” argument, and the brackets of the intervals need to be consistent with the scorecard. Figure \(\ref{fig:valid-ex}\) shows the variable binned in the same manner as the scorecard indicates.

(names <- unique(scorecard$Attribute))
##  [1] "acc_open_past_24mths"  "num_tl_op_past_12m"    "mths_since_recent_inq"
##  [4] "bc_open_to_buy"        "mths_since_recent_bc"  "dti"                  
##  [7] "tot_hi_cred_lim"       "annual_inc"            "inq_last_6mths"       
## [10] "verification_status"   "bc_util"               "emp_length"           
## [13] "mths_since_rcnt_il"
valid <- validation %>% dplyr::select(ID, bad, all_of(names))

acc_open_past_24mths = WOE_customNum(valid, "acc_open_past_24mths", "ID", "bad", 
                                      breaks = c(0,2,3,4,5,6,7,9,Inf),
                                      right_bracket = F, plot = T)$NewBin
Binning variable in the validation set

Repeat this process for all the significant predictor variables, and the validation set is ready for scoring. Before feeding the validation set into the function, we need to make sure that the variable names on the binned validation set are consistent with the variable names on the final scorecard. Table \(\ref{tab:scored-data}\) shows a snapshot of the first 4 observations from the output data frame of the function. For example, the person with ID 194459 defaulted. Their “acc_open_past_24mths” falls into the bucket \([5,6)\), and our final scorecard assigned \(64\) points to this attribute. Repeating this for the remaining 12 attributes and summing up the corresponding points gives \(64+64+30+41+101-24+99+69+63+77+39+84+67=774.\) Thus, the final score assigned to this person is \(774.\) It makes sense that the next three observations did not default and were assigned higher scores. In summary, credit scoring provides an objective way to assess the level of risk associated with applicants.
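The sketch below shows how the remaining predictors might be binned and assembled into binData_valid. It is illustrative only: the dti breaks are taken from the scorecard, WOE_customFac() would be used analogously for the two categorical predictors, only two of the thirteen predictors are shown, and the merge assumes that each NewBin data frame returned by WOE_customNum() carries the ID column alongside the binned variable.

dti = WOE_customNum(valid, "dti", "ID", "bad",
                    breaks = c(0, 11, 16, 20, 25, 30, Inf),
                    right_bracket = F, plot = F)$NewBin
# ... repeat for the other numeric predictors, and use WOE_customFac() for
# verification_status and emp_length ...

# Merge the binned pieces into one validation data set (illustrative)
binData_valid <- Reduce(function(x, y) merge(x, y, by = "ID"),
                        list(valid[, c("ID", "bad")], acc_open_past_24mths, dti))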

colnames(binData_valid) <- gsub("_Bins","", colnames(binData_valid))
valid_score <- ScoreDataProfet(binData_valid, scorecard, "ID", "bad")
Head of the scored validation data
ID bad acc_open_past_24mths acc_open_past_24mths_Points num_tl_op_past_12m num_tl_op_past_12m_Points mths_since_recent_inq
194459 1 [5,6) 64 [2,3) 64 [1,2)
81020 0 [0,2) 161 [0,1) 119 Missing
203600 0 [3,4) 100 [3,4) 61 [4,7)
20113 0 [3,4) 100 [1,2) 102 [12,15)
mths_since_recent_inq_Points bc_open_to_buy bc_open_to_buy_Points mths_since_recent_bc mths_since_recent_bc_Points dti dti_Points tot_hi_cred_lim
30 [900,1750) 41 [19,27) 101 [30,Inf) -24 [132000,220000)
189 [9500,14500) 117 [84,Inf) 190 [16,20) 87 [61000,93200)
56 [9500,14500) 117 [27,42) 100 [0,11) 139 [42000,61000)
93 [9500,14500) 117 [8,14) 47 [16,20) 87 [42000,61000)
tot_hi_cred_lim_Points annual_inc annual_inc_Points inq_last_6mths inq_last_6mths_Points verification_status verification_status_Points
99 [48500,55000) 69 [1,2) 63 Source Verified 77
30 [55000,65100) 72 [0,1) 90 Verified 31
29 [55000,65100) 72 [0,1) 90 Source Verified 77
29 [32100,40000) 54 [0,1) 90 Source Verified 77
bc_util bc_util_Points emp_length emp_length_Points mths_since_rcnt_il mths_since_rcnt_il_Points Score
[89,96) 39 6 years/7 years/8 years/9 years 84 Missing 67 774
[55,65) 77 6 years/7 years/8 years/9 years 84 Missing 67 1314
[34,55) 119 < 1 year/1 year/2 years 59 Missing 67 1086
[34,55) 119 < 1 year/1 year/2 years 59 Missing 67 1041

Model Validation and Predictive Power

It is important to evaluate or compare scorecards in order to know how good a scorecard is or to choose the best one. Scorecards are predictive models that predict the probability of a case being good or bad (or a target being \(1\) or \(0\)) [\(\ref{Siddiqi}\)]. Gains tables, ROC curves, and the KS statistic are three commonly used methods for assessing binary classification model performance. Functions from the ROCit package [\(\ref{ROCit}\)] were used to generate the results for each of the three measures.

Gains Table

A gains table evaluates the predictive power of a classification model: it measures how much better one can do with the model compared to without it. The scored data are ranked in descending order and split into buckets, and it is common to create \(10\) buckets (deciles). Table \(\ref{tab:gains-tb-train}\) and table \(\ref{tab:gains-tb-test}\) are the gains tables for the training data and validation data respectively. The gain at a given depth is the ratio of the cumulative number of (positive) responses up to that bucket to the total number of (positive) responses in the sample. For example, at a depth of 0.5 on the gains table for the training set, the interpretation is that \(74.5\%\) of targets are covered in the top \(50\%\) of the training data with the fitted model. The baseline on the gains chart is the random response without the model, so a model whose gains sit above the baseline performs better than random guessing. The gains tables are generated from the ROCit package [\(\ref{ROCit}\)] and the gains charts (figure \(\ref{fig:gains}\)) are generated from the corresponding tables. The gains charts indicate that the model performed slightly better on the training set, and the overall model performance on the two sets is similar.
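A gains table of this form can be generated with the gainstable() function from the ROCit package. The sketch below assumes a hypothetical data frame train_score of scored training observations with columns Score and bad, and that the score is oriented so that higher values correspond to the positive class (otherwise the score can be negated).

library(ROCit)

# Ten-bucket (decile) gains table for the scored training data
gtable_train <- gainstable(score = train_score$Score,
                           class = train_score$bad,
                           ngroup = 10)
gtable_train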

Gains table for the training data
Bucket Obs CObs Depth Resp CResp RespRate CRespRate CCapRate Lift CLift Avg Score
1 1600 1600 0.1 389 389 0.243125 0.2431250 0.2308605 2.3086053 2.308605 1401
2 1600 3200 0.2 267 656 0.166875 0.2050000 0.3893175 1.5845697 1.946587 1236
3 1600 4800 0.3 233 889 0.145625 0.1852083 0.5275964 1.3827893 1.758655 1150
4 1600 6400 0.4 207 1096 0.129375 0.1712500 0.6504451 1.2284866 1.626113 1084
5 1600 8000 0.5 160 1256 0.100000 0.1570000 0.7454006 0.9495549 1.490801 1027
6 1600 9600 0.6 144 1400 0.090000 0.1458333 0.8308605 0.8545994 1.384768 971
7 1600 11200 0.7 100 1500 0.062500 0.1339286 0.8902077 0.5934718 1.271725 916
8 1600 12800 0.8 80 1580 0.050000 0.1234375 0.9376855 0.4747774 1.172107 859
9 1600 14400 0.9 65 1645 0.040625 0.1142361 0.9762611 0.3857567 1.084735 791
10 1600 16000 1.0 40 1685 0.025000 0.1053125 1.0000000 0.2373887 1.000000 663
Gains table for the validation data
Bucket Obs CObs Depth Resp CResp RespRate CRespRate CCapRate Lift CLift Avg Score
1 400 400 0.1 71 71 0.1775 0.1775000 0.1825193 1.8251928 1.825193 1386
2 400 800 0.2 76 147 0.1900 0.1837500 0.3778920 1.9537275 1.889460 1226
3 400 1200 0.3 49 196 0.1225 0.1633333 0.5038560 1.2596401 1.679520 1142
4 400 1600 0.4 46 242 0.1150 0.1512500 0.6221080 1.1825193 1.555270 1077
5 400 2000 0.5 41 283 0.1025 0.1415000 0.7275064 1.0539846 1.455013 1017
6 400 2400 0.6 35 318 0.0875 0.1325000 0.8174807 0.8997429 1.362468 963
7 400 2800 0.7 24 342 0.0600 0.1221429 0.8791774 0.6169666 1.255968 911
8 400 3200 0.8 24 366 0.0600 0.1143750 0.9408740 0.6169666 1.176092 853
9 400 3600 0.9 17 383 0.0425 0.1063889 0.9845758 0.4370180 1.093973 784
10 400 4000 1.0 6 389 0.0150 0.0972500 1.0000000 0.1542416 1.000000 657
Gains Charts

The ROC Curve and AUC Statistic

The Receiver Operating Characteristic (ROC) curve originated in signal detection theory, which was developed in the 1950s and 1960s. The ROC curve later became widely used in radiology and medical diagnostic testing, and it has also been developed as a statistical tool for measuring model performance [\(\ref{Pepe}\)]. Today, the ROC curve is a common tool for assessing the predictive power of classification models in the credit scoring industry.

Measures of Accuracy for Binary Classifiers

Classification with a binary outcome is the simplest classification problem, yet it is widely applicable. For example, the response in the Lending Club data has only two outcomes, defaulted or not defaulted, and a binary classifier is a natural model for predicting such a response. After fitting classification models, their performance needs to be assessed. The result of a binary classifier can be classified into four groups [\(\ref{ROCit}\)]:

  • True Positive: when the observation is positive and the classifier correctly predicted it to be positive.
  • False Negative: when the observation is positive but the classifier incorrectly predicted it to be negative.
  • True Negative: when the observation is negative and the classifier correctly predicted it to be negative.
  • False Positive: when the observation is negative but the classifier incorrectly predicted it to be positive.

A confusion matrix is a \(k \times k\) table that allows visualization of a classification model's results, where \(k\) is the number of classes of the response variable. For binary classifiers, the confusion matrix is a \(2 \times 2\) table (figure \(\ref{fig:confusion-m}\)). A number of metrics can be calculated from a confusion matrix.

Confusion Matrix

The misclassification rate is the most common metric used to validate a binary classifier; it represents the probability that the classifier makes a wrong prediction and can be calculated as [\(\ref{ROCit}\)]

\[\begin{align*} \text{Misclassification} &= P(\hat{Y} \neq Y) \\ &= P(\hat{Y}=0, Y=1) + P(\hat{Y}=1, Y=0) \hspace{0.5cm} \text{since the two events are disjoint} \\ &= \frac{FN}{TN+FP+FN+TP} + \frac{FP}{TN+FP+FN+TP} \\ &= \frac{FN+FP}{TN+FP+FN+TP}. \end{align*}\]

The accuracy is simply

\[ \text{Accuracy} = P(\hat{Y} = Y) = 1-\text{Misclassification},\] which is the probability of a correct classification. The misclassification rate and accuracy rate provide an overview of a binary classifier’s performance. However, how accurately the classifier predicts the positive cases and the negative cases individually is often a more useful measure. A classifier can produce two types of errors, false positive errors and false negative errors. An ideal classifier would produce neither of these errors, and every predicted response would be either a true positive or a true negative. The fractions of true and false positives are defined as:

\[\begin{align*} \text{False Positive Rate (FPR)} &= P(\hat{Y}=1 \mid Y=0), \\ \text{True Positive Rate (TPR)} &= P(\hat{Y}=1 \mid Y=1). \end{align*}\]

The FPR and TPR are known by several other terms. In biomedical research, the sensitivity (TPR) and specificity (1 - FPR) are often used to describe test performance [\(\ref{Pepe}\)]:

\[\begin{align*} Sensitivity &= TPR = \frac{TP}{TP+FN}, \\ Specificity &= 1 - FPR = 1-\frac{FP}{FP+TN} = \frac{FP+TN-FP}{FP+TN}=\frac{TN}{FP+TN} = TNR. \end{align*}\]

In other words, sensitivity is the true positive rate and the specificity is the true negative rate.
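As a small numerical illustration of these definitions, the metrics can be computed directly from the four cells of a confusion matrix; the counts below are made up for demonstration only.

# Hypothetical confusion-matrix counts
TP <- 120; FN <- 80; FP <- 150; TN <- 650

misclassification <- (FN + FP) / (TP + FN + FP + TN)  # P(prediction is wrong)
accuracy          <- 1 - misclassification
sensitivity       <- TP / (TP + FN)                   # true positive rate (TPR)
specificity       <- TN / (FP + TN)                   # true negative rate, 1 - FPR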

The ROC Curve

After a classification model makes predictions, many models output probability scores or other quantitative results instead of a simple classification. A suitable threshold is often chosen to dichotomize the scores in order to calculate metrics and assess the model performance. For example, in healthcare research, a lower threshold would be selected for a serious disease. The choice of threshold depends on the trade-off that is acceptable between failing to detect an event and falsely identifying one [\(\ref{Pepe}\)]. The metrics mentioned previously only measure model performance at a single threshold. The ROC curve, in contrast, describes the range of trade-offs that a classification model can achieve. The ROC curve is obtained by plotting the sensitivity along the y-axis against the corresponding \(1-\text{specificity}\) (i.e. FPR) along the x-axis for all possible thresholds.

Suppose a threshold \(c\) is selected and the predicted value from the binary classifier is \(\hat{Y}\); then the true and false positive rates at threshold \(c\) become [\(\ref{Pepe}\)]

\[\begin{align*} TPR(c) &= P(\hat{Y} \ge c \mid Y=1), \\ FPR(c) &= P(\hat{Y} \ge c \mid Y=0). \end{align*}\]

Then the ROC curve is the entire set of possible TPR and FPR attained by dichotomizing \(\hat{Y}\) with different thresholds. Thus, the ROC curve can be written as [\(\ref{Pepe}\)]

\[\begin{equation} \label{eq:roc1} ROC(\cdot) = \{\left(FPR(c),TPR(c)\right): -\infty<c<\infty\}. \end{equation}\]

Notice that when the threshold \(c\) increases, both \(TPR(c)\) and \(FPR(c)\) decrease. In theory, the threshold \(c\) could be any real number, so we can consider the extreme cases: as \(c\to\infty\), \(\lim_{c\to\infty}TPR(c) = \lim_{c\to\infty}FPR(c) = 0,\) and as \(c\to-\infty\), \(\lim_{c\to-\infty}TPR(c) = \lim_{c\to-\infty}FPR(c) = 1.\) Therefore, the ROC curve is a monotonically increasing function in the first quadrant. The ROC curve expressed in equation (\(\ref{eq:roc1}\)) can also be written as [\(\ref{Pepe}\)]

\[\begin{equation} ROC(\cdot) = \{\left(t,ROC(t)\right): 0\le t\le 1\}, \end{equation}\]

where \(FPR(c)=t\) and the ROC function maps \(t\) to \(TPR(c).\)

When the prediction from the binary classifier is unrelated to the true outcome, the probability distributions of \(\hat{Y}\) for the two outcome groups (\(Y=0\) and \(Y=1\)) are the same. In this case, \(TPR(c)=FPR(c)\) for all thresholds and \(ROC(t)=t,\) which means the ROC curve is a diagonal line with unit slope in the first quadrant and is equivalent to random guessing. On the other hand, when the binary classifier perfectly separates the true outcomes, then \(TPR(c)=1\) and \(FPR(c)=0\) at some threshold \(c.\) That is, the ROC curve runs along \(x=0\) and \(y=1\) in the first quadrant.

Most binary classifiers have ROC curves that lie between these two extreme cases. For models \(a\) and \(b\) and thresholds \(c_a\) and \(c_b\) such that \(TPR_a(c_a)=TPR_b(c_b)\), model \(a\) performs better if the corresponding false positive rates satisfy \(FPR_a(c_a) < FPR_b(c_b).\) Figure \(\ref{fig:model-roc}\) shows the ROC curves for the final scorecard applied to the training set and the validation set respectively. The separations between the distributions on the two plots look similar, and the scorecard appears, by eye, to perform slightly better on the training set.

ROC Curves for Training and Validation Sets
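ROC curves such as those in figure \(\ref{fig:model-roc}\) can be obtained with the rocit() function from the ROCit package. The sketch below again assumes hypothetical scored data frames train_score and valid_score with columns Score and bad, with the scores oriented so that larger values correspond to the positive class.

library(ROCit)

roc_train <- rocit(score = train_score$Score, class = train_score$bad)
roc_valid <- rocit(score = valid_score$Score, class = valid_score$bad)

plot(roc_train)  # empirical ROC curve for the training data
plot(roc_valid)  # empirical ROC curve for the validation data

# AUC values, as reported in the fit statistics table
roc_train$AUC
roc_valid$AUC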

Summarizing ROC Curve and AUC

Summarizing the ROC curve in a single quantitative index is useful when comparing different ROC curves. The most widely used summary measure is the area under the ROC curve (AUC) defined as [\(\ref{Pepe}\)]

\[AUC = \int_{0}^{1} ROC(t) \,dt. \]

A classifier with the perfect ROC curve, shown in green in figure \(\ref{fig:roc}\), would have an AUC value of \(1\). An uninformative classifier, such as the red ROC curve in figure \(\ref{fig:roc}\), would have an AUC value of \(0.5\). Most classifiers have AUC values that fall between \(0.5\) and \(1\). When comparing binary classifiers, a common interpretation is that a better classifier has a larger AUC value. From figure \(\ref{fig:roc}\), we can see that the scorecard performed slightly better on the training set since its ROC curve is closer to the perfect case overall; however, as the FPR approaches \(0.8\), the model separates the response better on the validation set. The AUC values for both datasets are shown in table \(\ref{tab:AUC-KS-tb}\). The training set has the larger AUC value, which indicates that the scorecard performed better on the training data.

ROC Curves

Kolmogorov–Smirnov (KS) Statistics

The KS statistic is another commonly used statistic for evaluating scorecard performance. The Kolmogorov–Smirnov test is based on the maximum difference between two cumulative distributions [\(\ref{Massey}\)]. The KS statistic is closely related to the Kolmogorov–Smirnov test and can be written as [\(\ref{Pepe}\)]

\[KS = \max_{t}\lvert ROC(t)-t \rvert = \max_{c}\lvert TPR(c) - FPR(c) \rvert.\]

Thus, the KS statistic is the maximum difference between the cumulative capture rates of positive and negative responses across thresholds. In a credit scoring process, the KS plot provides a visualization of scorecard effectiveness. The KS plots for the training data and validation data for our example are shown in figure \(\ref{fig:KS}\). The KS statistic is the largest vertical difference between the cumulative TPR and FPR distributions on the plots. The KS value is bounded between \(0\) and \(1\), and a higher value indicates more predictive power since it generally means that the model separates the outcomes of the target better. However, KS values usually range from \(20\%\) to \(70\%\); a value below this range indicates that the model should be questioned, while a value above it suggests that the model might be too good to be true [\(\ref{Anderson}\)]. The KS values for both datasets are also shown in table \(\ref{tab:AUC-KS-tb}\). The KS statistic indicates that the scorecard performed slightly better on the training data, which agrees with the AUC.
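The KS plots and statistics can be obtained from the same rocit objects. The sketch below reuses roc_train and roc_valid from the ROC sketch above; ksplot() is part of the ROCit package, and the statistic itself is simply the largest gap between the cumulative TPR and FPR curves.

ksplot(roc_train)                    # KS plot for the training data
max(roc_train$TPR - roc_train$FPR)   # KS statistic for the training data
max(roc_valid$TPR - roc_valid$FPR)   # KS statistic for the validation data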

KS Plots
Fit statistics table
Data AUC KS
Training Data 0.689 0.287
Validation Data 0.674 0.267

This document has demonstrated the workflow of building a credit scorecard with the Rprofet package and validating the resulting scorecard. The package supports each step of the credit scoring process and provides a variety of options and customization to the user. Further enhancements to the package are planned and in development.