Credit Risk Modeling: Combining Classification and Regression Algorithms to Predict Expected Loss

Credit risk assessment is of paramount importance in the financial industry. Machine learning techniques have been used successfully over the last two decades to predict the probability of loan default (PD). This way, credit decisions can be automated and risk can be reduced significantly. More recently, intensified regulatory requirements have led to the need to include another parameter in risk models: loss given default (LGD), the share of the loan which cannot be recovered in case of loan default. We aim to build a unified credit risk model by estimating both parameters jointly to estimate expected loss. A large, high-dimensional, real-world dataset is used to benchmark several combinations of classification, regression and feature selection algorithms. The results indicate that non-linear techniques work especially well to model expected loss.


Introduction
Credit scoring, the numerical assessment of credit default risk, was first developed in the 1940s and has constantly evolved ever since. Credit scoring is an important first step towards reducing credit risk and is proven to be highly effective. While classical techniques included scorecards and statistical techniques like logistic regression, with the advent of data mining, credit risk analysts started to utilize modern machine learning algorithms for credit scoring (Yap et al., 2011). The main aim of credit scoring is to estimate expected loss (EL), which is defined as EL = PD × LGD × EAD, where the parameters are probability of default (PD), loss given default (LGD) and exposure at default (EAD) (Bluhm et al., 2003). PD indicates how likely a borrower is to not be able to (fully) pay back their loan.
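The expected loss decomposition above is straightforward to express in code. The following minimal sketch (the function name and the illustrative parameter values are our own) shows how the three parameters combine:

```python
# Expected loss as the product of its three components: EL = PD * LGD * EAD.
def expected_loss(pd_: float, lgd: float, ead: float) -> float:
    """Expected loss in currency units for a single exposure."""
    return pd_ * lgd * ead

# Illustrative values: 2% default probability, 45% loss given default,
# and an exposure of 10,000 currency units -> expected loss of about 90.
print(expected_loss(0.02, 0.45, 10_000))
```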
LGD is the share of the loan which the issuer will not be able to recover. EAD is the amount of money at risk. It is also common to express EL as a percentage of EAD, EL% = PD × LGD (Basel Committee on Banking Supervision, 2005; Gatti & Querci, 2008). One benchmarking study on LGD was conducted by Loterman et al. (2012). There, a large variety of regression algorithms were benchmarked on five real-world credit datasets. Loterman et al. provide evidence that non-linear techniques, like Support Vector Machines (SVM) and Artificial Neural Networks (ANN), outperform "traditional" linear techniques, which suggests that non-linear relationships exist between the features and the LGD parameter (Loterman et al., 2012). This is supported by Tobback et al. (2014), who found that non-linear support vector regression gives the best results when forecasting LGD.
In estimating PD, the machine-learning task is expressed as a binary classification problem. Over the last decade, researchers have developed a broad variety of approaches to classification problems. For validating classification performance, mainly one benchmark data set has been used: the "German Credit" data set 1. The most researched machine learning techniques for credit scoring are ANN, SVM and ensemble methods. It is known that on the "German Credit" data, an SVM-based model performs very well and better than an ANN approach (Heiat, 2012). However, this improvement over neural nets is only marginal. For estimating LGD, the task is different. As LGD is a continuous percentage share, regression models have to be developed to model it. Here, the natural skewness of the dataset further complicates the task.
One of the distinctive features of this paper compared to previous research is the use of a new dataset. This article largely follows the methodology of Loterman et al. (2012) and aims to compare several LGD estimation techniques. The dataset we explore is inherently different from those of previous studies, exhibiting many more features. Unlike previous works, we introduce various feature selection methods to reduce the dimensionality of the data. The dataset used in this article comes from a machine learning competition sponsored by Imperial College London through the "kaggle" platform 2. We will therefore refer to this data as the "kaggle" dataset. The outcomes of this article are largely based on the experiences of the first author, who participated in the challenge, ranking within the top 10% of the contenders.

Credit Scoring Datasets
The "kaggle" dataset presents challenges in the following three dimensions:
• number of features;
• balance of the data;
• outcomes estimated.
As we have mentioned above, most researchers have focused on modeling the PD parameter, utilizing one particular data set for benchmarking: the "German Credit" data set. Some researchers also used the "Australian Credit" data set 3. In Table 1 we provide a comparison of the data sets with respect to their size. The larger size of the data (the number of observations) requires faster optimization procedures. Recent advancements in large-scale convex optimization are briefly discussed in this article in application to the LGD prediction task.
Datasets analyzed for predicting probability of default were fairly well balanced (about 30% of defaults). In a real-life setting, it is more likely to encounter significantly more imbalanced data sets, since the rate of default on consumer credit in the U.S., for example, was only 1.5% in 2012 (2013). The kaggle dataset exhibits roughly 10% of defaults, which is a better representation of real-life data. It also presents additional challenges. In particular, it is not entirely clear whether the asymptotic consistency of the cross-validation procedure is preserved in this setting.
Additionally, most past research has focused exclusively on one of the two parameters (PD and LGD) described before. Loss given default research has commonly utilized training data consisting only of defaulters for their regression models, while PD-research has only investigated classification methods. In this article we consider both tasks together, developing a hybrid approach using a single dataset.
Finally, the features of the kaggle dataset were completely anonymous. The organizers of the challenge intentionally did not provide any information that would describe the features in any way, naming them simply as F1 ... Fm. The following plots show the distribution of the response variable, highlighting the skewness of the data.

Experimental Set-Up
The kaggle data were preprocessed in two steps. First, missing values were imputed with the column means, because most predictive algorithms in the package used cannot handle missing values, and there is evidence that missing value imputation improves prediction accuracy (Batista and Monard, 2003). Then, the data were scaled. Scaling was motivated by several practical considerations. Some values in the original data were so large that the software package could not handle them and raised an infinity error. Furthermore, for support vector machines in particular it is strongly recommended to scale the data before analysis (Hsu et al., 2010).
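The two preprocessing steps can be sketched in a few lines of NumPy. This is a minimal illustration of mean imputation followed by standardization, not the exact pipeline used in the experiments; in practice the imputation means and scaling statistics should be fit on the training folds only to avoid leakage:

```python
import numpy as np

def preprocess(X: np.ndarray) -> np.ndarray:
    """Impute missing values with column means, then scale to zero mean, unit variance."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)            # per-column means, ignoring NaNs
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]  # step 1: mean imputation
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0                      # guard against constant columns
    return (X - mu) / sigma                      # step 2: standard scaling

# Tiny example with one missing value and one large-valued column.
X = np.array([[1.0, 200.0], [np.nan, 400.0], [3.0, 600.0]])
Xs = preprocess(X)
```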
Two metrics are considered for the purpose of technique benchmarking. The F1 score is used to assess the accuracy of the defaulter classification task: F1 = 2TP / (2TP + FP + FN), where TP is the number of true positives, i.e. the number of observations correctly classified as "1" (loan default), FN is the number of false negatives, i.e. the number of observations falsely classified as "0" (no loan default), and FP is the number of false positives, i.e. the number of observations falsely classified as "1". F1 is a very popular metric for measuring the performance of binary classifiers (Zhang and Zhang, 2004). A higher F1 score corresponds to better classifier performance. Mean absolute error (MAE) is used to measure the performance of LGD estimation. MAE was also used to evaluate the contenders of the "kaggle" challenge. The comparison of estimation techniques is facilitated through a five-fold cross-validation procedure, which in most cases produces a reasonable tradeoff between variance and bias (Kohavi, 1995).
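The two metrics are simple to compute directly from their definitions. A minimal sketch (function names are our own):

```python
import numpy as np

def f1_score_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), from the confusion counts defined above."""
    return 2 * tp / (2 * tp + fp + fn)

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error for the LGD estimation task."""
    return float(np.mean(np.abs(y_true - y_pred)))

print(f1_score_counts(tp=8, fp=2, fn=2))                              # -> 0.8
print(mae(np.array([0.0, 0.5, 1.0]), np.array([0.1, 0.5, 0.7])))     # -> ~0.133
```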

Feature Selection
Feature subset selection is the process of selecting features to be used in machine learning models. This research direction emerged with the horizontal growth (in the number of features) of available data, reaching a preliminary peak in 2003 with a special issue of the Journal of Machine Learning Research devoted to it. The primary motivations for feature selection are:
• improvement of predictive accuracy;
• reduced data storage;
• reduced computational cost.
Feature selection methods can be loosely grouped into three categories: filters, wrappers and embedded methods. Filters apply simple, mostly univariate, correlation-based criteria to detect relationships between individual features and the response. They thus act independently of the chosen learning algorithm. Univariate approaches are usually fast, but not always accurate. With wrapper methods, the predictive performance of an algorithm is compared for different subsets of features. Depending on the search method chosen, this could deliver very good results but comes at a potentially high, or even prohibitively high, computational cost (consider, for instance, an exhaustive search over our dataset with 759 features).
Embedded methods are learning algorithms which already incorporate implicit feature selection. Examples include l1-regularized ("sparse") models and decision trees. The resulting models can either be used for predictive purposes on their own, or the selected features can be fed to another algorithm. Embedded methods are able to capture interdependencies among the features better than filters and wrappers. For that reason, as well as for computational considerations, we chose to compare an embedded method, namely l1-regularized linear models, against a univariate filter. Our initial hypothesis is that only a small fraction of the available features is actually relevant for predictive purposes. We therefore let the univariate feature selection procedure retain the top 50 features, and set the regularization parameter of the l1-based feature selection so that roughly the same number of features is retained.
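The two compared selection strategies can be sketched with scikit-learn. This is an illustrative setup on synthetic data, not the paper's exact configuration; in particular, the regularization strength `C` shown here is a placeholder that would be tuned until roughly 50 features survive:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the high-dimensional credit data.
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)

# Univariate filter: keep the top 50 features by ANOVA F-score.
univariate = SelectKBest(score_func=f_classif, k=50).fit(X, y)
X_uni = univariate.transform(X)

# Embedded l1-based selection: features with non-zero coefficients survive.
# C controls sparsity and would be tuned to retain a comparable feature count.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
embedded = SelectFromModel(l1_model, prefit=True)
X_l1 = embedded.transform(X)

print(X_uni.shape, X_l1.shape)
```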

Classification and Regression Techniques
The table below summarizes the different prediction techniques compared in this article. We split them into two categories: one-step and two-step. One-step techniques perform only the binary classification task, thus labeling defaulters; instead of predicting the precise LGD value for the defaulters, a general rule is used. Two-step techniques use regression tools to predict the LGD value based on the given features.
Further, we introduce a naïve strategy of assigning an expected loss of "0" to all observations, that is, predicting that nobody will default at all. Apart from being compared to each other, all models are benchmarked against this naïve strategy, and those models that outperform it are considered viable solutions for the predictive task.
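The two-step approach and the naïve benchmark can be sketched as follows. This is a minimal illustration on synthetic data with boosted trees standing in for both stages (the paper's "BT"); the data-generating process and all parameter values are our own assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Synthetic data: ~10-15% defaulters, LGD in [0, 1] for defaulters, 0 otherwise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
default = (X[:, 0] + rng.normal(scale=0.5, size=400) > 1.2).astype(int)
lgd = np.where(default == 1, rng.uniform(0, 1, size=400), 0.0)

# Step 1: classify defaulters; step 2: regress LGD on the defaulters only.
clf = GradientBoostingClassifier(random_state=0).fit(X, default)
reg = GradientBoostingRegressor(random_state=0).fit(X[default == 1],
                                                    lgd[default == 1])

pred_loss = np.zeros(len(X))           # predicted non-defaulters keep loss 0
mask = clf.predict(X) == 1
if mask.any():
    # LGD is a share, so clip the regression output to [0, 1].
    pred_loss[mask] = np.clip(reg.predict(X[mask]), 0.0, 1.0)

naive = np.zeros(len(X))               # the naive all-zero benchmark
print("model MAE:", np.mean(np.abs(lgd - pred_loss)))
print("naive MAE:", np.mean(np.abs(lgd - naive)))
```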

Results and Discussion
The tables below summarize the results achieved without feature selection and with each of the feature selection methods applied. From these tables we can conclude that BT is the best technique. LinearSVM shows good performance in the case of l1 feature selection. BT clearly outperforms LinearSVM not only in overall performance (MAE score), but also in the classification stage (F1).
Other techniques beat the naïve strategy only in the presence of l1 feature selection (regularization). We believe this is largely due to overfitting. Overfitting may occur either due to sparseness of data caused by the many features, or due to certain properties of the underlying process that the above techniques fail to capture. One such property is a non-linear relationship between the features and LGD, which would be consistent with the good classification performance of the linear models. It is especially surprising that Ridge regression, despite being robust to overfitting, fails to produce a desirable result after successful classification.
ANOVA-F based feature selection also did not improve the performance of the techniques, failing to capture the right structure of the features. This technique is clearly too simplistic for our dataset with many features, where correlation between the features has to be taken into account.
Our results are quite surprising. The performance gap between the linear models, which hardly manage to keep up with the naïve strategy, and the non-linear model (BT) is much bigger than that discovered by Loterman et al. (2012). In their study, the difference between LOG/OLS and the best performing non-linear model (ANN) was 6% in the worst case, while in our experiments the difference was almost 50% (in the l1 feature selection case). Our results therefore strongly support the hypothesis that non-linear relationships exist in real-world loss given default data sets.

Conclusions and Future Research
We have confirmed the results of Loterman et al. (2012), namely that non-linear models perform better for the loss given default prediction task. Our result was obtained on a dataset with many more features than that of Loterman et al. (2012). The performance gap between traditional linear models and non-linear approaches, such as decision trees, was notably bigger than that found in the Loterman et al. study. Among all tested models, we found the model based on boosted decision trees and l1 feature selection to perform best in our scenario.