Applying Complementary Credit Scores to Calculate Aggregate Ranking

Researchers have been improving credit scoring models for decades, as an increase in the predictive ability of scoring even by a small amount can allow financial institutions to avoid significant losses. Many researchers believe that ensembles of classifiers or aggregated scorings are the most effective. However, ensembles outperform base classifiers by thousandths of a percent on unbalanced samples. 
This article proposes an aggregated scoring model. In contrast to previous models, its base classifiers are focused on identifying different types of borrowers. We illustrate the effectiveness of such scoring aggregation on real unbalanced data. 
As the effectiveness indicator we use the performance measure of the area under the ROC curve. The DeLong, DeLong and Clarke-Pearson test is used to measure the statistical difference between two or more areas. In addition, we apply a logistic model of defaults (logistic regression) to the data of company financial statements. This model is usually used to identify default borrowers. To obtain a scoring aimed at non-default borrowers, we employ a modified Kemeny median, which was initially developed to rank companies with credit ratings. Both scores are aggregated by logistic regression. 
Our data cover Russian banks that existed or defaulted between July 1, 2010, and July 1, 2015. This sample of banks is highly unbalanced, with a concentration of defaults of about 5%. The aggregation was carried out for banks with several ratings. 
We show that aggregated classifiers based on different types of information significantly improve the discriminatory power of scoring even on an unbalanced sample. Moreover, the absolute value of this improvement surpasses all the values previously obtained from unbalanced samples. 
The aggregated scoring and the approach to its construction can be applied by financial institutions to credit risk assessment and as an auxiliary tool in the decision-making process thanks to the relatively high interpretability of the scores.


Introduction
Scoring models have been developing for decades. Researchers have proposed and compared different approaches to data preparation for model construction and approaches to selecting factors which influence credit quality and their generation. They have also studied the best approaches to assessing credit score capability/accuracy and the credit score methods themselves. This was done to improve scoring accuracy, insofar as a gain or loss of a percentage point in accuracy can lead to multimillion profits or losses for banks and other financial institutions [1].
Over the past ten years, scholars have believed that the best practice is to use machine learning models [2] and so-called "ensembles" [3] to construct credit scores. The basic idea of an ensemble lies in the aggregation of modelled base classifiers (scores) with the help of a model/algorithm. There exist different classifications of ensembles [3][4][5]; however, their division into bagging, boosting, and stacking ensembles is the most common. Bagging is the combination of several independent scorings (base classifiers, weak learners) constructed in a parallel way on the basis of independent random samples. Random forests are a well-known example of bagging. Boosting is the aggregation of several successively constructed base scorings. Stacking is the combination of different base classifiers (for example, logistic regression and a decision tree) that are trained simultaneously. They are combined in the ensemble model (strong learner), which includes different voting rules, statistical models and machine learning methods. The ensemble paradigm makes ensembles relevant: several aggregated classifiers usually show greater discriminatory power/accuracy than a single classifier [5]. Nevertheless, some researchers have shown that ensemble models sometimes fail to surpass machine learning methods in regard to certain criteria [4; 6]. Also, their practical applicability is usually limited: in most cases, they are "black boxes" which are difficult to interpret because machine learning methods and other ensembles are often used as the base classifiers. Therefore, some researchers [7] have attempted to simplify the interpretability of ensemble models, including machine learning methods.
In this paper, we focus on using complementary weak learners to calculate aggregate rankings. We chose the logistic model of defaults and a modified Kemeny median [8] as two weak classifiers of this type due to their relatively high interpretability. We consider them to be complementary for our purposes for the following reasons. The logistic model (regression) is usually trained to identify default borrowers using corporate financial statements. In other words, the first weak learner is focused on default borrowers.
The modified Kemeny median has been proposed as a tool for credit rating aggregation. Usually, it is companies with better-than-average creditworthiness that seek credit ratings, since they are ready to disclose to a rating agency more information than just financial statements. So, this ranking is potentially aimed at non-default companies.
We propose to use logistic regression as the strong learner. It should be noted that logistic regressions, including ridge and lasso, were used as ensemble models in [1; 9] and proved to be superior to other methods considered in these papers.
Our study is based on a sample of banks during the period between July 1, 2010, and July 1, 2015. This sample is characterized by a low default concentration of 5.76%. Financial performance indicators, identifiers of external or government support, and ratings of credit rating agencies were used to create rankings. It was shown that the aggregation of two base classifiers focused on the identification of different types of borrowers results in an improvement of the predictive power of aggregated credit scoring in comparison to base classifiers.
The interpretability of weak and strong learners makes it possible to use aggregate rankings not only as an additional parameter for decision making in financial institutions but also to evaluate default probability in risk management [10; 11]. The proposed weak learners constitute the scientific novelty of this paper: they were trained using potentially complementary information (ratings and financial statements). We know of only one similar study [12] that trained weak learners using market indicators and financial statements. However, the ensemble did not outperform the base classifier in discriminatory power [12].

Literature Review
The number of papers devoted to credit scoring methods has grown exponentially over the past 30 years [3]. In the last five years, researchers have continued their attempts to improve credit scoring for legal entities [13][14][15]. However, the majority of papers make use of databases of natural persons [16]. The reason is that such databases are in open access and available for parsing. These samples have been used to compare well-known approaches to credit scoring calculation [17] and propose new ones [18]. Different ensembles [18; 19] and logistic regressions [20] have been identified as the best scoring methods. In addition, papers dedicated to the comparison of well-known methods often consider neural networks [21] and decision trees [22] to be the best.
Such a diversity of best methods is partially explained by the wide range of simultaneously applied classification quality criteria. Many authors [4; 9] agree that it is better to use several model performance measures at once. Nevertheless, other researchers [23; 24] continue to apply only conventional measures calculated from a confusion matrix.
In this paper, we propose looking at credit scoring aggregation from a slightly different perspective. Usually, only one type of data is used to create base scorings: financial statements or characteristics of natural persons [25] or company market indicators [13]. Indicator categories from financial statements complement each other, and machine learning methods can be applied to assess the nonlinear relations between them. However, the creditworthiness of a company may be characterized by factors that are recorded only partially or not at all in statements. Credit ratings may potentially complement the indicators of corporate financial statements: companies disclose more information to credit rating agencies (CRAs) than one can find in the public domain [26]. 
In addition, companies with a better creditworthiness, all other things being equal, tend to resort to CRAs: such companies are developing and need external ratings to expand into new markets, for example. Thus, one may conjecture that the collective opinion of credit rating agencies may complement information from financial statements.
In this paper, we will use classical logistic regression as the base classifier and as the aggregating model. This practice was applied in [9; 27], which showed the advantage of this approach over base classifiers. In order to calculate base classifiers, a preliminary preparation of data is carried out. One of the stages of preliminary preparation is parameter selection by means of forward feature selection. Nevertheless, it is necessary to describe the data sample before we explain the methodology in detail, because the choice of methods depends on the data.
1 URL: https://www.cbr.ru/credit/

Data
The main data pool comprises publicly available information on 958 banks for the period between July 1, 2010, and July 1, 2015, which represents approximately 80% of all banks operating in the Russian Federation during this period. Of these banks, 134 had two or more ratings assigned by seven credit rating agencies: AK&M, Expert RA (EXP), National Rating Agency (NRA), RusRating (RUS), Fitch Ratings, Moody's Analytics, and Standard & Poor's. This data pool was formally divided into three parts: data on banks up to July 2014, data on banks after July 2014, and data on banks with two or more credit ratings.
Data on banks up to and including July 2014 comprises 70% of the observations of the main pool or 13,570 observations. The default concentration is 4.6%. In terms of default/non-default observations, this sample is highly unbalanced. It comprises indicators from bank report forms 101 and 102 and statutory requirements information (form 135) posted on the website of the Bank of Russia 1 and information on support from the Russian government or foreign banks. This sample was used to train the logistic model of defaults.
Data on banks after July 2014 consists of 4,261 observations with a default concentration of 9.25%. The list of indicators was the same as in the sample described above. This sample was used to test the logistic model of defaults.
Data on banks with two or more credit ratings is part of the two samples described above. This sample consists of observations on 134 banks. The sample size is 1,700 observations, 17 of which are defaults. This sample is also unbalanced and has a default concentration of 2.72%. In addition to the indicators described above, it includes CRA ratings. For the purposes of creating scoring ratings, categories were assigned numerical values, where 0 was attributed to the highest rating category of each credit rating agency (CRA). Then, the numerical value of each lower category was increased by 1. As the last two columns of Table 1 show (source: author's calculations), the number of assigned rating categories varied greatly from agency to agency.
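The numerical encoding of rating categories described above can be sketched as follows. This is a minimal illustration: the agency names and the shortened scales in `SCALES` are hypothetical placeholders, not the scales used in the study.

```python
# Hypothetical rating scales, ordered from the best category to the worst.
# Real agency scales are longer; these short lists are illustrative only.
SCALES = {
    "agency_a": ["AAA", "AA", "A", "BBB", "BB", "B"],
    "agency_b": ["A++", "A+", "A", "B++", "B+"],
}


def encode_rating(agency, category):
    """Return 0 for the agency's best category, 1 for the next one, and so on."""
    return SCALES[agency].index(category)


def encode_bank(ratings):
    """Encode a dict {agency: category}; agencies that did not rate the bank are absent."""
    return {agency: encode_rating(agency, cat) for agency, cat in ratings.items()}


bank = {"agency_a": "BBB", "agency_b": "A+"}
print(encode_bank(bank))  # {'agency_a': 3, 'agency_b': 1}
```

Because each agency has its own scale length, the same number may mean different credit quality for different agencies, which is one reason a rank-based aggregation is used instead of averaging the codes.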
If we consider previous papers that, in one way or another, studied CRA ratings using Russian data (for example, [28]), we see that the general distribution of agency ratings has changed little. The most frequent ratings are the lowest ratings in the investment grade or the best ratings in the speculative grade. The data on ratings is taken from the RUData system 2. Consensus and aggregate rankings are calculated using this sample.
Because of the low default concentration and small size of the sample of banks with several ratings, it cannot be divided into training and test samples to create a logistic model. This is why samples of banks with one or no ratings are also used in this study.

Methodology
This chapter consists of several parts. "Logistic Regression" describes the preliminary preparation of data for making a scoring using the logistic model of defaults, the logistic model itself, and ways of validating it. "Modified Kemeny Median" has a similar structure. "Aggregation" describes the mechanism for aggregating the two rankings obtained from the logistic model and the modified Kemeny median. "Model Power Indicator" describes the tool applied to verify the efficiency (power) of obtained rankings.

Logistic Regression
Linear prediction of the logistic model of defaults or the "continuous" rating of the defaults prediction model is used as the first baseline ranking (classifier) [29]. Due to its simplicity, transparency, interpretability and a relatively high discriminatory power, this scoring model continues to be the industry standard [3; 28].
Data preparation. In this paper, observations with missing data were not used for building the logistic regression. Such an approach is frequently used for calculating credit scorings [23; 24], insofar as it does not generate a bias of estimators due to an inappropriately chosen way of imputation of missing values [30]. The forward stepwise selection method was used to select features for the logistic model. At each step, this approach adds a candidate variable to the set of variables already found significant; the candidate is retained if it is significant and significantly improves the model. In spite of its simplicity, this approach is still widely used to select parameters [16].
2 URL: https://rudata.info/
Logistic model. The default indicator $y_i$ of bank $i$ is driven by a latent variable $y_i^*$. In turn, $y_i^*$ depends linearly on $X$, the factors that may predict the bank's creditworthiness:
$$y_i^* = X_i'\beta + \varepsilon_i.$$
These factors may be continuous or categorical quantities that represent relevant financial, macroeconomic and other indicators. In this case, the probabilities of a bank being default or non-default are, respectively,
$$P(y_i = 1) = F(X_i'\beta), \qquad P(y_i = 0) = 1 - F(X_i'\beta),$$
where $X_i'$ is the transposed matrix of factors describing the bank's creditworthiness, $\beta$ is the vector of coefficients, $\varepsilon$ is an unobservable random component with logistic distribution, and $F$ is the logistic distribution function. The linear predictions are calculated as follows:
$$R_{contin} = X_i'\hat{\beta}.$$
In this paper, $R_{contin}$ is used as one of the base scorings focused on default borrowers.
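To make the link between the fitted logistic model and the continuous score concrete, the sketch below fits a logistic regression by plain gradient ascent on a toy data set and then computes the linear prediction for each observation. It is only an illustration under stated assumptions: the data, the single factor, the learning rate and the function names are all hypothetical, and the paper does not say which estimation software was used.

```python
import math


def sigmoid(z):
    """Logistic distribution function F(z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_prediction(x, beta):
    """Continuous score: intercept plus the weighted sum of the factors."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def fit_logit(X, y, lr=0.5, steps=5000):
    """Maximum likelihood by plain gradient ascent (illustrative, not efficient)."""
    beta = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, target in zip(X, y):
            err = target - sigmoid(linear_prediction(x, beta))
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Toy data (hypothetical): one factor per bank, label 1 = default.
X = [[0.1], [0.4], [0.5], [0.9], [1.2], [1.5]]
y = [0, 0, 1, 0, 1, 1]

beta = fit_logit(X, y)
scores = [linear_prediction(x, beta) for x in X]   # the continuous ranking
probs = [sigmoid(s) for s in scores]               # implied default probabilities
```

Ranking banks by `scores` or by `probs` gives the same order, because the logistic function is strictly increasing; the linear prediction is simply the more convenient scale for later aggregation.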
Validation. The complete sample of banks is used to build the logistic model, regardless of whether they have a rating or not. This sample is divided into training and test subsamples on an out-of-time basis: the model is trained on earlier observations and tested on later ones. Such a validation method is used in credit scoring studies [31; 32].
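An out-of-time split of this kind can be sketched in a few lines. The record layout and bank names below are hypothetical; the cutoff of July 1, 2014 follows the training period stated in the Results section, with observations dated on the cutoff itself assigned to the training part.

```python
from datetime import date


def out_of_time_split(rows, cutoff):
    """Rows dated up to and including the cutoff go to training, later rows to test."""
    train = [r for r in rows if r["date"] <= cutoff]
    test = [r for r in rows if r["date"] > cutoff]
    return train, test


# Hypothetical bank-period observations.
rows = [
    {"bank": "A", "date": date(2012, 7, 1), "default": 0},
    {"bank": "B", "date": date(2014, 7, 1), "default": 1},
    {"bank": "A", "date": date(2014, 10, 1), "default": 0},
    {"bank": "C", "date": date(2015, 7, 1), "default": 1},
]

train, test = out_of_time_split(rows, cutoff=date(2014, 7, 1))
```

Unlike a random split, this keeps every test observation strictly later than every training observation, so the test AUCROC reflects how the model would have performed on data that did not yet exist when it was estimated.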

Modified Kemeny Median
Data preparation. Unlike the previous method, observations with missing data for certain variables were used for building a modified Kemeny median (consensus ranking). To create a consensus ranking, we used the ratings of seven rating agencies operating in Russia from July 2010 to July 2015.

Modified Kemeny median.
Another base classifier is represented by the Kemeny median [8], whose application results in the so-called "consensus ranking". This method is based on the interpretation of a credit rating as a relative ranking of objects in accordance with a CRA's opinion on the credit quality of each object. Given the specific nature of ratings as expert information, we modified the concept of the Kemeny distance between rankings. This made it possible to find a unique solution that least contradicts the opinions of rating agencies with an acceptable accuracy within an acceptable time. The resulting (aggregated) rating $R_{cons}$ minimizes the total rank distance to the $m$ aggregated ratings $R_k$, $k = 1, \dots, m$, together with an additional (secondary) criterion weighted by the regularization parameter $\lambda$. Here, the rank measure of distance between ratings $R'$ and $R''$ is the number of contradictory rankings over all pairs of companies, $\lambda$ sets the relative significance of the secondary criterion, and the secondary criterion shows the extent of contradiction significance and depends on a rating's inconsistency with the other ratings.
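The rank distance and the idea of a consensus ranking can be illustrated with the classical (unmodified) Kemeny median. The sketch below is not the modified method of [8]: it omits the secondary criterion and the regularization parameter, searches strict orderings by brute force, and is usable only for a handful of objects. The bank names and grades are hypothetical.

```python
from itertools import combinations, permutations


def rank_distance(r1, r2):
    """Number of pairs of objects that the two ratings order in opposite ways.

    Ratings are dicts {object: numeric grade}, lower grade = better.
    Pairs not rated by both sources, or tied in either, add nothing.
    """
    common = sorted(set(r1) & set(r2))
    d = 0
    for a, b in combinations(common, 2):
        if (r1[a] - r1[b]) * (r2[a] - r2[b]) < 0:
            d += 1
    return d


def kemeny_median(ratings):
    """Brute-force classical Kemeny median over strict orderings (tiny inputs only)."""
    objects = sorted(set().union(*ratings))
    best, best_cost = None, None
    for perm in permutations(objects):
        candidate = {obj: pos for pos, obj in enumerate(perm)}
        cost = sum(rank_distance(candidate, r) for r in ratings)
        if best_cost is None or cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


# Three hypothetical agencies rating three banks (0 = best grade).
ratings = [
    {"bank_a": 0, "bank_b": 1, "bank_c": 2},
    {"bank_a": 0, "bank_b": 2, "bank_c": 1},
    {"bank_a": 1, "bank_b": 0, "bank_c": 2},
]

consensus, cost = kemeny_median(ratings)
print(consensus, cost)  # {'bank_a': 0, 'bank_b': 1, 'bank_c': 2} 2
```

In this toy case the minimum is unique, but with real data several orderings can share the minimal cost and the search space grows factorially, which is the practical motivation for a modified distance that yields a unique solution in acceptable time.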
Validation. It is impossible to apply common validation measures such as cross-validation types to this method. The reason is that the modified Kemeny median is a result of a non-parametric approach that cannot be used for another sample directly without mapping.
The Kemeny median was originally a voting method that was subsequently used as an aggregator of credit ratings for banks. The collective opinion of credit rating agencies may complement information from financial statements: companies disclose to credit rating agencies information which may be absent from publicly available data. Thus, it is expected that the combination of the logistic model of defaults built on publicly available data and the ranking obtained from CRA ratings will surpass these base classifiers.

Aggregation
Logistic regression is applied as a strong classifier in this paper. The binary default/non-default variable $y_i$, arranged in the same way as in function (1), still serves as the dependent variable. However, to create an aggregated scoring, $y_i$ is predicted using the following two factors:
$$P(y_i = 1) = F(\gamma_0 + \gamma_1 R_{contin} + \gamma_2 R_{cons}),$$
where $\gamma_j$ is a coefficient obtained from estimating the logistic regression by the maximum likelihood method and $F$ is the logistic distribution function. The aggregated scoring itself is calculated as follows:
$$R_{agg} = \hat{\gamma}_0 + \hat{\gamma}_1 R_{contin} + \hat{\gamma}_2 R_{cons}.$$
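As a small numerical illustration of the aggregation step, the sketch below combines a logit score and a consensus rank into one aggregated score. The coefficients in `GAMMA` are hypothetical values chosen for the example, not estimates from the paper, and the presence of an intercept is an assumption; in practice the coefficients would come from maximum likelihood estimation.

```python
import math

# Hypothetical coefficients: intercept, weight of the logit score,
# weight of the consensus rank (0 = best rank, larger = worse).
GAMMA = (-3.0, 0.8, 0.15)


def aggregate_score(r_contin, r_cons, gamma=GAMMA):
    """Linear predictor of the second-stage logistic regression."""
    g0, g1, g2 = gamma
    return g0 + g1 * r_contin + g2 * r_cons


def default_probability(score):
    """Logistic distribution function applied to the aggregated score."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical banks: (logit linear prediction, consensus rank).
banks = {"bank_a": (-1.2, 2), "bank_b": (0.4, 9), "bank_c": (-0.3, 15)}

scores = {name: aggregate_score(*pair) for name, pair in banks.items()}
ranking = sorted(scores, key=scores.get)  # safest bank first
print(ranking)  # ['bank_a', 'bank_b', 'bank_c']
```

Here the logit score alone would place bank_c ahead of bank_b, while the consensus rank places bank_b ahead of bank_c; the aggregated score weighs both sources of information.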

Model Power Indicator
We use the indicator of the area under the ROC curve (hereafter, AUCROC) as a measure of the discriminatory power of all scorings. This indicator is appropriate for unbalanced samples, in particular because it takes different errors into account [1, p. 2]. In addition, this indicator does not underrate or overrate its values due to erroneous classification or default distribution [7, p. 38]. The resulting indicator values should be interpreted as follows: the closer the AUCROC value to 1, the greater the discriminatory power of the credit indicator. This indicator is described in more detail in [33].
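The AUCROC has a simple pairwise interpretation that can be computed directly: it is the share of (default, non-default) pairs in which the default observation receives the riskier score, with ties counted as one half. The sketch below uses this definition on hypothetical scores; it is quadratic in the sample size and is meant for illustration only.

```python
def auc_roc(scores, labels):
    """AUC as the share of (default, non-default) pairs ranked correctly.

    Higher score = riskier, label 1 = default; ties count one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for two non-default and two default observations.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auc_roc(scores, labels))  # 0.75
```

Because the measure depends only on the ordering within pairs, it does not change when the share of defaults in the sample changes, which is why it suits unbalanced samples.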
The statistical significance of differences between the AUCROC of base classifiers and the aggregated model is defined by means of the DeLong, DeLong and Clarke-Pearson test [34] at a 10% significance level.
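For two scorings evaluated on the same observations, the test can be sketched as follows. This is a minimal pure-Python version of the paired comparison based on the structural components of the AUC, with a two-sided p-value from the normal approximation; the data are hypothetical, the function names are illustrative, and at least two observations of each class are assumed.

```python
import math


def _psi(x, y):
    """Pairwise kernel: 1 if the default scores higher, 0.5 on a tie, else 0."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(scores, labels):
    """Structural components: one value per default and one per non-default."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    v10 = [sum(_psi(p, n) for n in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, n) for p in pos) / len(pos) for n in neg]
    return v10, v01


def _cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, labels):
    """Return (auc_a, auc_b, z, two-sided p-value) for two paired scorings."""
    a10, a01 = _components(scores_a, labels)
    b10, b01 = _components(scores_b, labels)
    auc_a, auc_b = sum(a10) / len(a10), sum(b10) / len(b10)
    m, n = len(a10), len(a01)
    var = ((_cov(a10, a10) + _cov(b10, b10) - 2 * _cov(a10, b10)) / m
           + (_cov(a01, a01) + _cov(b01, b01) - 2 * _cov(a01, b01)) / n)
    if var <= 0:
        raise ValueError("zero variance: the two scorings rank identically")
    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return auc_a, auc_b, z, p_value


# Hypothetical data: three defaults followed by five non-defaults.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.35, 0.05]
scores_b = [0.6, 0.2, 0.7, 0.5, 0.3, 0.1, 0.4, 0.65]

auc_a, auc_b, z, p_value = delong_test(scores_a, scores_b, labels)
```

The covariance terms are what make the test paired: two scorings computed on the same banks are correlated, and ignoring that correlation would misstate the variance of the AUCROC difference.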

Results
This section deals with the discriminatory powers of credit scorings made with the help of base classifiers and through the aggregation of scorings.

Logistic Model of Defaults
The model was trained on a sample of Russian banks from the period July 1, 2010 to July 1, 2014. The following factors were selected:
1) Ratio of the deposits of a legal entity to its bank assets.
2) Regulatory requirement of "the biggest possible credit risks" N7.
4) Amount of granted short-term credits.
The AUCROC of the obtained logistic model is equal to 68.26%. In the test sample, the AUCROC is 70.03%. The AUCROC consistency in these two non-overlapping samples indicates that the model has not been overfitted. In the sample of banks with two or more ratings, the AUCROC is equal to 71.4% (Figure 1). The steeper rise of the ROC curve near the origin means that the default model identifies default banks better. According to the quality criterion of scoring models from [35], this model shows a good discriminatory power from a practical standpoint. This is confirmed by the results of [9; 21], which built logistic regressions using unbalanced samples. In such a case, the AUCROC of the logistic model usually lies in the range 60-74%.

Consensus Ranking
The consensus ranking was calculated on the basis of a sample consisting of banks with two or more ratings. The consensus AUCROC is equal to 71.28% (Figure 2). This ranking identifies trustworthy borrowers better, as the right part of the ROC curve is almost horizontal.
The reason for this is that this aggregated rating is based on information about banks which have a rating in the first place. This is a positive signal for the market: the bank is not afraid of its creditworthiness assessment and can afford it in practice. This ranking also has high discriminatory power from a practical standpoint and is as good as statistical models and machine learning methods in a low-default environment [36; 37]. According to the DeLong, DeLong and Clarke-Pearson test, the consensus ranking is statistically indistinguishable from the logistic model of defaults at a 10% significance level (p-value = 99.3%).

Aggregate Ranking
The aggregated ranking was built from the two previous rankings. Logistic regression was the aggregating model. We obtained a scoring with the AUCROC equal to 76.16% (Figure 3).
Statistically, the ranking surpasses the two base scorings at a 10% significance level, showing the relevance of aggregating several baseline rankings and ensembling. In addition, one should note that the aggregated scoring combines the best characteristics of both baseline rankings: it identifies default and non-default borrowers with similar precision. Moreover, previously proposed versions of aggregate classifiers showed a growth in the AUCROC not exceeding 3% [38; 39] on unbalanced samples.
Figure 3. Aggregate ranking, AUCROC = 76.16%

Conclusion
Financial institutions need to identify both default and non-default contractors or customers in order to enable their management to make informed decisions when solving risk management problems. In this paper, we propose the aggregation of credit scorings made with methods focused on different types of borrowers: the logistic model of defaults and the modified Kemeny median. Logistic regression is used as the strong learner.
Our data sample consists of Russian banks from the period July 2010 to July 2015, including credit ratings. From a practical standpoint, the discriminatory power of baseline rankings is high and typical for credit scorings in a low-default environment. However, their aggregation using logistic regression resulted in a significant growth in the discriminatory power of scoring. Moreover, this increment surpassed the increments of ensembles or aggregated rankings on unbalanced samples described in earlier literature. Since the applied classifiers demonstrate a relatively high interpretability, such a model can also be used by financial institutions for risk management.
In further research, feature engineering techniques (for example, principal component analysis) may be applied to construct explanatory factors, provided the obtained index is interpretable. It is also possible to expand the set of base scorings by adding market scorings and some other interpretable scorings obtained, for example, from discriminant analysis, decision trees, etc.