News Sentiment in Bankruptcy Prediction Models: Evidence from Russian Retail Companies

This study investigates the application of news sentiment analysis to bankruptcy prediction models in the context of the Russian retail sector. We analyse 190 companies: 95 Russian retail companies that went bankrupt in 2015-2019, and 95 non-defaulting analogue companies. This sample was drawn from a larger pool of 312 companies retrieved from the Spark database on the basis of relevant financial data, and was further restricted to companies with pertinent news media coverage within 3 years of the default date. The methodological base of this analysis is the logistic regression approach, used as a benchmark model, and several machine learning models: random forest, support vector machine, and multilayer perceptron. The predictor set consists of 34 financial variables and sentiment variables, aggregated using the 'bag-of-words' method from a total sample of 4877 news articles from more than 800 distinct online sources. We establish a set of hypotheses based on a review of the existing literature, and evaluate them on the basis of our analysis. Our results show that sentiment variables are statistically significant, and that adding sentiment variables improves the performance of bankruptcy prediction models. The results also indicate some reference characteristics of companies in terms of word choice and descriptions in the news, identifying word choices correlated with financial stability and those correlated with financial instability.


Introduction
Accurate analyses of a firm's financial stability are essential for multiple aspects of a company's planning and strategic processes, and are relied upon by other market participants, particularly banks. The probability of default, along with loss given default and exposure at default, is an essential component of credit risk modelling. The difficulty of forecasting bankruptcy on the basis of financial reporting is that the actual results of company reporting can be analysed only in the year subsequent to publication, which makes short-term bankruptcy forecasting more challenging. Using news media resources for forecasting purposes assists with short-term predictions by providing more current data for analysis. The COVID-19 pandemic of 2020 has caused the largest global recession in history. The overall effects of COVID-19 are not yet apparent, but it is already becoming clear that the number of bankruptcies will rise enormously. The most novel and accurate methods of bankruptcy prediction are therefore especially relevant today. One of the main trends in bankruptcy prediction is the application of new sources of information, which leads to an increase in the accuracy of the models. One of the most promising sources of information is textual data, which can be obtained from corporate disclosures, news, and social networks [1]. News sentiment analysis has been successfully used to predict stock price dynamics [2], but at the time of writing only a very small number of articles have been dedicated to its application to the bankruptcy prediction problem. The advantage of using the news as a source of information is the availability and frequency of updated data, in comparison with traditional financial data and corporate disclosure sources, which are mostly updated once a year or quarterly [3].
Unlike in studies of the impact of news sentiment on stock prices (where sentiment directly affects the value of company shares), the impact of news on the financial instability of a company should be understood as a description of an event which led to certain consequences for the company. In other words, in the latter case the news affects the probability of default indirectly. The methodological base of this work includes machine learning methods, namely random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP), together with the logistic regression approach, which is used as a benchmark model. For the sentiment variable aggregation, the 'bag-of-words' model [4] is used, along with the Linis Crowd dictionary [5].

Machine learning in bankruptcy prediction models
Machine learning methods have received more attention than statistical methods, and in comparison with linear models, machine learning models provide higher accuracy.
As part of a review of 89 articles related to bankruptcy prediction models, Aziz [6] indicates an average accuracy ratio of 88% for machine learning and 84% for statistical models. Also, machine learning methods do not place heavy restrictions on the input data and are thus able to capture complex and non-linear patterns. On the other hand, the 'black box' results are not stable, are difficult to interpret, and tend towards overfitting [7]. According to Shi [8], the three most used methods of machine learning are:
1) Decision tree (DT). Unlike other machine learning methods, this is easy to interpret and can be displayed visually [9]. Tsai [10] claims that DT reaches the highest accuracy in comparison with other methods.
2) Artificial neural network (ANN), which shows a stable level of high performance in fitting nonlinear data, which in turn allows it to deal with very complex patterns. Ciampi [11] shows that ANN outperforms the Multiple Discriminant Analysis (MDA) and LR methods within a sample of 7000 companies, and is good at dealing with data omissions. However, ANNs are prone to overfitting, which limits how well their results generalise [12]. Moreover, Ding [13] claims that while ANN finds only locally optimal results, the support vector machine method (see below) achieves globally optimal results.
3) Support vector machine (SVM). Unlike ANN, SVM controls the generalisation error. SVM has been successfully applied to high-dimensional nonlinear data and small datasets [14].

Sentiment analysis in finance
Sentiment analysis is a field of research based on methods of natural language processing, dedicated to identifying emotional attitudes either in relation to the object under discussion in a text, or in relation to the text as a whole.
One of the first studies of textual analysis in the field of finance is that of Kohut [15]. The study suggests that the content of company presidents' letters differs between companies with high and poor financial performance. According to Kearney [3], sentiment analysis in finance can be divided into the following groups according to the source of information:
• Public corporate disclosures (annual reports and press releases) [16]. The style and content of corporate disclosures signal the company's current situation and may contain useful information about future financial performance from the corporation's point of view. The limitation of this source of information is the low frequency of the data: it is available only for a small number of the biggest companies, and the disclosures are made on a quarterly or annual basis. Moreover, companies tend to try to manipulate public opinion to their benefit [17].

Higher School of Economics
• Media, news articles and analysts' reports [2]. Such texts express observers' opinions about the overall financial and economic conditions, or about a particular industry or company. The advantage of the news source is that news media and similarly published articles are available at all times and are frequently updated.
• Internet messages and social media networks [18]. Social media networks are a potentially useful source of textual information because many people spend a considerable amount of time every day on the internet. However, internet messages, as opinions of common people, are among the noisiest sources of information, because such judgments are often irrational and are made by non-professionals [3].
• Other or combined sources [19].
The most common methods of sentiment analysis in finance are machine learning [16] and dictionary-based approaches [20]. In the machine learning approach, the text is divided into tokens, e.g. by sentence, by word, or by combination of words. Each token is labelled with a category, and the machine learning algorithms predict these labels using the set of tokens. The basis of the dictionary-based approach is a predefined dictionary with words arranged according to categories (e.g. positive and negative). Each word from the text is mapped to its dictionary category. The dictionary-based method is associated with the 'bag of words' [4] because texts are considered to be unsorted sets of words. The dictionary-based methods vary in how the dictionaries are defined and in how each word is weighted. The issue with the dictionary-based approach is that it is context-dependent: some words may have different tonalities in different topics. This has led to the creation of topic-specific dictionaries, e.g. finance-specific dictionaries. For example, the finance dictionary of [21] outperforms the traditional non-specific Harvard dictionary in financial performance prediction and fraud detection.
Since the first wave of explosive interest in sentiment analysis, research has focused on the sentiment of English-language texts. However, no one has thus far succeeded in creating a successful multilanguage sentiment dictionary. The dictionary of Russian sentiment was created in 2016 by [5].
Each word found in such a dictionary may be weighted equally or have some weighting rule attached. The proportional scheme is called 'term frequency' (TF):
$$TF_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}} \tag{1}$$
where $n_{ij}$ is the frequency of word $i$ in document $j$, and the sum in the denominator runs over all words $k$ in document $j$.
Another weighting is called 'inverse document frequency' (IDF):
$$IDF_{i} = \log \frac{N}{df_{i}} \tag{2}$$
where $N$ is the number of all documents and $df_{i}$ is the number of documents that contain word $i$.
Finally, term frequency-inverse document frequency (TF-IDF) is the result of multiplying (1) by (2):
$$TF\text{-}IDF_{ij} = TF_{ij} \times IDF_{i} \tag{3}$$
The main idea behind the IDF approach is that the most frequently occurring words are the least informative. [21] argues that the TF-IDF method outperforms the TF method. However, Azam [22] mentions that the TF method performs better on smaller datasets. The article by Chen [23] suggests a more controlled term weighting method. Mai [24] uses TF-IDF weights for building a deep learning model.
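The three weighting schemes can be illustrated with a short computation. The following is a minimal sketch, assuming that TF is normalised by the total number of words in the document and that IDF uses the natural logarithm without smoothing; the function names and the toy documents are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def tf(doc_tokens):
    """Term frequency: share of each word among all words of one document."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def idf(docs):
    """Inverse document frequency: log(N / df) for every word in the corpus."""
    n_docs = len(docs)
    # df counts the documents that contain the word at least once.
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / d) for word, d in df.items()}


def tf_idf(docs):
    """TF-IDF: product of the TF weight and the IDF weight of each word."""
    idf_weights = idf(docs)
    return [
        {word: weight * idf_weights[word] for word, weight in tf(doc).items()}
        for doc in docs
    ]


# Toy corpus (hypothetical, already tokenised).
docs = [
    ["debt", "court", "debt"],
    ["growth", "profit"],
    ["court", "profit", "growth", "growth"],
]
weights = tf_idf(docs)
```

In this toy corpus 'debt' occurs in only one document, so it receives the largest IDF weight, which reflects the idea that words spread across many documents are less informative.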

Hypothesis
As a result of our literature review, the following hypotheses are thus articulated:

Hypothesis 1. TF-IDF word weights significantly increase bankruptcy prediction model performance in comparison with TF word weights.
TF-IDF is the more widely used weighting scheme [3], whereas TF weighting is more accurate in small datasets [22].

Hypothesis 2. The number of news items has a statistically significant impact on the probability of default.
To test how news items influence the probability of default, we should first check whether news coverage influences the probability of default separately from the news content.

Hypothesis 3. The news sentiment has a statistically significant impact on the probability of default. The application of the news sentiment variable significantly increases the model's performance.
The evidence of the significance of textual sources other than news in financial insolvency predictions is provided in [25] and [24].

Hypothesis 4. Negative news has a more significant effect on the probability of default than positive news does.
The positive / negative influence of positive / negative news in the context of stock market activity is confirmed in articles [26], [27]. Leung [28] claims that positive news articles do have influence on the market. Apergis [29] shows that negative news articles influence the stock market more than positive ones.

Methodology and Data
For modelling bankruptcy in this study, we use four simple and effective methods that are already established in the above-mentioned literature: logistic regression (LR), random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP). To evaluate the predictive performance of the machine learning analysis, the AUC-ROC performance measurement approach is preferred due to the balance between the true positive and the true negative rate [30].
To reduce the effect of the high variation between different splits and provide robust results, a 5-fold cross-validation was conducted. Firstly, the sample was randomly split into 5 equal parts: subsample 1 (S1), ..., subsample 5 (S5). Then, S2 + S3 + S4 + S5 is used as the training set, while S1 is used as the test set. By repeating this step 5 times, each subsample is used as the test set once. Thus, we get 5 AUC-ROC values and can calculate the mean AUC-ROC and standard deviation. To test the significance of the change in model performance, we conduct paired sample t-tests on the AUC-ROC metrics for the different splits and models, following the methodology of [24] and [31]. The analysis is performed in Python using the 'scikit-learn' library.
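The evaluation procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the folds are stratified by default flag (an assumption added here so that every fold contains both classes, whereas the text describes a simple random split), the AUC-ROC is computed from pairwise score comparisons, and `fit_single_feature` is a hypothetical stand-in for the four models. Only the t-statistic is computed; a p-value could be obtained with `scipy.stats.ttest_rel`.

```python
import math
import random
import statistics


def stratified_folds(y, k=5, seed=0):
    """Split sample indices into k folds, keeping both classes in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        idx = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def auc_roc(y_true, scores):
    """AUC-ROC as the share of (default, non-default) pairs ranked correctly."""
    pos = [s for label, s in zip(y_true, scores) if label == 1]
    neg = [s for label, s in zip(y_true, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_val_auc(X, y, fit, k=5, seed=0):
    """Train on k-1 folds, test on the remaining fold, return k AUC values."""
    aucs = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores = [predict(X[i]) for i in test_idx]
        aucs.append(auc_roc([y[i] for i in test_idx], scores))
    return aucs


def paired_t_statistic(a, b):
    """t-statistic of the paired differences a - b (len(a) - 1 degrees of freedom)."""
    diffs = [x - z for x, z in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def fit_single_feature(X_train, y_train):
    """Hypothetical toy model: score by the first feature, oriented towards defaults."""
    mean_pos = statistics.mean(x[0] for x, label in zip(X_train, y_train) if label == 1)
    mean_neg = statistics.mean(x[0] for x, label in zip(X_train, y_train) if label == 0)
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda x: sign * x[0]


# Toy data (hypothetical): 10 non-default and 10 default observations.
X = [[i] for i in range(20)]
y = [0] * 10 + [1] * 10
aucs = cross_val_auc(X, y, fit_single_feature)
mean_auc, std_auc = statistics.mean(aucs), statistics.stdev(aucs)
```

Comparing two models then amounts to passing their two lists of fold-level AUC-ROC values, obtained on the same folds, to `paired_t_statistic`.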

Textual analysis methods
For Hypothesis 1 (Model 1), the matrices of TF and TF-IDF frequencies are calculated, with 658 dictionary words as columns and articles as rows (the 'bag-of-words' method). This array of columns represents the set of predictor variables. The matrix dimension is (number of articles) x (number of dictionary words mentioned in articles). For companies with more than one news article, the default flag is duplicated for each article.
For Hypotheses 2-4 (Models 2-4), the news articles are aggregated by company, and TF and TF-IDF statistics for all dictionaries are calculated (again, the 'bag-of-words' method). The matrix dimension is (number of companies) x (number of dictionary words mentioned in articles). Next, sentiment variables are calculated as sums over the columns of the TF and TF-IDF weight arrays. For each company, positive sentiment is the sum of the weights (either TF or TF-IDF) of positive words; negative sentiment is the sum of the weights (TF or TF-IDF) of negative words multiplied by (-1); and sentiment is the sum of positive and negative sentiment. For TF it simply takes the form:
$$Sentiment_{c} = \sum_{i \in D^{+}} TF_{ic} - \sum_{i \in D^{-}} TF_{ic}$$
where $D^{+}$ and $D^{-}$ are the sets of positive and negative dictionary words, and $TF_{ic}$ is the TF weight of word $i$ in the aggregated news texts of company $c$.
The lexical base for the Russian sentiment analysis is obtained from the Linis Crowd dictionary [5], the first Russian sentiment dictionary. Linis Crowd is an HSE open-source project that contains a sample of internet texts on socio-political topics with user ratings, and a sentiment dictionary based on these texts. The sentiment for each word is based on the average score, which is scaled from -2 to 2. After processing all ratings and deleting neutral words, 2719 words were left. Words with positive sentiment are considered positive, and words with negative sentiment are considered negative. The lexical base was expanded with 186 antonyms, synonyms, and single-root words. The final dictionary consists of 2905 words in initial form (1027 positive and 1878 negative).
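The aggregation of the sentiment variables for one company can be sketched as follows. This is a minimal illustration using TF weights only; the two word sets are hypothetical English stand-ins for the positive and negative categories of the Linis Crowd dictionary, and the function names are not taken from the study.

```python
from collections import Counter

# Hypothetical toy dictionary; the study uses the Russian Linis Crowd dictionary.
POSITIVE = {"growth", "profit"}
NEGATIVE = {"debt", "court", "fraud"}


def tf_weights(tokens):
    """TF weights of all words in the aggregated news texts of one company."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def sentiment_features(tokens):
    """Positive, negative and overall sentiment from TF weights."""
    weights = tf_weights(tokens)
    positive = sum(w for word, w in weights.items() if word in POSITIVE)
    # Negative sentiment is the sum of negative-word weights multiplied by (-1).
    negative = -sum(w for word, w in weights.items() if word in NEGATIVE)
    return {
        "positive": positive,
        "negative": negative,
        "sentiment": positive + negative,
    }


# Tokens of all articles about one (hypothetical) company.
company_tokens = ["debt", "court", "growth", "shop"]
features = sentiment_features(company_tokens)
```

For the toy tokens above, two of the four words are negative and one is positive, so the overall sentiment is negative.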

Database
The sample of companies was assembled from the Spark database, based on the following criteria. Russian companies from the retail sector with a default between 2015 and 2019 (and their non-default pairs) were selected. Financial data from one year before the default was considered. News data was taken for the three-year period prior to the default date. The 'size of companies' metric was established in terms of 'micro', 'small', 'medium' and 'big' companies, with revenues of more than 50 million RUB. As a result, a sample of 312 (156 default and 156 non-default) companies was collected.

Textual factors
As a source of textual data, Yandex News was preferred to other news aggregators [32], e.g. Google News or Yahoo News, and to databases, e.g. Spark or Thomson Reuters [33], for the following reasons:
1) It covers the entire observation period (2012-2019). For example, Google News and Spark store news only for the last year.
2) Yandex News aggregates many online journals, even small regional ones, which results in a high level of news coverage for both small and big Russian private companies.
3) Yandex's advanced algorithms ensure high search relevance for Russian-language queries.
For each company in the sample, the following web search is performed:
1) Query: the company's full name in Russian. If there are several companies with a similar name, a keyword is added to the company's name (place of registration or industry).
2) Options: the time period from 3 years before the date of default to the date of default.
The total number of articles is 4877, from more than 800 different online journals. News items were found for 95 company pairs out of 156 pairs. Text preprocessing is done using Python libraries: Natural Language Toolkit NLTK 3.4.5 [34], Pymorphy2 0.8 [35], and base Python libraries. After preprocessing, all words are matched with the 'positive' and 'negative' dictionary categories. The most frequent negative words and the most frequent positive words are presented in Figure 1. The word clouds are made with the Python 'word cloud' library. From Figure 1, we can see that the most frequent negative words are related to crime and legal issues.
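The matching of preprocessed words with the dictionary categories can be sketched as follows. This is a simplified illustration that uses only the Python standard library: tokenisation is done with a regular expression, the lemmatisation step (done with Pymorphy2 in the study) is omitted, and the dictionaries and articles are hypothetical English stand-ins.

```python
import re
from collections import Counter

# Hypothetical toy dictionaries standing in for the Linis Crowd categories.
NEGATIVE_WORDS = {"court", "fraud", "debt"}
POSITIVE_WORDS = {"growth", "profit"}

# Runs of letters (Latin or Cyrillic); digits and punctuation are dropped.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def category_counts(articles, dictionary):
    """Count how often each dictionary word occurs across all articles."""
    counts = Counter()
    for text in articles:
        counts.update(token for token in tokenize(text) if token in dictionary)
    return counts


# Hypothetical article texts.
articles = [
    "Court opens fraud case against retailer; debt grows.",
    "Retailer reports profit growth despite court hearing.",
]
negative_counts = category_counts(articles, NEGATIVE_WORDS)
positive_counts = category_counts(articles, POSITIVE_WORDS)
top_negative = negative_counts.most_common(3)
```

Frequency counts of this kind are the input that a word cloud such as Figure 1 visualises.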

Results
In general, the results are satisfactory. A high average accuracy of bankruptcy prediction is achieved, and some significant information is extracted from the news. The implications for our hypotheses are as follows:
Hypothesis 1. The data do not provide enough evidence in support of Hypothesis 1.
Hypothesis 2. The data do not provide enough evidence in support of Hypothesis 2.
Hypothesis 3. The hypothesis is not rejected. The news sentiment has a significantly negative impact on the probability of default, and adding the news sentiment variable significantly increases the model performance.
Hypothesis 4. The hypothesis is not rejected. The negative news sentiment has a greater impact on the probability of default than positive news sentiment does. Also, in the case of SVM and MLP, adding the negative news sentiment variable results in a statistically higher average performance than adding a positive news sentiment variable does.
We use 34 financial control variables from Sample 2, and perform a one-factor analysis. After all the transformations, the selected factors are included in the model (Model 0). The variables that are included with negative coefficients reduce the risk of default (net income margin and cash ratio), and those that are included with a positive sign increase the risk of default (payables / revenue, and revenue / mid-year inventories). For the benchmark performance, the machine learning analysis for Model 0 is performed. The initial performance of the model is not very high (Table 2). This may be partly explained by the small sample size and the presence of both small and big companies. Articles with a comparably small dataset, e.g. [14] (240 variables and accounting-based variables) and [36] (a sample of 107 default companies), report a comparable average AUC-ROC performance of 60-80%.

Logistic regression analysis
The number of news items is not a significant variable (Table 3). So, the data do not provide enough evidence in support of Hypothesis 2. This model is excluded from further analysis in the next subchapter.

News sentiment is a significant variable, so the first part of Hypothesis 3 is not rejected at a confidence level of more than 99.99% (Table 4). The news sentiment has a significant negative impact on the probability of default, which corresponds to the supposition that the greater the number of positive words (as characteristics of positive events) and the smaller the number of negative words (as characteristics of negative events), the lower the probability of default. Negative sentiment is a significant variable at the 90% confidence level, whereas positive sentiment is not (Table 5). The absolute value of the coefficient in the logistic regression is larger for the negative sentiment. The results of Table 5 permit us to infer that our Hypothesis 4 is not rejected.
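The direction of this effect can be illustrated with the logistic function. The following is a minimal sketch with invented numbers: the intercept and the sentiment coefficient are hypothetical and are not the estimates reported in Table 4; only the negative sign of the coefficient follows the result above.

```python
import math


def default_probability(intercept, coefficients, features):
    """Logistic model: P(default) = 1 / (1 + exp(-(b0 + sum(b_k * x_k))))."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values, chosen for illustration only.
INTERCEPT = 0.0
COEFFICIENTS = {"sentiment": -2.0}

# A company with predominantly negative news versus one with positive news.
p_negative_news = default_probability(INTERCEPT, COEFFICIENTS, {"sentiment": -0.5})
p_positive_news = default_probability(INTERCEPT, COEFFICIENTS, {"sentiment": 0.5})
```

With a negative coefficient, a lower (more negative) sentiment value yields a higher estimated probability of default, and a higher sentiment value yields a lower one.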
One of the main challenges of bankruptcy prediction is to increase the explanatory power of the models. In the next subchapter, we investigate how model accuracy changes when news sentiment variables are added.

Machine learning analysis
The first step of the analysis is to compare the TF (Model 1a) and TF-IDF (Model 1b) performance. For this purpose, 658 unique dictionary words mentioned in the text are tested as default predictors with TF and TF-IDF weights. Table 6 shows that the mean difference between the performance of Model 1b and Model 1a is not statistically different from zero, which does not provide any evidence for the acceptance of Hypothesis 1. As TF-IDF statistics are more widely used, the aggregated sentiment variable with the TF-IDF weighting scheme is used in the subsequent models.
The average AUC-ROC performance (Table 6) is relatively low: the performance of the analysis of the total word collection is comparable to the one-factor analysis for financial variables. However, combined with aggregated control variables, the text variable could improve the performance of the model and, more importantly, leads to the following finding concerning word significance: the presence of legal issues, crime, accidents, debt, conflicts, prohibitions, and low quality of delivered products or services in news articles correlates with companies having serious problems that may lead to bankruptcy (Table 7).
Adding the sentiment variable to the control variables allows us to significantly increase the average AUC-ROC performance (Table 9), which indicates that the second part of Hypothesis 3 is not rejected. The RF reaches its highest accuracy (90%) in this part of the analysis.
The last step of our analysis involves checking whether negative news sentiment and control variables (Model 4b) yield a larger increase in model quality than positive news sentiment and control variables (Model 4a). The results of the one-tailed paired sample t-test (Table 6) are ambiguous: at the 10% significance level, the mean difference between the performance of Model 4b and Model 4a is statistically greater than zero only for SVM and MLP.
All methods perform at nearly the same level of accuracy, which is a surprising result for the LR. The first explanation for this result is that machine learning algorithms have a lower performance on small datasets. Another explanation is that data preprocessing and feature selection are done according to the logistic regression assumptions and specifics, which may increase the LR performance. The final possible explanation concerns the specifics of the machine learning models. RF may perform worse than LR if the formula in model training contains a high proportion of essential predicting factors [37]. According to Salazar [38], RF and SVM perform at nearly the same level when the number of predictors is small. Gaudart [39] claims that neural networks do not outperform linear regression under normality, homoscedasticity, and independence of the errors, and our preprocessing methodology brings our data very close to the normal distribution. The overfitting effect of the small sample size was minimised by the application of 5-fold cross-validation. The small number of predictors also helps to make the results of the logistic regression and the machine learning methods comparable.