Default risk is the chance that companies or individuals will be unable to make the required payments on their debt obligations. In other words, credit default risk is the probability that a borrower will not pay back the money you lend on time. Lenders and investors are exposed to default risk in virtually all forms of credit extensions. To mitigate the impact of default risk, lenders often impose charges that correspond to the debtor’s level of default risk. A higher level of risk leads to a higher required return.
Traditionally, default risk is gauged using standard measurement tools, including FICO scores for consumer credit, and credit ratings for corporate and government debt issues. Credit ratings for debt issues are provided by nationally recognized statistical rating organizations (NRSROs), such as Standard & Poor’s (S&P), Moody’s and Fitch Ratings.
Predicting Credit Default Risk with Machine Learning
Developments in machine learning and deep learning have made it much easier for companies and individuals to build a high-performance credit default risk prediction model for their own use.
If you are familiar with machine learning, and with classification problems, in particular, you will see that the credit default risk prediction problem is nothing but a binary classification problem. So any machine learning method that could be used for binary classification problems can be applied to credit default risk prediction problems as well.
The success of a machine learning model, however, does not depend solely on the selection of a machine learning method. Key factors contributing to the success of the machine learning model include:
Data is the prerequisite for any successful machine learning model. No matter how good your machine learning method is, you cannot get a reliable, high-performance prediction model without a sufficient amount of rich data.
Feature engineering, that is, processing raw data into suitable input for machine learning models, includes data cleaning, creating new features, and feature selection. It is usually the most time-consuming part of a machine learning project, especially when it comes to building prediction models for structured data.
Method selection matters as well. Even though many machine learning methods are available for a given problem, such as binary classification, each method has its own strengths and weaknesses. Depending on our demands and requirements, we may need to choose different methods.
Performance metrics are needed to compare methods. Given two machine learning methods, how do we evaluate them to select the better one? We need well-designed performance metrics based on our dataset and experience. For example, AUC and the F1 score are typically used for binary classification problems with unbalanced data.
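To see why accuracy alone can mislead on unbalanced data, here is a minimal sketch in plain Python. The labels are made up for illustration (95 repaid loans and 5 defaults), and the helper names `confusion_counts`, `accuracy`, and `f1_score` are our own, not taken from any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means default."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up, unbalanced sample: 95 repaid loans (0) and 5 defaults (1).
y_true = [0] * 95 + [1] * 5
always_repaid = [0] * 100  # a "model" that never predicts a default

print(accuracy(y_true, always_repaid))  # 0.95
print(f1_score(y_true, always_repaid))  # 0.0
```

A model that never flags a default looks 95% accurate here, yet its F1 score is zero because it catches none of the defaults.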
What Data Do We Need?
As explained, data is the prerequisite for any successful machine learning model. For credit default risk prediction, we need at least the transaction, credit-bureau, and account-balance data that allow us to compute and update measures of consumer credit risk much more frequently than the sluggish credit-scoring models currently employed in the industry and by regulators. Take the Home Credit Default Risk data as an example. It contains:
- This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
- Static data for all applications. One row represents one loan in our data sample.
- All previous client credits provided by other financial institutions that have been reported to the Credit Bureau (for clients who have a loan in our sample).
- For every loan in our sample, there are as many rows as the number of credits the client had had in the Credit Bureau before the application date.
- Monthly balances of previous credits in the Credit Bureau.
- This table has one row for each month of history of every previous credit reported to the Credit Bureau, i.e., the table has (# of loans in sample × # of relative previous credits × # of months where we have some history observable for the previous credits) rows.
- Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample, i.e., the table has (# of loans in sample × # of relative previous credits × # of months in which we have some history observable for the previous credits) rows.
- Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
- This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample, i.e., the table has (# of loans in sample × # of relative previous credit cards × # of months where we have some history observable for the previous credit card) rows.
- All previous applications for Home Credit loans of clients who have loans in our sample.
- There is one row for each previous application related to loans in our data sample.
- Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
- There is a) one row for every payment that has been made plus b) one row for each missed payment.
- One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
We can see that the data can be divided into three categories:
- Applicant-level data which contains information about the applicant, such as education, number of family members, car owned, etc.
- Bureau-level data which provides historical transactional information and credit balance information.
- Other data, including external data from other data sources such as credit scores from other platforms, etc.
In principle, the more and the richer data we have, the better the credit default risk prediction model we can build.
What Methods Can Be Used?
As explained, because credit default risk prediction itself is a binary classification problem, any machine learning method that can be used in binary classification problems is theoretically applicable. But each algorithm has its own strengths and weaknesses. In this part, we will focus on three main algorithms: logistic regression, decision tree, and gradient boosted decision tree, to discuss the strengths and weaknesses of each algorithm in terms of credit default risk prediction problems.
Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
Logistic Regression is one of the most popular ways to fit models for categorical data, especially for binary response data in data modeling. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, these probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data. The coefficients of the model also provide some hints about the relative importance of each input variable.
Logistic Regression is used when the dependent variable (target) is categorical. In credit default risk prediction, our target variable is binary: 1 if the borrower fails to repay, 0 otherwise.
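As a rough illustration of how logistic regression models P(Y=1) as a function of X, the sketch below fits a one-feature model with plain batch gradient descent. The data (a debt-to-income ratio per applicant), the learning rate, and the function names are assumptions made for this example; in practice you would normally use an established library rather than hand-written training code.

```python
import math


def sigmoid(z):
    """Map any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """P(Y=1 | x) for one applicant with feature vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def fit(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Made-up data: one feature (debt-to-income ratio); target 1 = default.
X = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

weights, bias = fit(X, y)
p_low = predict_proba(weights, bias, [0.15])
p_high = predict_proba(weights, bias, [0.85])
print(round(p_low, 3), round(p_high, 3))
```

The output is always a probability strictly between 0 and 1. On this toy data, the applicant with the higher ratio receives the higher predicted default probability.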
Pros and Cons of Logistic Regression
The strength of logistic regression can be summarized as follows:
- Simple, fast and low-memory usage. In the logistic model, for each feature variable, there is only one corresponding weight variable. So regardless of whether you update them during training or apply the model in prediction, it will be very fast and with low memory demand.
- Interpretable. It is easy to see the effects of each feature of the model, which is very important in finance and is also one of the reasons why the model is still widely used today.
- With good feature engineering, the performance can also be really good.
- It is easy to convert the model result to a specific strategy and deploy.
But no model is perfect. The weaknesses of logistic regression can be listed as follows:
- Prone to underfitting. Its performance also tends to be lower than that of ensemble models.
- Demanding of data quality: it is sensitive to missing values and anomalous values, and it cannot capture non-linear relationships without additional feature engineering. This means data cleaning and feature engineering will take quite a lot of time.
- Not good at dealing with unbalanced data, high-dimensional feature sets, and categorical features.
So if your main concerns are stability, simplicity, and interpretability, logistic regression, though simple, is still a good choice.
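The interpretability point can be made concrete: in a logistic model, exp(weight) is the factor by which the odds of default change when the feature increases by one unit. The coefficients below are hypothetical values invented for illustration, as are the feature names `CREDIT_TO_INCOME` and `YEARS_EMPLOYED`.

```python
import math

# Hypothetical fitted coefficients (made up for illustration only).
coefficients = {"CREDIT_TO_INCOME": 0.8, "YEARS_EMPLOYED": -0.5}


def odds_ratios(coefs):
    """exp(w): multiplicative change in the odds of default per unit increase."""
    return {name: math.exp(w) for name, w in coefs.items()}


def rank_by_effect(coefs):
    """Feature names ordered by absolute coefficient size, largest first."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)


for name, ratio in odds_ratios(coefficients).items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of default)")
```

Comparing coefficient sizes in this way is only meaningful when the features are on comparable scales, for example after standardization.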
A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (categorical or continuous value). There are quite a few articles about decision trees. For example, this article gives you a detailed explanation of decision trees, including what a decision tree is, how to generate trees, how to do pruning, and why we should use decision trees.
Pros and Cons of Decision Tree
Like logistic regression, the decision tree method has its strengths and weaknesses. The pros can be described as follows:
- Easy to understand (if…then… rules-like structure) and interpretable.
- Less data pre-processing is needed compared to logistic regression. No need to do feature discretization and data normalization.
- Well suited to capturing non-linear relationships between the features and the target.
There are also some weaknesses:
- It’s easy to generate extremely complex tree structures, which leads to overfitting.
- It’s not a good choice for high-dimensional data.
- Decision trees generalize poorly and can’t deal with values that do not appear in the training dataset.
The decision tree is a simple but powerful machine learning model. Unlike logistic regression, it can deal with missing values. Data normalization is not needed. And although the decision tree method itself has the weaknesses mentioned above, some tree-based methods like random forest or gradient boosted decision tree methods have solved most of the mentioned problems and even bring some more power to the decision tree method.
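To make the "if…then" structure concrete, the following sketch finds the single best split on one feature by minimizing Gini impurity, which is one common splitting criterion. A full tree would apply this search recursively to each side of the split. The income figures and labels are invented for the example, and the function names are ours.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure node)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the threshold on one feature that minimizes the weighted Gini."""
    best_threshold, best_score = None, float("inf")
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up data: annual income in thousands; target 1 = default.
income = [20, 25, 30, 35, 60, 70, 80, 90]
default = [1, 1, 1, 0, 0, 0, 0, 0]

threshold, score = best_split(income, default)
print(f"if income <= {threshold} then predict default, else predict repaid")
```

Note that no scaling or normalization of the income values was needed: only their ordering matters for the split.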
Unlike logistic regression and decision tree methods, ensemble learning combines the predictions of several base estimators built with a given learning algorithm in order to improve generalizability and robustness over a single estimator.
The three most popular methods for combining the predictions from different models are:
- Bagging: draw samples from the dataset, build a base model on each sample, and then combine all base models. For classification, use majority voting; for regression, use averaging. Random forest is an example.
- Boosting: the base models are trained sequentially. The first base model is trained, and the samples are reweighted according to its performance, so that the samples it predicted wrongly receive more attention. The adjusted sample is then used to train the next base model. This process is repeated N times, and the N base models are weighted and combined to output the final result. Commonly used algorithms are GBDT and XGBoost.
- Stacking: a method of combining models in layers. Taking two layers as an example, the first layer is composed of multiple base learners whose input is the original training set. The second-layer model is then trained on the outputs of the first-layer base learners, which yields the complete stacking model.
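The reweighting idea behind boosting can be sketched with an AdaBoost-style loop over one-rule base models (decision stumps). Note that this is AdaBoost, chosen because it matches the sample-reweighting description above most directly; GBDT and XGBoost instead fit each new tree to the gradient of the loss. The data and all function names are assumptions for illustration, and labels are coded as +1/-1 here rather than 1/0.

```python
import math


def stump_predict(x, threshold, sign):
    """One-rule base model: predict sign if x > threshold, otherwise -sign."""
    return sign if x > threshold else -sign


def fit_stump(xs, ys, weights):
    """Pick the (threshold, sign) pair with the lowest weighted error."""
    best = None
    for threshold in xs:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best


def adaboost(xs, ys, n_rounds):
    """Train n_rounds stumps, reweighting the samples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, sign)
    for _ in range(n_rounds):
        err, threshold, sign = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this base model
        ensemble.append((alpha, threshold, sign))
        # Misclassified samples get larger weights, correct ones smaller.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all base models."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score > 0 else -1


# Made-up 1-D data that no single threshold can separate.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1]

ensemble = adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs] == ys)
```

Any single stump misclassifies at least three of these nine points, while the weighted combination of three stumps fits all of them.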
Pros and Cons of Ensemble Learning
Advantages of the ensemble algorithm:
- Ensembling is a proven method for improving the accuracy of a model. For example, on Kaggle, one of the most popular data science platforms in the world, almost all winning solutions in structured data modeling challenges have used ensembling methods such as XGBoost and LightGBM.
- Ensemble makes the model more robust and stable, thus ensuring decent performance on the test cases in most scenarios.
- You can use ensemble to capture linear and simple as well as non-linear complex relationships in the data. This can be done by using two different models and forming an ensemble of the two.
Disadvantages of the ensemble algorithm:
- Ensemble reduces the model interpretability and makes it difficult to draw any crucial business insights. This is not good news when it comes to applications in finance, for example, credit default risk prediction.
- It is time-consuming and thus might not be the best idea for real-time applications. This should not be an issue in terms of credit default risk prediction, in which real-time prediction is not needed.
- Selecting the models that make up an ensemble takes experience. This, however, has been made easier nowadays thanks to libraries such as XGBoost and LightGBM.
In terms of Home Credit Default Risk, we can find that winning solutions are using a LightGBM model as one of their core models. This ensembling method is fast and almost always delivers high-level performance, compared to other non-ensembling methods. So if performance and stability are your most important criteria, the ensembling method is for you.
What is the AUC – ROC Curve?
The AUC – ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells you to what extent the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy with medical diagnosis, the higher the AUC, the better the model is at distinguishing between patients with a disease and patients without it.
The ROC curve is plotted with the TPR (True Positive Rate, which is True Positive / [True Positive + False Negative]) against the FPR (False Positive Rate, which is False Positive / [False Positive + True Negative]), where TPR is on the y-axis and FPR is on the x-axis. Unlike accuracy, ROC is widely used for imbalanced data, which is the case in credit default risk prediction, because it captures the trade-off between the true positive rate (recall) and the false positive rate across all thresholds. And compared to the F1 score, which combines precision and recall at one fixed threshold, ROC does not require you to manually choose a single threshold for the prediction probability to decide whether the prediction output is positive or negative.
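The construction of the curve can be sketched in a few lines of plain Python: sweep the threshold from the highest predicted score to the lowest, record one (FPR, TPR) point per threshold, and integrate with the trapezoidal rule. The labels and scores are made up, and the function names are ours rather than a library API.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step marks more samples as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: 1 = default, scores are predicted default probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_points(y_true, scores))
print(auc(roc_points(y_true, scores)))  # 0.75
```

No threshold had to be chosen by hand: the AUC summarizes the ranking quality over all thresholds at once. An AUC of 1.0 would mean every default is scored above every repaid loan.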
Performance Comparison of Different Methods
This article compares the performance of four different algorithms in a credit risk modeling problem. This table shows that no matter what feature set we use, ensembling methods (random forest and boosting) always give the best performance, while the GAM model (a variant of the linear regression model) always gives the worst performance. But another thing we can find is that the performance difference between different algorithms while using the same feature set, which is within 4%, is much smaller than the improvement when we add more features, in this case, behavioral features.
Using one performance metric is not always enough. One machine learning method may be better than another in terms of one performance metric but worse with regard to other performance metrics. In this paper, the author compares the performance metrics (AUC and RMSE) using four algorithms: logistic regression (M1), random forest (M2), gradient boosting model (M3), and neural networks with different settings (D1, D2, D3, D4).
AUC and RMSE in Comparison
As we can see, in terms of both AUC and RMSE, ensembling methods (random forest and gradient boosting model) are much better than the other models. The performance of the neural networks, in contrast, is not stable. The difference between the neural networks lies in the number of layers and the number of neurons in each layer. The performance difference between them can be as big as 0.17 in terms of AUC and 0.2 in terms of RMSE. Another thing we can find is that D3 is the best of the four neural networks in terms of AUC, but the worst in terms of RMSE. This is exactly why it is better to evaluate a method with more than one performance metric. We can see that ensembling methods are stable in terms of both performance metrics.
The author also makes another comparison, in which case, all models are trained using only some of the most important features (originally 181 features, now 10 features).
From this comparison table, we can draw the following conclusions:
- The performance drop for ensembling models (random forest, gradient boosting) is much smaller than the drop for other machine learning methods (logistic regression, neural network).
- Using the ensembling method, with a small feature set, one can achieve relatively high-performance machine learning, whereas other machine learning methods need more feature engineering or more collected features to get a well-performing model.
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. If feature engineering is done correctly, it increases the predictive power of machine learning algorithms by creating features from raw data that help facilitate the machine learning process. Feature engineering is an art. In a typical data science project, and especially in a data science challenge, it takes up most of the time.
You may ask why we need feature engineering. Feature engineering transforms raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
Here we still use Home Credit Default Risk as an example. Feature engineering can be divided into three categories:
- Interactions between different features: For example, if we are given AMT_ANNUITY (the annuity of each credit loan), AMT_INCOME_TOTAL (total income of the applicant per annum), and DAYS_EMPLOYED (total days of being employed), together with the credit amount AMT_CREDIT, then we can create interaction features such as AMT_INCOME_TOTAL / DAYS_EMPLOYED and AMT_CREDIT / AMT_INCOME_TOTAL, which may add more information to the model and give it some non-linear capability.
- Aggregations: Usually, we create groups based on certain features. Then we extract statistical features from each group, such as the maximum, minimum, mean, and standard deviation.
- Feature engineering is a flexible and task-oriented job. One can try whatever seems reasonable and see if it works.
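The two kinds of features described above, ratios between columns and per-group statistics, can be sketched as follows. All row values are made up. The key `SK_ID_CURR` and the bureau column `AMT_CREDIT_SUM` are assumed names used for illustration, as are the output feature names. In practice this is usually done with a dataframe library and its group-by functionality; plain Python is used here to keep the sketch self-contained.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Made-up applicant rows; SK_ID_CURR is an assumed applicant key.
applications = [
    {"SK_ID_CURR": 1, "AMT_CREDIT": 400000.0,
     "AMT_INCOME_TOTAL": 100000.0, "DAYS_EMPLOYED": 2000},
    {"SK_ID_CURR": 2, "AMT_CREDIT": 150000.0,
     "AMT_INCOME_TOTAL": 50000.0, "DAYS_EMPLOYED": 500},
]
# Made-up previous credits per applicant; AMT_CREDIT_SUM is an assumed column.
bureau = [
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 10000.0},
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 30000.0},
    {"SK_ID_CURR": 2, "AMT_CREDIT_SUM": 5000.0},
]


def add_ratio_features(row):
    """Return a copy of one applicant row with two ratio features added."""
    out = dict(row)
    out["CREDIT_TO_INCOME"] = row["AMT_CREDIT"] / row["AMT_INCOME_TOTAL"]
    out["INCOME_PER_DAY_EMPLOYED"] = row["AMT_INCOME_TOTAL"] / row["DAYS_EMPLOYED"]
    return out


def aggregate(rows, key, column):
    """Group rows by key and summarize one numeric column per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {
        k: {"max": max(v), "min": min(v), "mean": mean(v), "std": pstdev(v)}
        for k, v in groups.items()
    }


bureau_stats = aggregate(bureau, "SK_ID_CURR", "AMT_CREDIT_SUM")

# Join the per-applicant aggregates back onto the main table.
features = []
for row in applications:
    enriched = add_ratio_features(row)
    for name, value in bureau_stats.get(row["SK_ID_CURR"], {}).items():
        enriched[f"BUREAU_CREDIT_SUM_{name.upper()}"] = value
    features.append(enriched)

print(features[0]["CREDIT_TO_INCOME"], features[0]["BUREAU_CREDIT_SUM_MEAN"])
```

The aggregation step is what turns the many-rows-per-applicant tables into one row per applicant, which is the shape the models discussed earlier expect.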
In this article, we introduced the concept of credit default risk. Further, we sketched out different methods that have been applied historically. Also, we discussed what factors should be considered when building a successful machine learning model. Then we focused on three main methods that are widely used in today’s industry. We compared their strengths and weaknesses to advise on how to choose the most suitable method based on current needs.
- Credit Risk Analysis Using Machine and Deep Learning Models
- Default Risk
- LightGBM 7th Place Solution
- Home Credit Default Risk
- Machine Learning: Challenges, Lessons, and Opportunities in Credit Risk Modeling
- Consumer Credit Risk Models via Machine-Learning Algorithms
- Decision Tree
- AREA UNDER THE RECEIVER OPERATOR CURVE (AUC)
- Area Under the ROC Curve — Explained
- An Intro to Ensemble Learning in Machine Learning
- Chapter 4: Decision Trees Algorithms
- Logistic Regression
About Record Evolution
We are a Data Science & IoT team based in Frankfurt am Main, committed to helping companies of all sizes to innovate at scale. So we’ve built an easy-to-use development platform enabling everyone to benefit from the powers of IoT and AI.