Loan Default Prediction for Small Industries Bank

Kelsey Heng
Towards Data Science
3 min read · Oct 19, 2019

Photo: https://thelendersnetwork.com/minimum-credit-score-needed-personal-loan/

Banks lend money to companies in exchange for the promise of repayment. Some companies will default on their loans, being unable to repay them for one reason or another. The bank maintains insurance to reduce its risk of loss in the event of default. The insured amount may cover all or only part of the loan amount.

For this assignment, the bank wants to predict which companies will default on their loans based on their financial information. The dataset provided consists of loan-related information such as loan amount, term, and state, as well as company information such as the number of employees and operating sector.

Goal

To predict whether a company will default on its loan, I tried two different machine learning algorithms: Logistic Regression and Random Forest. The assignment instructions were to use accuracy as the evaluation metric. However, precision matters more in this scenario, because the bank wants to minimize the number of loans that are predicted to be repaid but end up in default.
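To illustrate why accuracy alone can hide the errors the bank cares about, here is a minimal sketch with made-up confusion-matrix counts (they are not the project's results). It assumes that "predicted to repay" is treated as the positive prediction, so a false positive is a company that was cleared but then defaulted. The two hypothetical models have the same accuracy but different precision.

```python
# Hypothetical counts for illustration only; not taken from the project data.
# Assumed convention: positive = "company is predicted to repay".
# Both models see the same 650 repaid loans and 350 defaulted loans.
model_a = {"tp": 600, "fn": 50, "fp": 100, "tn": 250}
model_b = {"tp": 550, "fn": 100, "fp": 50, "tn": 300}


def accuracy(c):
    """Share of all predictions that are correct."""
    return (c["tp"] + c["tn"]) / (c["tp"] + c["fn"] + c["fp"] + c["tn"])


def precision(c):
    """Share of 'will repay' predictions that actually repaid."""
    return c["tp"] / (c["tp"] + c["fp"])


for name, counts in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: accuracy={accuracy(counts):.3f}, "
          f"precision={precision(counts):.3f}")
```

Both models score 0.85 on accuracy, yet model B wrongly clears half as many defaulting companies, which only shows up in precision.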

Insights / Exploratory Data Analysis

A quick look at the data revealed some insights:
1. Trading companies are the largest pool of customers
2. Smaller companies have a higher tendency to default
3. The term of a loan did not affect the likelihood of default
4. Clients who defaulted on their loans were only about half as numerous as those who did not

With this information, I proceeded to clean the data and generate new features based on these insights. The goal was to predict default status, with 0 as no default and 1 as default.
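The kind of checks behind these insights can be sketched with pandas. This is only an illustration on a tiny made-up table: the column names (`employees`, `sector`, `default_status`) and the size thresholds are assumptions, not the real schema or the features used in the project.

```python
import pandas as pd

# Tiny made-up sample; column names are hypothetical, not the real schema.
loans = pd.DataFrame({
    "employees": [3, 8, 12, 40, 150, 5, 60, 300, 2, 25],
    "sector": ["Trading", "Trading", "Services", "Trading", "Manufacturing",
               "Services", "Trading", "Manufacturing", "Trading", "Services"],
    "default_status": [1, 1, 0, 0, 0, 0, 1, 0, 1, 0],  # 0 = no default, 1 = default
})

# Class balance: how many clients defaulted versus repaid.
class_counts = loans["default_status"].value_counts()

# Largest customer segment by operating sector.
top_sector = loans["sector"].value_counts().idxmax()

# Example engineered feature: a company-size bucket (thresholds are assumed).
loans["size_bucket"] = pd.cut(
    loans["employees"],
    bins=[0, 10, 50, float("inf")],
    labels=["small", "medium", "large"],
)

# Default rate per size bucket (mean of a 0/1 column is the default share).
default_rate_by_size = (
    loans.groupby("size_bucket", observed=True)["default_status"].mean()
)

print(class_counts)
print("largest sector:", top_sector)
print(default_rate_by_size)
```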

Model

To determine which information is essential for the model to perform well, I looked at the feature importances and iterated on the feature engineering steps.
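One common way to inspect feature importance is the `feature_importances_` attribute of a fitted scikit-learn Random Forest. The sketch below runs on synthetic data where, by construction, smaller companies default more often. The feature names and the data-generating rule are assumptions made for illustration, not the project's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic, hypothetical features.
employees = rng.integers(1, 200, size=n)
loan_amount = rng.uniform(10_000, 500_000, size=n)
noise = rng.normal(size=n)  # a feature with no relation to the label

# Synthetic label: the probability of default falls as company size grows.
p_default = 1 / (1 + np.exp((employees - 40) / 15))
y = (rng.uniform(size=n) < p_default).astype(int)

X = np.column_stack([employees, loan_amount, noise])
feature_names = ["employees", "loan_amount", "noise"]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features from most to least important (importances sum to 1).
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:12s} {score:.3f}")
```

Features that rank near the bottom are candidates to drop or rework before refitting the model.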

Logistic Regression was my first model of choice because it has low time complexity. The model had an accuracy of about 70%. However, 13% of clients (109 of 821) did not make any repayment but were predicted not to default. Treating "no default" as the positive prediction, these are false positives.
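A minimal sketch of this step, fitting a logistic regression and reading the costly errors off the confusion matrix, could look like the following. It uses synthetic data, so the numbers it prints will not match the ones above. The scaling step and the split proportions are assumptions rather than the project's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (1 = default, 0 = no default).
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 4))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 0.7
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_val)

# Rows are actual classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(y_val, y_pred, labels=[0, 1])
missed_defaults = cm[1, 0]  # actually defaulted, predicted not to default
missed_share = missed_defaults / cm.sum()
val_accuracy = accuracy_score(y_val, y_pred)

print(f"validation accuracy: {val_accuracy:.3f}")
print(f"defaults predicted as no default: {missed_defaults} ({missed_share:.1%})")

# The share quoted in the article, for reference.
reported_share = 109 / 821
print(f"109 / 821 = {reported_share:.1%}")
```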

Random Forest was my next model. Its initial accuracy was 99% on the training data but only 90% on the validation data, a sign of overfitting that leads to poor predictions on unseen data. After tuning, the final model had an accuracy of 94% and no longer overfit. Voilà! This brought the share of false positives down from 13% to 3%.
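The overfitting check described here, comparing training accuracy against validation accuracy before and after tuning, can be sketched as follows. The data is synthetic, and the tuned hyperparameters (`max_depth=5`, `min_samples_leaf=10`) are assumptions chosen to show the effect; the article does not say which settings were actually tuned.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy stand-in for the loan data (1 = default, 0 = no default).
rng = np.random.default_rng(7)
n = 1500
X = rng.normal(size=(n, 6))
logits = 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2] * X[:, 3] - 0.7
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)


def train_val_accuracy(**params):
    """Fit a forest with the given settings; return (train, validation) accuracy."""
    model = RandomForestClassifier(n_estimators=200, random_state=7, **params)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)


# Fully grown trees tend to memorize the training set.
default_train, default_val = train_val_accuracy()

# Shallower trees with larger leaves (assumed settings) constrain the model.
tuned_train, tuned_val = train_val_accuracy(max_depth=5, min_samples_leaf=10)

print(f"default: train={default_train:.3f}  val={default_val:.3f}  "
      f"gap={default_train - default_val:.3f}")
print(f"tuned:   train={tuned_train:.3f}  val={tuned_val:.3f}  "
      f"gap={tuned_train - tuned_val:.3f}")
```

A large gap between training and validation accuracy is the warning sign; tuning aims to shrink that gap without giving up validation accuracy.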

Conclusion

There is no single way to determine whether a client will stop making repayments. But factors such as the term of the loan, the industry, and the size of the company contribute to a client's ability to make repayments.

The code for this project can be found on my GitHub. I can be contacted via LinkedIn if you would like to connect.

Neuroscience researcher turned analytics consultant. Huge love for data storytelling, turning numbers into fun facts!