Making Small Data Big(ger)

Don’t let your model choice shrink your data.
Author: Michael Thomas

Published: February 19, 2022

One of the caveats of working with traditional, vanilla machine learning algorithms (e.g., linear regression, logistic regression, decision trees) is that each case (row) in your training dataset must be independent. Let’s explore what exactly this means.

The way those algorithms learn is by looking at a bunch of cases: the circumstances surrounding each case, and what the outcome was for each case. The algorithm then tries to boil all of these cases down to a handful of rules that do a pretty good job of explaining how certain circumstances generally lead to particular outcomes.

Applications in Credit Risk

When we think about credit risk models, the cases might be a bunch of historical loans in our portfolio whose outcomes we already know. To provide a simplified example, let’s suppose we are building a logistic regression model where the possible outcomes are (1) the loan was paid back in full, or (2) the borrower defaulted on the loan.

| Loan ID | Debt Coverage Ratio | Credit Score | Industry Outlook | Outcome      |
|---------|---------------------|--------------|------------------|--------------|
| 1001    | 1.334               | 711          | Fair             | Paid in Full |
| 1002    | 1.163               | 760          | Poor             | Default      |
| 1003    | 1.858               | 740          | Excellent        | Paid in Full |
| 1004    | 0.767               | 724          | Poor             | Paid in Full |
| 1005    | 1.346               | 712          | Above Average    | Default      |
| 1006    | 1.966               | 697          | Excellent        | Paid in Full |
| 1007    | 1.502               | 671          | Excellent        | Default      |
| 1008    | 1.117               | 743          | Fair             | Paid in Full |

In order to create the above dataset to train our model, we had to aggregate each loan into a single observation, so that each row represents a unique loan. Remember, each case in our training data must be independent; for us this means that we cannot have any loan appear more than once. There are many approaches to this aggregation, which we won’t cover today; for now, just remember that the approach you take should be driven by what information will be available at the time you score a new loan.
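To make that concrete, here is a minimal sketch of fitting the vanilla logistic regression on the aggregated, loan-level table, written in Python with pandas and scikit-learn. The column names mirror the table above; the pipeline is just one illustrative way to handle the categorical predictor, not a prescribed recipe.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One row per loan, mirroring the first few rows of the aggregated table
loans = pd.DataFrame({
    "loan_id": [1001, 1002, 1003, 1004],
    "debt_coverage_ratio": [1.334, 1.163, 1.858, 0.767],
    "credit_score": [711, 760, 740, 724],
    "industry_outlook": ["Fair", "Poor", "Excellent", "Poor"],
    "outcome": ["Paid in Full", "Default", "Paid in Full", "Paid in Full"],
})

X = loans[["debt_coverage_ratio", "credit_score", "industry_outlook"]]
y = (loans["outcome"] == "Default").astype(int)  # 1 = Default, 0 = Paid in Full

# One-hot encode the categorical predictor, then fit the logistic regression
model = Pipeline([
    ("encode", ColumnTransformer(
        [("outlook", OneHotEncoder(handle_unknown="ignore"), ["industry_outlook"])],
        remainder="passthrough",
    )),
    ("logit", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# Score a new loan using the same predictors we trained on
new_loan = pd.DataFrame({
    "debt_coverage_ratio": [1.25],
    "credit_score": [705],
    "industry_outlook": ["Good"],
})
print(model.predict_proba(new_loan)[:, 1])  # predicted probability of default
```

Note that this model only ever sees one snapshot per loan; everything that happened to the loan before that snapshot has already been squeezed out by the aggregation.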

Aggregation is Limiting

When we take the step of aggregating our data into unique loan-level observations, we are naturally reducing the amount of data we have to work with. If you have tons of data, this isn’t an issue. But one issue we run into often is severe class imbalance in our outcome. In other words, we tend to have a lot more “Paid in Full” cases than we have “Default” cases.

“Remember, each case in our training data must be independent; for us this means that we cannot have any loan appear more than once.”

But what if we didn’t have to satisfy that independence assumption? What if we didn’t have to aggregate our data? After all, the original data in our database probably looks a lot more like this:

| Loan ID | Date       | Debt Coverage Ratio | Credit Score | Industry Outlook | Status       |
|---------|------------|---------------------|--------------|------------------|--------------|
| 1001    | 2021-06-01 | 1.925               | 749          | Excellent        | Current      |
| 1001    | 2021-07-01 | 1.674               | 705          | Good             | Current      |
| 1001    | 2021-08-01 | 1.334               | 711          | Fair             | Paid in Full |
| 1002    | 2021-02-01 | 1.199               | 764          | Good             | Current      |
| 1002    | 2021-03-01 | 1.163               | 760          | Poor             | Default      |
| 1003    | 2021-09-01 | 0.644               | 800          | Above Average    | Current      |
| 1003    | 2021-10-01 | 2.654               | 728          | Good             | Current      |
| 1003    | 2021-11-01 | 1.858               | 740          | Excellent        | Paid in Full |

This type of data is sometimes referred to as “longitudinal” data, and represents observations of the same subject(s) over multiple points in time. In our case, the “subjects” are the unique loans. Clearly, the rows in this type of dataset are not independent, since we see the same loan appear more than once.
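To make the connection to the earlier table explicit, here is a small sketch, assuming the longitudinal records sit in a pandas DataFrame called `history` (mirroring a few rows of the table above), showing one simple aggregation: keeping each loan’s most recent observation. It is just one of the many possible approaches alluded to earlier.

```python
import pandas as pd

# Longitudinal records: one row per loan per month, mirroring the table above
history = pd.DataFrame({
    "loan_id": [1001, 1001, 1001, 1002, 1002],
    "date": pd.to_datetime([
        "2021-06-01", "2021-07-01", "2021-08-01",
        "2021-02-01", "2021-03-01",
    ]),
    "debt_coverage_ratio": [1.925, 1.674, 1.334, 1.199, 1.163],
    "credit_score": [749, 705, 711, 764, 760],
    "industry_outlook": ["Excellent", "Good", "Fair", "Good", "Poor"],
    "status": ["Current", "Current", "Paid in Full", "Current", "Default"],
})

# One simple aggregation: keep each loan's final monthly observation,
# collapsing the history back down to one row per loan
loan_level = (
    history.sort_values("date")
           .groupby("loan_id", as_index=False)
           .last()
           .drop(columns="date")
           .rename(columns={"status": "outcome"})
)
print(loan_level)
```

Whichever aggregation you choose, the result is the same: the monthly detail gets collapsed into a single row per loan.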

What’s to be Gained

Suppose the independence condition didn’t exist, and we could use this longitudinal data to build our logistic regression model. What would we gain by doing so?

  • More Data: For starters, we would have a lot more data! In situations where we don’t have a ton of data to begin with, each row of data we do have is really important. This is especially true when our data suffers from class imbalance: we need as much information about “Default” loans as possible to help our model develop those general rules (and avoid overfitting).
  • More Signal: Second, we can give our model insight into a loan’s history in a way that we weren’t able to with our aggregated dataset. For example, it’s probably important to distinguish between a loan that defaulted after being on the books for 3 years and one that defaulted after 3 months. You can think of this as incorporating an entire additional predictor variable into our model (see the short sketch after this list).
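As a rough illustration of that second point, here is a hedged sketch, re-using the hypothetical `history` DataFrame from the previous example, of deriving a “months on the books” field that simply does not exist once the data has been aggregated to one row per loan. The assumption that a loan’s earliest observed date approximates its origination date is ours, for illustration only.

```python
# Assumption (illustrative): each loan's earliest observed date approximates
# its origination date
origination = history.groupby("loan_id")["date"].transform("min")

# How long each loan had been on the books at each monthly observation
history["months_on_books"] = (
    (history["date"].dt.year - origination.dt.year) * 12
    + (history["date"].dt.month - origination.dt.month)
)

# A default at month 3 now looks different to the model than one at month 36
print(history[["loan_id", "date", "months_on_books", "status"]])
```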

Enter: Multi-Level Models

Luckily for us data scientists, we have a lot more tools in our toolbox than just the three algorithms mentioned at the beginning of this article. One suite of lesser-known algorithms we might explore is multi-level models.

If you haven’t heard of multi-level models, you may be familiar with “mixed effects” or “hierarchical” models. These three terms all refer to roughly the same thing. The big advantage of this type of model? Each case in your training data does not have to be independent. This means that we can use a dataset that looks a lot more like the second table above, as opposed to the (aggregated) first table.

Analogous Algorithms

Fortunately for us, a lot of the more traditional algorithms have multi-level analogs.

[Image: Austin Powers scene, “We’re Not So Different, You and I”]

In fact, there are multi-level and mixed effects flavors of logistic regression that allow you to accommodate dependence between rows in your training data.
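As a small taste before that post (and only as a sketch, not necessarily the tooling we will use there), here is one way to fit a mixed-effects logistic regression in Python with statsmodels. It assumes a pandas DataFrame `df` holding your full longitudinal history, with the predictor columns from the tables above plus a `status` column; the random intercept per loan is what accommodates the dependence between a loan’s repeated monthly rows. The file name and column names are illustrative.

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# `df` is assumed to hold the full longitudinal history: one row per loan per
# month, with the predictors shown earlier plus a "status" column
df = pd.read_csv("loan_history.csv")  # hypothetical file name
df["default"] = (df["status"] == "Default").astype(int)

# Fixed effects for the predictors, plus a random intercept for each loan;
# the random intercept absorbs the correlation among a loan's repeated rows
model = BinomialBayesMixedGLM.from_formula(
    "default ~ debt_coverage_ratio + credit_score + C(industry_outlook)",
    vc_formulas={"loan": "0 + C(loan_id)"},
    data=df,
)
result = model.fit_vb()  # variational Bayes estimation
print(result.summary())
```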

In our next blog post, we will dive deeper into the technical approaches to implementing these kinds of algorithms for building better credit risk models.

Interested in Learning More?

Get in touch with us today at info@ketchbrookanalytics.com