Making Small Data Big(ger)

Don’t let your model choice shrink your data.
Author: Michael Thomas

Published: February 19, 2022

One of the caveats of working with traditional, vanilla machine learning algorithms (e.g., linear regression, logistic regression, decision trees) is that each case (row) in your training dataset must be independent. Let’s explore what exactly this means.

The way those algorithms learn is by looking at a bunch of cases: the circumstances surrounding each case, and what the outcome was for each case. The algorithm then tries to boil all of these cases down to a handful of rules that do a pretty good job of explaining how certain circumstances generally lead to particular outcomes.

Applications in Credit Risk

When we think about credit risk models, the cases might be a bunch of historical loans in our portfolio whose outcomes we already know. To provide a simplified example, let’s suppose we are building a logistic regression model where the possible outcomes are (1) the loan was paid back in full, or (2) the borrower defaulted on the loan.

| Loan ID | Debt Coverage Ratio | Credit Score | Industry Outlook | Outcome      |
|---------|---------------------|--------------|------------------|--------------|
| 1001    | 1.334               | 711          | Fair             | Paid in Full |
| 1002    | 1.163               | 760          | Poor             | Default      |
| 1003    | 1.858               | 740          | Excellent        | Paid in Full |
| 1004    | 0.767               | 724          | Poor             | Paid in Full |
| 1005    | 1.346               | 712          | Above Average    | Default      |
| 1006    | 1.966               | 697          | Excellent        | Paid in Full |
| 1007    | 1.502               | 671          | Excellent        | Default      |
| 1008    | 1.117               | 743          | Fair             | Paid in Full |

In order to create the above dataset to train our model, we had to aggregate each loan into a single observation, so that each row represents a unique loan. Remember, each case in our training data must be independent; for us this means that we cannot have any loan appear more than once. There are many approaches to this aggregation, which we won’t cover today; for now, just remember that the approach you take should be driven by what information will be available at the time you score a new loan.
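To make that concrete, here is a minimal sketch of fitting the vanilla logistic regression on the aggregated, loan-level table, written in Python with pandas and scikit-learn. The column names mirror the table above; the pipeline is just one illustrative way to handle the categorical predictor, not a prescribed recipe.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One row per loan, mirroring the first few rows of the aggregated table
loans = pd.DataFrame({
    "loan_id": [1001, 1002, 1003, 1004],
    "debt_coverage_ratio": [1.334, 1.163, 1.858, 0.767],
    "credit_score": [711, 760, 740, 724],
    "industry_outlook": ["Fair", "Poor", "Excellent", "Poor"],
    "outcome": ["Paid in Full", "Default", "Paid in Full", "Paid in Full"],
})

X = loans[["debt_coverage_ratio", "credit_score", "industry_outlook"]]
y = (loans["outcome"] == "Default").astype(int)  # 1 = Default, 0 = Paid in Full

# One-hot encode the categorical predictor, then fit the logistic regression
model = Pipeline([
    ("encode", ColumnTransformer(
        [("outlook", OneHotEncoder(handle_unknown="ignore"), ["industry_outlook"])],
        remainder="passthrough",
    )),
    ("logit", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# Score a new loan using the same predictors we trained on
new_loan = pd.DataFrame({
    "debt_coverage_ratio": [1.25],
    "credit_score": [705],
    "industry_outlook": ["Good"],
})
print(model.predict_proba(new_loan)[:, 1])  # predicted probability of default
```

Note that this model only ever sees one snapshot per loan; everything that happened to the loan before that snapshot has already been squeezed out by the aggregation.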

Aggregation is Limiting

When we take the step of aggregating our data into unique loan-level observations, we are naturally reducing the amount of data we have to work with. If you have tons of data, this isn’t an issue. But one issue we run into often is severe class imbalance in our outcome. In other words, we tend to have a lot more “Paid in Full” cases than we have “Default” cases.

“Remember, each case in our training data must be independent; for us this means that we cannot have any loan appear more than once.”

But what if we didn’t have to satisfy that independence assumption? What if we didn’t have to aggregate our data? After all, the original data in our database probably looks a lot more like this:

| Loan ID | Date       | Debt Coverage Ratio | Credit Score | Industry Outlook | Status       |
|---------|------------|---------------------|--------------|------------------|--------------|
| 1001    | 2021-06-01 | 1.925               | 749          | Excellent        | Current      |
| 1001    | 2021-07-01 | 1.674               | 705          | Good             | Current      |
| 1001    | 2021-08-01 | 1.334               | 711          | Fair             | Paid in Full |
| 1002    | 2021-02-01 | 1.199               | 764          | Good             | Current      |
| 1002    | 2021-03-01 | 1.163               | 760          | Poor             | Default      |
| 1003    | 2021-09-01 | 0.644               | 800          | Above Average    | Current      |
| 1003    | 2021-10-01 | 2.654               | 728          | Good             | Current      |
| 1003    | 2021-11-01 | 1.858               | 740          | Excellent        | Paid in Full |

This type of data is sometimes referred to as “longitudinal” data, and represents observations of the same subject(s) over multiple points in time. In our case, the “subjects” are the unique loans. Clearly, the rows in this type of dataset are not independent, since we see the same loan appear more than once.
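To make the connection to the earlier table explicit, here is a small sketch, assuming the longitudinal records sit in a pandas DataFrame called `history` (mirroring a few rows of the table above), showing one simple aggregation: keeping each loan’s most recent observation. It is just one of the many possible approaches alluded to earlier.

```python
import pandas as pd

# Longitudinal records: one row per loan per month, mirroring the table above
history = pd.DataFrame({
    "loan_id": [1001, 1001, 1001, 1002, 1002],
    "date": pd.to_datetime([
        "2021-06-01", "2021-07-01", "2021-08-01",
        "2021-02-01", "2021-03-01",
    ]),
    "debt_coverage_ratio": [1.925, 1.674, 1.334, 1.199, 1.163],
    "credit_score": [749, 705, 711, 764, 760],
    "industry_outlook": ["Excellent", "Good", "Fair", "Good", "Poor"],
    "status": ["Current", "Current", "Paid in Full", "Current", "Default"],
})

# One simple aggregation: keep each loan's final monthly observation,
# collapsing the history back down to one row per loan
loan_level = (
    history.sort_values("date")
           .groupby("loan_id", as_index=False)
           .last()
           .drop(columns="date")
           .rename(columns={"status": "outcome"})
)
print(loan_level)
```

Whichever aggregation you choose, the result is the same: the monthly detail gets collapsed into a single row per loan.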

What’s to be Gained

Suppose the independence condition didn’t exist, and we could use this longitudinal data to build our logistic regression model. What would we gain by doing so?

  • More Data: For starters, we would have a lot more data! In situations where we don’t have a ton of data to begin with, each row of data we do have is really important. This is especially true when our data suffers from class imbalance: we need as much information about “Default” loans as possible to help our model develop those general rules (and avoid overfitting).
  • More Signal: Second, we can give our model insight into a loan’s history in a way that we weren’t able to with our aggregated dataset. For example, it’s probably important to distinguish between a loan that defaulted after being on the books for 3 years and one that defaulted after 3 months. You can think of this as incorporating an entire additional predictor variable into our model (see the short sketch after this list).
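As a rough illustration of that second point, here is a hedged sketch, re-using the hypothetical `history` DataFrame from the previous example, of deriving a “months on the books” field that simply does not exist once the data has been aggregated to one row per loan. The assumption that a loan’s earliest observed date approximates its origination date is ours, for illustration only.

```python
# Assumption (illustrative): each loan's earliest observed date approximates
# its origination date
origination = history.groupby("loan_id")["date"].transform("min")

# How long each loan had been on the books at each monthly observation
history["months_on_books"] = (
    (history["date"].dt.year - origination.dt.year) * 12
    + (history["date"].dt.month - origination.dt.month)
)

# A default at month 3 now looks different to the model than one at month 36
print(history[["loan_id", "date", "months_on_books", "status"]])
```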

Enter: Multi-Level Models

Luckily for us data scientists, we have a lot more tools in our toolbox than just the three algorithms mentioned at the beginning of this article. One suite of lesser-known algorithms we might explore is multi-level models.

If you haven’t heard of multi-level models, you may be familiar with “mixed effects” or “hierarchical” models. These three terms all refer to roughly the same thing. The big advantage of this type of model? Each case in your training data does not have to be independent. This means that we can use a dataset that looks a lot more like the second table above, as opposed to the (aggregated) first table.

Analogous Algorithms

Fortunately for us, a lot of the more traditional algorithms have multi-level analogs.

[Image: Austin Powers scene, “We’re Not So Different, You and I”]

In fact, there are multi-level and mixed effects flavors of logistic regression that allow you to accommodate dependence between rows in your training data.
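As a small taste before that post (and only as a sketch, not necessarily the tooling we will use there), here is one way to fit a mixed-effects logistic regression in Python with statsmodels. It assumes a pandas DataFrame `df` holding your full longitudinal history, with the predictor columns from the tables above plus a `status` column; the random intercept per loan is what accommodates the dependence between a loan’s repeated monthly rows. The file name and column names are illustrative.

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# `df` is assumed to hold the full longitudinal history: one row per loan per
# month, with the predictors shown earlier plus a "status" column
df = pd.read_csv("loan_history.csv")  # hypothetical file name
df["default"] = (df["status"] == "Default").astype(int)

# Fixed effects for the predictors, plus a random intercept for each loan;
# the random intercept absorbs the correlation among a loan's repeated rows
model = BinomialBayesMixedGLM.from_formula(
    "default ~ debt_coverage_ratio + credit_score + C(industry_outlook)",
    vc_formulas={"loan": "0 + C(loan_id)"},
    data=df,
)
result = model.fit_vb()  # variational Bayes estimation
print(result.summary())
```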

In our next blog post, we will dive deeper into the technical approaches to implementing these kinds of algorithms for building better credit risk models.

Interested in Learning More?

Get in touch with us today at info@ketchbrookanalytics.com