Applying Machine Learning to Conventional Model Validations
In addition to transforming the way in which financial institutions approach predictive modeling, machine learning techniques are beginning to find their way into how model validators assess conventional, non-machine-learning predictive models. While the array of standard statistical techniques available for validating predictive models remains impressive, the advent of machine learning technology has opened new avenues of possibility for expanding the rigor and depth of insight that can be gained in the course of model validation. In this blog post, we explore how machine learning, in some circumstances, can supplement a model validator’s efforts related to:
- Outlier detection on model estimation data
- Clustering of data to better understand model accuracy
- Feature selection methods to determine the appropriateness of independent variables
- The use of machine learning algorithms for benchmarking
- Machine learning techniques for sensitivity analysis and stress testing
Conventional model validations include, when practical, an assessment of the dataset from which the model is derived. (This is not always practical—or even possible—when it comes to proprietary, third-party vendor models.) Regardless of a model’s design and purpose, virtually every validation concerns itself with at least a cursory review of where these data are coming from, whether their source is reliable, how they are aggregated, and how they figure into the analysis.
Conventional model validation techniques sometimes overlook (or fail to look deeply enough at) the question of whether the data population used to estimate the model is problematic. Outliers—and the effect they may be having on model estimation—can be difficult to detect using conventional means. Developing descriptive statistics and identifying data points that are one, two, or three standard deviations from the mean (i.e., extreme value analysis) is a straightforward enough exercise, but this does not necessarily tell a modeler (or a model validator) which data points should be excluded.
Machine learning modelers use a variety of proximity and projection methods for filtering outliers from their training data. One proximity method employs the K-means algorithm, which groups data into clusters centered around defined “centroids,” and then identifies data points that do not appear to belong to any particular cluster. Common projection methods include multi-dimensional scaling, which allows analysts to view multi-dimensional relationships among multiple data points in just two or three dimensions. Sophisticated model validators can apply these techniques to identify dataset problems that modelers may have overlooked.
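As an illustration of the proximity approach described above, the sketch below flags outliers using K-means. The library (scikit-learn) and the synthetic data are assumptions for illustration; the post names no specific tooling. Points are flagged when they sit far from every centroid or land in a tiny cluster of their own:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 2-D data: two tight clusters plus one planted outlier
data = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
    [[20.0, 20.0]],  # the outlier, at index 100
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
# Distance from each point to its assigned centroid
dist = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)
counts = np.bincount(kmeans.labels_)

# Flag points far from their centroid (3x the median distance is an
# illustrative cutoff) or sitting in a near-singleton cluster
far = dist > 3 * np.median(dist)
tiny = counts[kmeans.labels_] < 0.05 * len(data)
outliers = np.where(far | tiny)[0]
print(outliers)
```

A validator would then review flagged points against the estimation data rather than excluding them mechanically.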
The tendency of data to cluster presents another opportunity for model validators. Machine learning techniques can be applied to determine the relative compactness of individual clusters and how distinct individual clusters are from one another. Clusters that do not appear well defined and blur into one another are evidence of a potentially problematic dataset—one that may result in non-existent patterns being identified in random data. Such clustering could be the basis of any number of model validation findings.
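One common way to quantify how compact and distinct clusters are is the silhouette score. The sketch below, using scikit-learn on synthetic data (both assumptions, for illustration only), contrasts well-defined clusters with blurred ones:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two well-separated clusters vs. two heavily overlapping ones
tight = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
blurred = np.vstack([rng.normal(0, 2.0, (100, 2)), rng.normal(1, 2.0, (100, 2))])

def cluster_quality(data):
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return silhouette_score(data, labels)  # near 1 = compact, distinct clusters

tight_score = cluster_quality(tight)
blurred_score = cluster_quality(blurred)
print(round(tight_score, 2), round(blurred_score, 2))
```

A low silhouette score on the estimation data would support the kind of finding described above.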
Feature (Variable) Selection
What conventional predictive modelers typically refer to as variables are commonly referred to by machine learning modelers as features. Features and variables serve essentially the same function, but the way in which they are selected can differ. Conventional modelers tend to select variables using a combination of expert judgment and statistical techniques. Machine learning modelers tend to take a more systematic approach that includes stepwise procedures, criterion-based procedures, lasso and ridge regression, and dimensionality reduction. These methods are designed to ensure that machine learning models achieve their objectives in the simplest way possible, using the fewest features necessary and avoiding redundancy. Because model validators frequently encounter black-box applications, directly applying these techniques is not always possible. In some limited circumstances, however, model validators can add to the robustness of their validations by applying machine learning feature selection methods to determine whether conventionally selected model variables resemble those selected by these more advanced means (and if not, why not).
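One of the techniques named above, lasso regression, can be sketched as a feature selection step. The example below uses scikit-learn's LassoCV on synthetic data in which only two of six candidate variables are informative; the data, the library choice, and the 0.1 coefficient cutoff are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
# Only the first two of six candidate variables actually drive the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

lasso = LassoCV(cv=5, random_state=0).fit(StandardScaler().fit_transform(X), y)
# Treat coefficients shrunk (near) to zero by the L1 penalty as dropped
selected = [i for i, c in enumerate(lasso.coef_) if abs(c) > 0.1]
print(selected)
```

A validator could compare the surviving feature set against the conventionally selected variables and investigate any differences.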
Identifying and applying an appropriate benchmarking model can be challenging for model validators. Commercially available alternatives are often difficult to (cost effectively) obtain, and building challenger models from scratch can be time-consuming and problematic—particularly when all they do is replicate what the model in question is doing.
While not always feasible, building a machine learning model using the same data that was used to build a conventionally designed predictive model presents a “gold standard” benchmarking opportunity for assessing the conventionally developed model’s outputs. Where significant differences are noted, model validators can investigate the extent to which differences are driven by data/outlier omission, feature/variable selection, or other factors.
Sensitivity Analysis and Stress Testing
The sheer quantity of high-dimensional data very large banks need to process in order to develop their stress testing models makes conventional statistical analysis both computationally expensive and problematic. (This is sometimes referred to as the “curse of dimensionality.”) Machine learning feature selection techniques, described above, are frequently useful in determining whether variables selected for stress testing models are justifiable.
Similarly, machine learning techniques can be employed to isolate, in a systematic way, those variables to which any predictive model is most and least sensitive. Model validators can use this information to quickly ascertain whether these sensitivities are appropriate. A validator, for example, may want to take a closer look at a credit model that is revealed to be more sensitive to, say, zip code, than it is to credit score, debt-to-income ratio, loan-to-value ratio, or any other individual variable or combination of variables. Machine learning techniques make it possible for a model validator to assess a model’s relative sensitivity to virtually any combination of features and make appropriate judgments.
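One systematic way to isolate the variables a model is most and least sensitive to is permutation importance. In the sketch below (scikit-learn, synthetic data, and illustrative variable names, none of them from the post's actual dataset), a model whose ranking placed zip_code above credit_score would warrant a closer look:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
# Hypothetical credit data: the first two columns drive default; the third is noise
X = rng.normal(size=(1000, 3))  # columns: credit_score, dti, zip_code
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X, y)

# Shuffle each variable in turn and measure the resulting drop in accuracy
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["credit_score", "dti", "zip_code"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```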
Model validators have many tools at their disposal for assessing the conceptual soundness, theory, and reliability of conventionally developed predictive models. Machine learning is not a substitute for these, but its techniques offer a variety of ways of supplementing traditional model validation approaches and can provide validators with additional tools for ensuring that models are adequately supported by the data that underlies them.
Permissioned Blockchains–A Quest for Consensus
Conspicuously absent from all the chatter around blockchain’s potential place in structured finance has been much discussion around the thorny matter of consensus. Consensus is at the heart of all distributed ledger networks and is what enables them to function without a trusted central authority. Consensus algorithms are designed to prevent fraud and error. With large, public blockchains, achieving consensus—ensuring that all new information has been examined before it is universally accepted—is relatively straightforward. It is achieved either by requiring participants to perform large amounts of computational work (proof of work) or by deferring to members who collectively hold a majority stake in the blockchain (proof of stake).
However, when it comes to private (or “permissioned”) blockchains with a relatively small number of interested parties—the kind of blockchains that are currently poised for adoption in the structured finance space—the question of how to obtain consensus takes on an added layer of complexity. Restricting membership greatly reduces the need for elaborate algorithms to prevent fraud on permissioned blockchains. Instead, these applications must ensure that complex workflows and transactions are implemented correctly. They must provide a framework for having members agree to the very structure of the transaction itself. Consensus algorithms complement this by ensuring that the steps performed in verifying transaction data are agreed upon and verified.
With widespread adoption of blockchain in structured finance appearing more and more to be a question of when rather than if, SmartLink Labs, a RiskSpan fintech affiliate, recently embarked on a proof of concept designed to identify and measure the impact of the technology across the structured finance life cycle. The project took a holistic approach, looking at everything from deal issuance to bondholder payments. We sought to understand the benefits, how various roles would change, and the extent to which certain functions might be eliminated altogether. At the heart of virtually every lesson we learned along the way was a common, overriding principle: consensus is hard.
Why is Consensus Hard?
Much of blockchain’s appeal to those of us in the structured finance arena has to do with its potential to lend visibility and transparency to complicated payment rules that govern deals along with dynamic borrower- and collateral-level details that evolve over the lives of the underlying loans. Distributed ledgers facilitate the real-time sharing of these details across all relevant parties—including loan originators, asset servicers, and bond administrators—from deal issuance through the final payment on the transaction. The ledger transactions are synchronized to ensure that ledgers only update when the appropriate participants approve transactions. This is the essence of consensus, and it seems like it ought to be straightforward.
Imagine our surprise when one of the most significant challenges our test implementation encountered was designing the consensus algorithm. Unlike with public blockchains, consensus in a private, or “permissioned,” blockchain is designed for a specific business purpose where the counterparties are known. However, to achieve consensus, the data posted to the blockchain must be verified in an automated manner by the relevant parties to the transaction. One of the challenges with the data and rules that govern most structured transactions is that they are (at best) only partially digital. We approached our project with the premise that most business terms can be translated into a series of logical statements in the form of computer code. Translating unstructured data into structured data in a fully transparent way is problematic, however, and limitations to transparency represent a significant barrier to achieving consensus. In order for a distributed ledger to work in this context, all transaction parties need to reach consensus around how the cash will flow and numerous other business rules throughout the process.
A Potential Solution for Structured Finance
To this end, our initial prototype seeks to test our consensus algorithm on the deal waterfall model. If the industry can move to a process where consensus of the deal waterfall model is achieved at deal issuance, the model posted to the blockchain can then serve as an agreed-upon source of truth and perpetuate through the life of the security—from loan administration to master servicer aggregation and bondholder payments. This business function alone could save the industry countless hours and effectively eliminate all of today’s costs associated with having to model and remodel each transaction multiple times.
Those of us who have been in the structured finance business for 25 years or more know how little the fundamental business processes have evolved. They remain manual, governed largely by paper documents, and prone to human error.
The mortgage industry has proven to be particularly problematic. Little to no transparency in the process has fostered a culture of information asymmetry and general mistrust which has predictably given rise to the need to have multiple unrelated parties double-checking data, performing due diligence reviews on virtually all loan files, validating and re-validating cash flow models, and requiring costly layers of legal payment verification. Ten or more parties might contribute in one way or another to verifying and validating data, documents, or cash flow models for a single deal. Upfront consensus via blockchain holds the potential to dramatically reduce or even eliminate almost all of this redundancy.
Transparency and Real-Time Investor Reporting
The issuance process, of course, is only the beginning. The need for consensus does not end when the cash flow model is agreed to and the deal is finalized. Once we complete a verified deal, the focus of our proof of concept will shift to the monthly process of investor reporting and corresponding payments to the bond holders.
The immutability of transactions posted to the ledger is particularly valuable because of the unmistakable audit trail it creates. Rather than compelling master servicers to rely on a monthly servicing snapshot “tape” to figure out what happened to a severely delinquent loan with four instances of non-sufficient funds, a partial payment in suspense, and an interest rate change somewhere in the middle, putting all these transactions on a blockchain creates a relatively straightforward sequence of transactions that everyone can decipher.
Posting borrower payments to a blockchain in real time will also require consensus among transaction parties. Once this is achieved, the antiquated notion of monthly investor reporting will become obsolete. The potential ramifications of this extend to timing of payments to bond holders. No longer needing to wait until the next month to find out what borrowers did the month before means that payments to investors might be accelerated and, in the private-label security markets, perhaps even more often than monthly. With real-time consensus comes the possibility of far more flexibility for issuers and investors in designing the timing of cash flows should they elect to pursue it.
This envisioned future state is not without its detractors. Some ask why servicers would opt for more transparency when they already encounter more scrutiny and criticism than they would like. In many cases, however, it is the lack of transparency, more than a servicer’s actions themselves, that invites the unwanted scrutiny. Servicers that move beyond reporting monthly snapshots and post comprehensive loan activity to a blockchain stand to reap significant competitive advantages. Because of the real-time consensus and sharing of dynamic loan reporting data (and perhaps accelerated bond payments, as suggested above), investors will quickly gravitate toward deals that are administered by blockchain-enabled servicers. Sooner or later, servicers who fail to adapt will find themselves on the outside looking in.
Less Redundancy; More Trust
Much of blockchain’s appeal is bound up in the promise of an environment in which deal participants can gain reasonable assurance that their counterparts are disclosing information that is both accurate and comprehensive. Visibility is an important component of this, but ultimately, achieving consensus that what is being done is what ought to be done will be necessary in order to fully eliminate redundant functions in business processes and overcome information asymmetry in the private markets. Sophisticated, well-conceived algorithms that enable private parties to arrive at this consensus in real time will be key.
One of the enduring lessons of our structured finance proof of concept is that consensus is necessary throughout a transaction’s life. The market (i.e., issuers, investors, servicers, and bond administrators) will ultimately determine what gets posted to a blockchain and what remains off-chain, and more than one business model will likely evolve. As data becomes more structured and more reliable, however, competitive advantages will increasingly accrue to those who adopt consensus algorithms capable of infusing trust into the process. The failure of the private-label MBS market to regain its pre-crisis footing is, in large measure, a failure of trust. Nothing repairs trust like consensus.
Hands-On Machine Learning–Predicting Loan Delinquency
The ability of machine learning models to predict loan performance makes them particularly interesting to lenders and fixed-income investors. This expanded post provides an example of applying the machine learning process to a loan-level dataset in order to predict delinquency. The process includes variable selection, model selection, model evaluation, and model tuning.
The data used in this example are from the first quarter of 2005 and come from the publicly available Fannie Mae performance dataset. The data are segmented into two different sets: acquisition and performance. The acquisition dataset contains 217,000 loans (rows) and 25 variables (columns) collected at origination (Q1 2005). The performance dataset contains the same set of 217,000 loans coupled with 31 variables that are updated each month over the life of the loan. Because there are multiple records for each loan, the performance dataset contains approximately 16 million rows.
For this exercise, the problem is to build a model capable of predicting which loans will become severely delinquent, defined as falling behind six or more months on payments. This delinquency variable was calculated from the performance dataset for all loans and merged with the acquisition data based on the loan’s unique identifier. This brings the total number of variables to 26. Plenty of other hypotheses can be tested, but this analysis focuses on just this one.
1 Variable Selection
An overview of the dataset can be found below, showing the name of each variable as well as the number of observations available.
Variable                                     Count
LOAN_IDENTIFIER                              217088
CHANNEL                                      217088
SELLER_NAME                                  217088
ORIGINAL_INTEREST_RATE                       217088
ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB)      217088
ORIGINAL_LOAN_TERM                           217088
ORIGINATION_DATE                             217088
FIRST_PAYMENT_DATE                           217088
ORIGINAL_LOAN-TO-VALUE_(LTV)                 217088
ORIGINAL_COMBINED_LOAN-TO-VALUE_(CLTV)       217074
NUMBER_OF_BORROWERS                          217082
DEBT-TO-INCOME_RATIO_(DTI)                   201580
BORROWER_CREDIT_SCORE                        215114
FIRST-TIME_HOME_BUYER_INDICATOR              217088
LOAN_PURPOSE                                 217088
PROPERTY_TYPE                                217088
NUMBER_OF_UNITS                              217088
OCCUPANCY_STATUS                             217088
PROPERTY_STATE                               217088
ZIP_(3-DIGIT)                                217088
MORTGAGE_INSURANCE_PERCENTAGE                34432
PRODUCT_TYPE                                 217088
CO-BORROWER_CREDIT_SCORE                     100734
MORTGAGE_INSURANCE_TYPE                      34432
RELOCATION_MORTGAGE_INDICATOR                217088
Most of the variables in the dataset are fully populated, with the exception of DTI, MI Percentage, MI Type, and Co-Borrower Credit Score. Many options exist for dealing with missing values, including dropping the affected rows, eliminating the variable entirely, substituting a value such as 0 or the mean, or using a model to impute the most likely value.
The following chart plots the frequency of the 34,000 MI Percentage values.
The distribution suggests a decent amount of variability. Most loans that have mortgage insurance are covered at 25%, but there are sizeable populations both above and below. Mortgage insurance is not required for the majority of borrowers, so it makes sense that this value would be missing for most loans. In this context, it makes the most sense to substitute the missing values with 0, since 0% mortgage insurance is an accurate representation of the state of the loan. An alternative that could be considered is to turn the variable into a binary yes/no variable indicating if the loan has mortgage insurance, though this would result in a loss of information.
The next variable with a large number of missing values is Mortgage Insurance Type. Querying the dataset reveals that of the roughly 34,400 loans that have mortgage insurance, 33,000 have type 1 (borrower-paid) insurance and the remaining 1,400 have type 2 (lender-paid) insurance. As with the MI Percentage variable, the blank values can be filled, changing the variable to indicate whether the loan has no insurance, type 1, or type 2.
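The fill-in step for both insurance variables might look like the following pandas sketch. The toy data frame is invented; the post does not show its actual code:

```python
import numpy as np
import pandas as pd

# Toy slice of the acquisition data; NaN means no mortgage insurance
df = pd.DataFrame({
    "MORTGAGE_INSURANCE_PERCENTAGE": [25.0, np.nan, 30.0, np.nan],
    "MORTGAGE_INSURANCE_TYPE": [1.0, np.nan, 2.0, np.nan],
})

# Missing MI percentage -> 0% coverage; missing MI type -> 0 (no insurance)
df["MORTGAGE_INSURANCE_PERCENTAGE"] = df["MORTGAGE_INSURANCE_PERCENTAGE"].fillna(0)
df["MORTGAGE_INSURANCE_TYPE"] = df["MORTGAGE_INSURANCE_TYPE"].fillna(0)
print(df["MORTGAGE_INSURANCE_TYPE"].tolist())
```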
The remaining variable with a significant number of missing values is Co-Borrower Credit Score, with approximately half of its values missing. Unlike MI Percentage, the context does not allow us to substitute missing values with zeroes. The distribution of both borrower and co-borrower credit score as well as their relationship can be found below.
As the plot demonstrates, borrower and co-borrower credit scores are correlated. Because of this, the removal of co-borrower credit score would only result in a minimal loss of information (especially within the context of this example). Most of the variance captured by co-borrower credit score is also captured in borrower credit score. Turning the co-borrower credit score into a binary yes/no ‘has co-borrower’ variable would not be of much use in this scenario as it would not differ significantly from the Number of Borrowers variable. Alternate strategies such as averaging borrower/co-borrower credit score might work, but for this example we will simply drop the variable.
In summary, the dataset is now smaller—Co-Borrower Credit Score has been dropped. Additionally, missing values for MI Percentage and MI Type have been filled in. Now that the data have been cleaned up, the values and distributions of the remaining variables can be examined to determine what additional preprocessing steps are required before model building. Scatter matrices of pairs of variables and distribution plots of individual variables along the diagonal can be found below. The scatter plots are helpful for identifying multicollinearity between pairs of variables, and the distributions can show if a variable lacks enough variance that it won’t contribute to model performance.

The third row of scatterplots, above, reflects a lack of variability in the distribution of Original Loan Term. The variance of 3.01 (calculated separately) is very small, and as a result the variable can be removed—it will not contribute to any model as there is very little information to learn from. This process of inspecting scatterplots and distributions is repeated for the remaining pairs of variables. The Number of Units variable suffers from the same issue and can also be removed.
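The variance check described above can be automated. The sketch below (scikit-learn with toy data, both assumed for illustration) drops a near-constant column in the same spirit in which Original Loan Term and Number of Units were removed:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(4)
# Column 1 is nearly constant, like Original Loan Term in the post
X = np.column_stack([
    rng.normal(6.0, 1.0, 1000),   # e.g., interest rate: plenty of variance
    np.full(1000, 360.0),         # near-constant loan term
])
X[:5, 1] = 359.0                  # a handful of non-360 values

# The 0.1 cutoff is an illustrative choice, not the post's
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)
print(selector.get_support())
```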
2 Heatmaps and Pairwise Grids
Matrices of scatterplots are useful for looking at the relationships between variables. Another useful plot is a heatmap and pairwise grid of correlation coefficients. In the plot below a very strong correlation between Original LTV and Original CLTV is identified.
This multicollinearity can be problematic both for interpreting the relationship between the variables and delinquency and for the actual performance of some models. To combat this problem, we remove Original CLTV because Original LTV is a more accurate representation of the loan at origination. Loans in this population that were not refinanced kept their original LTV value as CLTV. If CLTV were included, it would introduce information to the model that was not available at origination. Allowing such unexpected additional information into a dataset is an issue known as leakage, which biases the model.
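A correlation check of this kind might be sketched as follows. The synthetic LTV/CLTV data and the 0.9 cutoff are illustrative assumptions, not the post's actual values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
ltv = rng.uniform(60, 95, 1000)
# CLTV equals LTV for most loans, slightly higher for a refinanced minority
cltv = ltv + np.where(rng.random(1000) < 0.1, rng.uniform(0, 15, 1000), 0.0)
df = pd.DataFrame({"ORIGINAL_LTV": ltv, "ORIGINAL_CLTV": cltv})

corr = df.corr()
print(round(corr.loc["ORIGINAL_LTV", "ORIGINAL_CLTV"], 2))
# Drop one variable of any highly correlated pair before modeling
if corr.loc["ORIGINAL_LTV", "ORIGINAL_CLTV"] > 0.9:
    df = df.drop(columns=["ORIGINAL_CLTV"])
```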
Now that the numeric variables have been inspected, the remaining categorical variables must be analyzed to ensure that the classes are not significantly unbalanced. Count plots and simple descriptive statistics can be used to identify problematic categorical variables. Two examples below show the count of loans by state and by seller.
Inspecting the remaining variables uncovers that Relocation Indicator (indicating a mortgage issued when an employer moves an employee) and Product Type (fixed vs. adjustable rate) must be removed as they are extremely unbalanced and do not contain any information that will help the models learn. We also removed first payment date and origination date, which were largely redundant. The final cleanup results in a dataset that contains the following columns:
LOAN_IDENTIFIER CHANNEL SELLER_NAME ORIGINAL_INTEREST_RATE ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB) ORIGINAL_LOAN-TO-VALUE_(LTV) NUMBER_OF_BORROWERS DEBT-TO-INCOME_RATIO_(DTI) BORROWER_CREDIT_SCORE FIRST-TIME_HOME_BUYER_INDICATOR LOAN_PURPOSE PROPERTY_TYPE OCCUPANCY_STATUS PROPERTY_STATE MORTGAGE_INSURANCE_PERCENTAGE MORTGAGE_INSURANCE_TYPE ZIP_(3-DIGIT)
The final two steps before model building are to standardize each of the numeric variables and to turn each categorical variable into a series of dummy or indicator variables. Numeric variables are scaled to mean 0 and standard deviation 1 so that variables on different scales (e.g., interest rate vs. LTV) are easier to compare. Standardizing is also a requirement for many algorithms (e.g., principal component analysis).
Categorical variables are transformed by turning each n value of the variable into its own yes/no feature. For example, Property State originally has 50 possible values, so it will be turned into 50 variables (e.g. Alabama yes/no, Alaska yes/no). For categorical variables with many values this transformation will significantly increase the number of variables in the model.
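The standardization and dummy-variable steps above can be sketched with pandas and scikit-learn. The toy data frame is invented; the column names follow the dataset overview but the code itself is illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "ORIGINAL_INTEREST_RATE": [5.5, 6.0, 6.5, 7.0],
    "ORIGINAL_LTV": [80.0, 95.0, 60.0, 75.0],
    "PROPERTY_STATE": ["AL", "AK", "AL", "CA"],
})

numeric = ["ORIGINAL_INTEREST_RATE", "ORIGINAL_LTV"]
df[numeric] = StandardScaler().fit_transform(df[numeric])  # mean 0, std 1
df = pd.get_dummies(df, columns=["PROPERTY_STATE"])        # one column per state
print(sorted(df.columns))
```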
After scaling and transforming the dataset, the final shape is 199,716 rows and 106 columns. The target variable—loan delinquency—has 186,094 ‘no’ values and 13,622 ‘yes’ values. The data are now ready to be used to build, evaluate, and tune machine learning models.
3 Model Selection
Because the target variable loan delinquency is binary (yes/no) the methods available will be classification machine learning models. There are many classification models, including but not limited to: neural networks, logistic regression, support vector machines, decision trees and nearest neighbors. It is always beneficial to seek out domain expertise when tackling a problem to learn best practices and reduce the number of model builds. For this example, two approaches will be tried—nearest neighbors and decision tree.
The first step is to split the dataset into two segments: training and testing. For this example, 40% of the data will be partitioned into the test set, and 60% will remain as the training set. The resulting segmentations are as follows:
1. 60% of the observations (as training set)- X_train
2. The associated target (loan delinquency) for each observation in X_train- y_train
3. 40% of the observations (as test set)- X_test
4. The targets associated with the test set- y_test
Data should be randomly shuffled before they are split, as datasets are often in some type of meaningful order. Once the data are segmented the model will first be exposed to the training data to begin learning.
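Assuming scikit-learn, which the post's variable names (X_train, y_train, and so on) suggest, the shuffle-and-split step looks roughly like this on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# shuffle=True (the default) randomizes order before splitting,
# guarding against any meaningful ordering in the raw data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, shuffle=True)
print(X_train.shape, X_test.shape)
```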
4 K-Nearest Neighbors Classifier
Training a K-neighbors model requires fitting the model on the X_train (variables) and y_train (target) training observations. Once the model is fit, a summary of the model hyperparameters is returned. Hyperparameters are model parameters that are not learned during training but rather are selected by the model creator.
The K-neighbors algorithm searches for the K closest (i.e., most similar) training examples for each test observation using a metric that calculates the distance between observations in high-dimensional space. Once the nearest neighbors are identified, a predicted class label is generated as the class that is most prevalent in the neighbors. The biggest challenge with a K-neighbors classifier is choosing the number of K neighbors to use. Another significant consideration is the type of distance metric to use.
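A minimal fitting sketch, with synthetic data standing in for the loan dataset, might look like this (scikit-learn assumed; K = 5 and the Euclidean special case of the Minkowski metric are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] > 0).astype(int)  # label depends only on the first feature

# n_neighbors (K), the distance metric, and p are hyperparameters
# chosen by the modeler, not learned from the data
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)
pred = knn.predict(np.array([[3.0, 0.0, 0.0, 0.0]]))
print(knn.get_params()["n_neighbors"], pred[0])
```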
To see more clearly how this method works, the 6 nearest neighbors of two random observations from the training set were selected: one non-delinquent (label 0) observation and one delinquent (label 1) observation.
Random delinquent observation: 28919
Random non-delinquent observation: 59504

The indices of, and Minkowski distances to, the 6 nearest neighbors of the two random observations are found below. Unsurprisingly, the first nearest neighbor of each observation is always itself, at a distance of 0.

Indices of the 6 nearest neighbors of obs. 28919: [28919 112677 88645 103919 27218 15512]
Distances to the 6 nearest neighbors of obs. 28919: [0 0.703 0.842 0.883 0.973 1.011]
Indices of the 6 nearest neighbors of obs. 59504: [59504 87483 25903 22212 96220 118043]
Distances to the 6 nearest neighbors of obs. 59504: [0 0.873 1.185 1.186 1.464 1.488]
Recall that in order to make a classification prediction, the K-neighbors algorithm finds the K nearest neighbors of each observation. Each neighbor is given a ‘vote’ via its class label, and the majority vote wins. Below are the labels (or votes) of either 0 (non-delinquent) or 1 (delinquent) for the 6 nearest neighbors of the random observations. Based on this voting, the delinquent observation would be classified correctly, as 3 of the 5 nearest neighbors (excluding itself) are also delinquent. The non-delinquent observation would also be classified correctly, with 4 of 5 neighbors voting non-delinquent.
Delinquency labels of nearest neighbors, non-delinquent observation: [0 1 0 0 0 0]
Delinquency labels of nearest neighbors, delinquent observation: [1 0 1 1 0 1]
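The neighbor lookup and majority vote can be reproduced on a tiny hand-checkable dataset (scikit-learn assumed; the data are invented purely for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny 1-D training set so the neighbor votes can be verified by hand
X_train = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
dist, idx = knn.kneighbors([[5.05]])  # distances and indices of the 3 nearest
pred = knn.predict([[5.05]])          # majority vote of their labels
print(idx[0], pred[0])
```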
5 Tree-Based Classifier
Tree based classifiers learn by segmenting the variable space into a number of distinct regions or nodes. This is accomplished via a process called recursive binary splitting. During this process observations are continuously split into two groups by selecting the variable and cutoff value that results in the highest node purity where purity is defined as the measure of variance across the two classes. The two most popular purity metrics are the gini index and cross entropy. A low value for these metrics indicates that the resulting node is pure and contains predominantly observations from the same class. Just like the nearest neighbor classifier, the decision tree classifier makes classification decisions by ‘votes’ from observations within each final node (known as the leaf node).
To illustrate how this works, a decision tree was created with the number of splitting rules (max depth) limited to 5. An excerpt of this tree can be found below. All 120,000 training examples start together in the top box. From top to bottom, each box shows the variable and splitting rule applied to the observations, the value of the gini metric, the number of observations the rule was applied to, and the current segmentation of the target variable. The first box indicates that the 6th variable (index 5 in the zero-based feature array), Borrower Credit Score, was used to split the training examples. Observations where the value of Borrower Credit Score was below or equal to -0.4413 follow the line to the box on the left. This box shows that 40,262 samples met the criteria. This box also holds the next splitting rule, also applied to the Borrower Credit Score variable. This process continues with X (Original LTV) and so on until the tree is finished growing to its depth of 5. The final segments at the bottom of the tree are the aforementioned leaf nodes, which are used to make classification decisions. When making a prediction on new observations, the same splitting rules are applied and the observation receives the label of the most commonly occurring class in its leaf node.
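A comparable depth-limited tree can be sketched as follows. The library (scikit-learn), the synthetic data, and the feature names are all illustrative assumptions, not the post's actual variables:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(8)
X = rng.normal(size=(2000, 3))
# credit_score (column 0) dominates the label, so it should be the root split
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)

# Limit the number of splitting levels, as in the post's illustration
tree = DecisionTreeClassifier(max_depth=5, criterion="gini", random_state=0)
tree.fit(X, y)
rules = export_text(tree, feature_names=["credit_score", "ltv", "dti"])
print(rules.splitlines()[0])  # the root splitting rule
```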
A more advanced tree-based classifier is the Random Forest Classifier. The Random Forest works by generating many individual trees, often hundreds or thousands. However, for each tree, the number of variables considered at each split is limited to a random subset. This helps reduce model variance and de-correlate the trees (since each tree will have a different set of available splitting choices). In our example, we fit a random forest classifier on the training data. The resulting hyperparameters and model documentation indicate that by default the model generates 10 trees, considers a random subset of variables the size of the square root of the total number of variables (approximately 10 in this case), has no depth limitation, and only requires each leaf node to have 1 observation.
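The configuration described above corresponds roughly to the following sketch (scikit-learn with synthetic data, both assumed). Note that recent scikit-learn versions default to 100 trees, so the 10-tree setting the post observed is made explicit here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 9))
y = (X[:, 0] > 0).astype(int)

# max_features="sqrt": each split considers a random subset of 3 of the 9 variables
forest = RandomForestClassifier(n_estimators=10, max_features="sqrt",
                                max_depth=None, min_samples_leaf=1,
                                random_state=0)
forest.fit(X, y)
print(len(forest.estimators_))  # the individual trees in the ensemble
```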
Since the random forest contains many trees and has no depth limitation, it is incredibly difficult to visualize. To better understand the model, a plot showing which variables were selected and produced the largest drops in the purity metric (Gini index) can be useful. Below are the top 10 most important variables in the model, ranked by the total (normalized) reduction to the Gini index. Intuitively, this plot shows which variables best segment the observations into groups that are predominantly one class, either delinquent or non-delinquent.
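A hedged sketch of how such an importance ranking is produced, using scikit-learn's `feature_importances_` attribute (which accumulates each variable's total normalized Gini reduction); the data here are synthetic, with the first column made deliberately informative:

```python
# Illustrative variable-importance ranking from a random forest.
# Data and column indices are synthetic, not the actual loan variables.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 12))
# Make the first column strongly predictive so it should rank highly.
y_train = (X_train[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)

# Importances are normalized to sum to 1; sorting them descending
# reproduces the "top variables" ranking described above.
ranking = np.argsort(rf.feature_importances_)[::-1]
```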
6 Model Evaluation
Now that the models have been fitted, their performance must be evaluated. To do this, the fitted model will first be used to generate predictions on the test set (X_test). Next, the predicted class labels are compared to the actual observed class label (y_test). Three of the most popular classification metrics that can be used to compare the predicted and actual values are recall, precision, and the f1-score. These metrics are calculated for each class, delinquent and not-delinquent.
Recall is calculated for each class as the ratio of events that were correctly predicted. More precisely, it is defined as the number of true positive predictions divided by the number of true positive predictions plus false negative predictions. For example, if the data had 10 delinquent observations and 7 were correctly predicted, recall for delinquent observations would be 7/10 or 70%.
Precision is the number of true positives divided by the number of true positives plus false positives. Precision can be thought of as the ratio of events correctly predicted to the total number of events predicted. In the hypothetical example above, assume that the model made a total of 14 predictions for the label delinquent. If so, then the precision for delinquent predictions would be 7/14 or 50%.
The f1 score is calculated as the harmonic mean of recall and precision: F1 = 2 × (Precision × Recall) / (Precision + Recall).
The classification reports for the K-neighbors and decision tree below show the precision, recall, and f1 scores for label 0 (non-delinquent) and 1 (delinquent).
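The three metrics can be computed directly with scikit-learn. The sketch below reuses the hypothetical numbers from the example above (10 delinquent loans, 7 correctly caught, 14 total delinquent predictions):

```python
# Recomputing the worked example with scikit-learn's metric functions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_test = [1] * 10 + [0] * 90                     # 10 delinquent observations
y_pred = [1] * 7 + [0] * 3 + [1] * 7 + [0] * 83  # 7 of 10 caught; 14 predicted delinquent

recall = recall_score(y_test, y_pred)        # 7 / 10 = 0.70
precision = precision_score(y_test, y_pred)  # 7 / 14 = 0.50
f1 = f1_score(y_test, y_pred)                # harmonic mean: 2*(0.5*0.7)/(0.5+0.7)
```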
There is no silver bullet for choosing a model; often it comes down to the goals of the implementation. In this situation, the tradeoff between identifying more delinquent loans and misclassifying non-delinquent ones can be analyzed with a specific tool called an ROC curve. When the model predicts a class label, a probability threshold is used to make the decision. By default this threshold is set at 50%, so that observations with more than a 50% chance of membership in a class are assigned to that class.
The majority vote (of the neighbor observations or the leaf node observations) determines the predicted label. ROC curves allow us to see the impact of varying this voting threshold by plotting the true positive prediction rate against the false positive prediction rate for each threshold value between 0% and 100%.
The area under the ROC curve (AUC) quantifies the model’s ability to distinguish between delinquent and non-delinquent observations. A model with no discriminating power will have an AUC of .5, since it assigns each observation an equal chance of belonging to either class. A perfect model will have an AUC of 1, as it is able to perfectly separate the classes.
To better illustrate, the ROC curves plotting the true positive and false positive rate on the held-out test set as the threshold is changed are plotted below.
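A minimal sketch of producing an ROC curve and its AUC with scikit-learn; the predicted probabilities below are illustrative, not the models' actual output:

```python
# Sweeping the classification threshold to build an ROC curve.
from sklearn.metrics import roc_curve, roc_auc_score

y_test = [0, 0, 1, 1, 0, 1, 0, 1]                     # 1 = delinquent
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]   # predicted P(delinquent)

# roc_curve varies the threshold and records TPR vs. FPR at each value.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)  # 0.5 = no skill, 1.0 = perfect
```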
7 Model Tuning
Up to this point the models have been built and evaluated using a single train/test split of the data. In practice this is often insufficient: a single split does not provide a robust estimate of the error on unseen data, and additional data are needed for model tuning. To solve both problems, it is common to train multiple instances of a model using cross validation. In K-fold cross validation, the training set is further divided into K portions, or folds. The model is trained on K-1 folds and then evaluated on the remaining fold, which serves as a validation set. This process is repeated K times, each time holding out a different fold to validate against. Once the model has been tuned using the train/validation splits, it is tested against the held-out test set just as before. As a general rule, once data have been used to make a decision about the model, they should never again be used for evaluation.
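The K-fold procedure can be sketched as follows, assuming scikit-learn and synthetic stand-in data (K = 5 here):

```python
# Hedged sketch of K-fold cross validation; model and data are stand-ins.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=10)
    model.fit(X[train_idx], y[train_idx])         # train on K-1 folds
    prob = model.predict_proba(X[val_idx])[:, 1]  # score the held-out fold
    scores.append(roc_auc_score(y[val_idx], prob))

mean_score = np.mean(scores)  # cross-validated estimate of model skill
```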
8 K-Nearest Neighbors Tuning
Below, a grid search approach is used to tune the K-nearest neighbors model. The first step is to define all of the hyperparameter values to try. For the KNN model, the list nk = [10, 50, 100, 150, 200, 250] specifies the numbers of nearest neighbors to try. This list is used by the function GridSearchCV to build a series of models, each using a different value from nk. By default, GridSearchCV uses 3-fold cross validation, meaning the model is evaluated on 3 train/validate splits of the data for each value in nk. Also specified in GridSearchCV is the scoring parameter used to evaluate each model, set in this instance to the metric discussed earlier, the area under the ROC curve. GridSearchCV returns the best performing model by default, which can then be used to generate predictions on the test set as before. Many more values of K could be searched, and the default Minkowski distance metric could be swapped for a series of alternatives; however, this comes at a cost of computation time that increases significantly with each added hyperparameter.
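A hedged sketch of this grid search: GridSearchCV and its parameters are real scikit-learn API, but the training data below are synthetic, and cv=3 is set explicitly rather than relying on the library default (which has changed across scikit-learn versions).

```python
# Grid search over the number of neighbors, scored by ROC AUC.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X_train = rng.normal(size=(600, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

nk = [10, 50, 100, 150, 200, 250]
cvknc = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": nk},
    scoring="roc_auc",  # evaluate each model by area under the ROC curve
    cv=3,               # 3 train/validate splits per value in nk
)
cvknc.fit(X_train, y_train)

best_k = cvknc.best_params_["n_neighbors"]  # best performing value of K
```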
In the plot below, the mean training and validation scores of the 3 cross-validated splits are plotted for each value of K. The plot indicates that at lower values of K the model was overfitting the training data, producing lower validation scores. As K increases, the training score falls but the validation score rises because the model becomes better at generalizing to unseen data.
9 Random Forest Tuning
There are many hyperparameters that can be adjusted to tune the random forest model. We use three in our example: n_estimators, max_features, and min_samples_leaf. The n_estimators parameter refers to the number of trees to be created. This value can be increased substantially, so the search space is set in the list estimators. Random forests are generally very robust to overfitting, and it is not uncommon to train a classifier with more than 1,000 trees. Second, the number of variables randomly considered at each split can be tuned via max_features. A smaller number of random features helps de-correlate the trees in the forest, which is especially useful when multicollinearity is present. We tried a number of different values for max_features, which can be found in the list features. Finally, the number of observations required in each leaf node is tuned via the min_samples_leaf parameter and the list samples.
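An illustrative version of the random forest grid search; the value lists below are deliberately much smaller than those described in the post so the sketch runs quickly, and the data are synthetic:

```python
# Grid search over three random forest hyperparameters.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X_train = rng.normal(size=(300, 8))
y_train = (X_train[:, 0] > 0).astype(int)

estimators = [10, 50]  # n_estimators: number of trees
features = [2, 5]      # max_features: random variables considered per split
samples = [1, 5]       # min_samples_leaf: observations required per leaf

cvrfc = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": estimators,
        "max_features": features,
        "min_samples_leaf": samples,
    },
    scoring="roc_auc",
    cv=3,
)
cvrfc.fit(X_train, y_train)
best_params = cvrfc.best_params_  # best combination across the whole grid
```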
The resulting plot, below, shows a subset of the grid search results. Specifically, it shows the mean test score for each number of trees and leaf size when the number of random features considered at each split is limited to 5. The plot demonstrates that the best performance occurs with 500 trees and a requirement of at least 5 observations per leaf. To see the best performing model from the entire grid space the best estimator method can be used.
By default, the parameters of the best estimator are assigned to the GridSearch object (cvknc and cvrfc). This object can now be used to generate future predictions or predicted probabilities. In our example, the tuned models are used to generate predicted probabilities on the held-out test set. The resulting ROC curves show an improvement in the KNN model from an AUC of .62 to .75. Likewise, the tuned Random Forest AUC improves from .64 to .77.
Predicting loan delinquency using only origination data is not an easy task. Presumably, if significant signal existed in the data, it would trigger a change in strategy by MBS investors and, ultimately, origination practices. Nevertheless, this exercise demonstrates the capability of a machine learning approach to deconstruct such an intricate problem and suggests the appropriateness of using machine learning models to tackle these and other risk management data challenges relating to mortgages and a potentially wide range of asset classes.
Big Data in Small Dimensions: Machine Learning Methods for Data Visualization
Analysts and data scientists are constantly seeking new ways to parse increasingly intricate datasets, many of which are deemed “high dimensional”, i.e., contain many (sometimes hundreds or more) individual variables. Machine learning has recently emerged as one such technique due to its exceptional ability to process massive quantities of data. A particularly useful machine learning method is t-distributed stochastic neighbor embedding (t-SNE), used to summarize very high-dimensional data using comparatively few variables. T-SNE visualizations allow analysts to identify hidden structures that may have otherwise been missed.
Traditional Data Visualization
The first step in tackling any analytical problem is to develop a solid understanding of the dataset in question. This process often begins with calculating descriptive statistics that summarize useful characteristics of each variable, such as the mean and variance. Also critical to this pursuit is the use of data visualizations that can illustrate the relationships between observations and variables and can identify issues that must be corrected. For example, the chart below shows a series of pairwise plots between a set of variables taken from a loan-level dataset. Along the diagonal axis the distribution of each individual variable is plotted.
The plot above is useful for identifying pairs of variables that are highly correlated as well as variables that lack variance, such as original loan term. When dealing with a larger number of variables, heatmaps like the one below can summarize the relationships between the data in a compact way that is also visually intuitive.
The statistics and visualizations described so far are helpful for summarizing and identifying issues, but they often fall short in telling the entire narrative of the data. One issue that remains is a lack of understanding of the underlying structure of the data. Gaining this understanding is often key to selecting the best approach for problem solving.
Enhanced Data Visualization with Machine Learning
Humans can visualize observations plotted with up to three variables (dimensions), but with the exponential rise in data collection it is now unusual to deal with only a handful of variables. Thankfully, new machine learning methods can help overcome this limitation and deliver insights that were previously out of reach.
T-SNE is a type of non-linear dimensionality reduction algorithm. While this is a mouthful, the idea behind it is straightforward: t-SNE takes data that exists in very high dimensions and produces a plot in two or three dimensions that can be observed. The plot in low dimensions is created in such a way that observations close to each other in high dimensions remain close together in low dimensions. Additionally, t-SNE has proven to be good at preserving both the global and local structures present within the data [1], which is of critical importance.
The full technical details of t-SNE are beyond the scope of this blog, but a simplified version of the steps for t-SNE are as follows:
- Compute the Euclidean distance between each pair of observations in high-dimensional space.
- Using a Gaussian distribution, convert the distance between each pair of observations into a probability that represents similarity between the points.
- Randomly place the observations into low-dimensional space (usually 2 or 3 dimensions).
- Compute the distance and similarity (as in steps 1 and 2) for each pair of observations in the low-dimensional space. Crucially, in this step a Student t-distribution is used instead of a normal Gaussian.
- Using gradient based optimization, iteratively nudge the observations in the low-dimensional space in such a way that the probabilities between pairs of observations are as close as possible to the probabilities in high dimensions.
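The steps above can be sketched with scikit-learn's TSNE implementation. The two well-separated synthetic clusters below stand in for real high-dimensional loan data, and init="random" mirrors the random placement in step three:

```python
# Minimal t-SNE sketch: map 40-dimensional points down to 2 dimensions.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
# Two well-separated clusters in 40 dimensions.
X_high = np.vstack([
    rng.normal(0, 1, size=(100, 40)),
    rng.normal(8, 1, size=(100, 40)),
])

# Nearby points in 40-D should remain nearby in the 2-D embedding;
# the random init corresponds to step three of the procedure above.
X_low = TSNE(
    n_components=2, perplexity=30, init="random", random_state=0
).fit_transform(X_high)
```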
Two key considerations are the use of the Student t-distribution in step four (as opposed to the Gaussian in step two) and the random initialization of the data points in low-dimensional space. The t-distribution is critical to the success of the algorithm for multiple reasons, perhaps most importantly because it allows clusters that initially start far apart to re-converge [2]. Given the random initialization of the points in low-dimensional space, it is common practice to run the algorithm multiple times with the same parameters to find the best mapping and to ensure that the gradient descent optimization does not get stuck in a local minimum.
We applied t-SNE to a loan-level dataset comprised of approximately 40 variables. The loans are a random sample of originations from every quarter dating back to 1999. T-SNE was used to map the data into just three dimensions and the resulting plot was color-coded based on the year of origination.
In the interactive visualization below many clusters emerge. Rotating the figure reveals that some clusters are comprised predominantly of loans within similar origination years (groups of same-colored data points). Other clusters are less well-defined or contain a mix of origination years. Using this same method, we could choose to color loans with other information that we may wish to explore. For example, a mapping showing clusters related to delinquencies, foreclosure, or other credit loss events could prove tremendously insightful. For a given problem, using information from a plot such as this can enhance the understanding of the problem separability and enhance the analytical approach.
Crucial to the t-SNE mapping is a parameter set by the analyst called perplexity, which should be roughly equal to the number of expected nearby neighbors for each data point. Therefore, as the value of perplexity increases, the number of resulting clusters should generally decrease and vice versa. When implementing t-SNE, various perplexity parameters should be tried as the appropriate value is generally not known beforehand. The plot below was produced using the same dataset as before but with a larger value of perplexity. In this plot four distinct clusters emerge, and within each cluster loans of similar origination years group closely together.
Private-Label Securities – Technological Solutions to Information Asymmetry and Mistrust
At its heart, the failure of the private-label residential mortgage-backed securities (PLS) market to return to its pre-crisis volume is a failure of trust. Virtually every proposed remedy, in one way or another, seeks to create an environment in which deal participants can gain reasonable assurance that their counterparts are disclosing information that is both accurate and comprehensive. For better or worse, nine-figure transactions whose ultimate performance will be determined by the manner in which hundreds or thousands of anonymous people repay their mortgages cannot be negotiated on the basis of a handshake and reputation alone. The scale of these transactions makes manual verification both impractical and prohibitively expensive. Fortunately, the convergence of a stalled market with new technologies presents an ideal time for change and renewed hope to restore confidence in the system.
Trust in Agency-Backed Securities vs Private-Label Securities
Ginnie Mae guaranteed the world’s first mortgage-backed security nearly 50 years ago. The bankers who packaged, issued, and invested in this MBS could scarcely have imagined the technology that is available today. Trust, however, has never been an issue with Ginnie Mae securities, which are collateralized entirely by mortgages backed by the federal government—mortgages whose underwriting requirements are transparent, well understood, and consistently applied.
Further, the security itself is backed by the full faith and credit of the U.S. Government. This degree of “belt-and-suspenders” protection afforded to investors makes trust an afterthought and, as a result, Ginnie Mae securities are among the most liquid instruments in the world.
Contrast this with the private-label market. Private-label securities, by their nature, will always carry a higher degree of uncertainty than Ginnie Mae, Fannie Mae, and Freddie Mac (i.e., “Agency”) products, but uncertainty is not the problem. All lending and investment involves uncertainty. The problem is information asymmetry—where not all parties have equal access to the data necessary to assess risk. This asymmetry makes it challenging to price deals fairly and is a principal driver of illiquidity.
Using Technology to Garner Trust in the PLS Market
In many transactions, ten or more parties contribute in some manner to verifying and validating data, documents, or cash flow models. In order to overcome asymmetry and restore liquidity, the market will need to refine (and in some cases identify) technological solutions to, among other challenges, share loan-level data with investors, re-envision the due diligence process, and modernize document custody.
During SFIG’s Residential Mortgage Finance symposium last month, RiskSpan moderated a panel that featured significant discussion around loan-level disclosures. At issue was whether the data required by the SEC’s Regulation AB provided investors with all the information necessary to make an investment decision. Specifically debated was the mortgaged property’s zip code, which provides investors valuable information on historical valuation trends for properties in a given geographic area.
Privacy advocates question the wisdom of disclosing full, five-digit zip codes. Particularly in sparsely populated areas where zip codes contain a relatively small number of addresses, knowing the zip code along with the home’s sale price and date (which are both publicly available) can enable unscrupulous data analysts to “triangulate” in on an individual borrower’s identity and link the borrower to other, more sensitive personal information in the loan-level disclosure package.
The SEC’s “compromise” is to require disclosing only the first two digits of the zip code, which provide a sense of a property’s geography without the risk of violating privacy. Investors counter that two-digit zip codes do not provide nearly enough granularity to make an informed judgment about home-price stability (and with good reason—some states are covered entirely by a single two-digit zip code).
The competing demands of disclosure and privacy can be satisfied in large measure by technology. Rather than attempting to determine which individual data fields should be included in a loan-level disclosure (and then publishing them on the SEC’s EDGAR site for all the world to see), the market ought to be able to develop a technology whereby a secure, encrypted, password-protected copy of the loan documents (including the loan application, tax documents, pay stubs, bank statements, and other relevant income, employment, and asset verifications) is made available on a need-to-know basis to qualified PLS investors who share in the responsibility for safeguarding the information.
Due Diligence Review
Using technology to improve the transparency of the due diligence process may also increase investor trust, particularly in the representation and warranty review process. Providing investors with a secure view of the loan-level documentation used to underwrite and close the underlying mortgage loans, as described above, may reduce the scope of due diligence review as it exists in today’s market. Technology companies, which today support initiatives such as Fannie Mae’s “Day 1 Certainty” program, promise to further disrupt the due diligence process in the future. Through automation, the due diligence process becomes less burdensome, fosters confidence in the underwriting process, and reduces costs while bringing representation and warranty relief.
Today’s insistence on 100% file reviews in many cases is perhaps the most obvious evidence of the lack of trust across transactions. Investors will likely always require some degree of assurance that they are getting what they pay for in terms of collateral. However, an automated verification process for income, assets, and employment will launch the industry forward with investor confidence. Should any reconciliation of individual loan file documentation with data files be necessary, results of these reconciliations could be automated and added to a secure blockchain accessible only via private permissions. Over time, investors will become more comfortable with the reliability of the electronic data files describing the mortgage loans submitted to them.
The same technology could be implemented to allow investors to view supporting documents when reps and warrants are triggered and a review of the underlying loan documents needs to be completed.
Smart document technologies also have the potential to improve the transparency of the document custody process. At some point the industry is going to have to move beyond today’s humidity-controlled file cabinets and vaults, where documents are obtained and viewed only on an exception basis or when loans are paid off. Adding loan documents that have been reviewed and accepted by the securitization’s document custodian to a secure, permissioned blockchain will allow investors in the securities to view and verify collateral documents whenever questions arise without going to the time and expense of retrieving paper from the custodian’s vault.
Securitization makes mortgages and other types of borrowing affordable for a greater population by leveraging the power of global capital markets. Few market participants view mortgage loan securitization dominated by government corporations and government-sponsored enterprises as a desirable permanent solution. Private markets, however, are going to continue to lag markets that benefit from implicit and explicit government guarantees until improved processes, supported by enhanced technologies, are successful in bridging gaps in trust and information asymmetry.
With trust restored, verified by technology, the PLS market will be better positioned to support housing financing needs not supported by the Agencies.
Tuning Machine Learning Models
Tuning is the process of maximizing a model’s performance without overfitting or introducing excessive variance. In machine learning, this is accomplished by selecting appropriate “hyperparameters.”
Hyperparameters can be thought of as the “dials” or “knobs” of a machine learning model. Choosing an appropriate set of hyperparameters is crucial for model accuracy, but can be computationally challenging. Hyperparameters differ from other model parameters in that they are not learned by the model automatically through training methods. Instead, these parameters must be set manually. Many methods exist for selecting appropriate hyperparameters. This post focuses on three:
- Grid Search
- Random Search
- Bayesian Optimization
Grid Search, also known as parameter sweeping, is one of the most basic and traditional methods of hyperparametric optimization. This method involves manually defining a subset of the hyperparametric space and exhausting all combinations of the specified hyperparameter subsets. Each combination’s performance is then evaluated, typically using cross-validation, and the best performing hyperparametric combination is chosen.
For example, say you have two continuous parameters α and β, where manually selected values for the parameters are the following:
Then the pairing of the selected hyperparametric values, H, can take on any of the following:
Grid search will examine each pairing of α and β to determine the best performing combination. The resulting pairs, H, are simply the outputs of taking the Cartesian product of α and β. While straightforward, this “brute force” approach to hyperparameter optimization has some drawbacks. Higher-dimensional hyperparametric spaces are far more time consuming to test than the simple two-dimensional problem presented here. Also, because there will always be a fixed number of training samples for any given model, the model’s predictive power will decrease as the number of dimensions increases. This is known as the Hughes phenomenon.
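A toy illustration of the Cartesian product behind grid search; the values of α and β and the scoring function below are hypothetical stand-ins, not the values from the example above:

```python
# Exhaustive "brute force" search over every pairing of two hyperparameters.
import itertools

alpha = [0.01, 0.1, 1.0]  # hypothetical candidate values for α
beta = [1, 10]            # hypothetical candidate values for β

# H is every pairing of alpha and beta: the Cartesian product, 3 x 2 = 6.
H = list(itertools.product(alpha, beta))

def score(a, b):
    # Stand-in for cross-validated performance; peaks at (0.1, 10).
    return -((a - 0.1) ** 2 + (b - 10) ** 2)

best = max(H, key=lambda h: score(*h))  # best performing combination
```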
Random search methods resemble grid search methods but tend to be less expensive and time consuming because they do not examine every possible combination of parameters. Instead of testing on a predetermined subset of hyperparameters, random search, as its name implies, randomly selects a chosen number of hyperparametric pairs from a given domain and tests only those. This greatly simplifies the analysis without significantly sacrificing optimization. For example, if the region of hyperparameters that are near optimal occupies at least 5% of the grid, then random search with 60 trials will find that region with high probability (95%).
To illustrate, imagine a 15 x 30 grid of two hyperparameter values and their resulting scores ranging from 0-10, where 10 is the most optimal hyperparametric pairing (Table 1).
Table 1 – Grid of Hyperparameter Values and Scores
Each randomly sampled point has a 95% chance of missing the desired interval (the near-optimal region covering 5% of the grid), so the probability that all 60 independent samples miss it is 0.95^60, or roughly 4.6%. Therefore, the probability that at least one of them succeeds in hitting the desired interval is 1 minus that quantity, or approximately 95.4%. In other words, in a scenario where the desired interval around the true maximum covers 5% of the hyperparameter space, sampling just 60 points will yield a sufficient hyperparameter pairing about 95% of the time.
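The arithmetic behind the 60-trial claim can be checked directly; each independent draw misses a region covering 5% of the space with probability 0.95:

```python
# Probability that at least one of 60 random draws lands in the
# near-optimal region occupying 5% of the hyperparameter space.
p_miss_all = 0.95 ** 60   # all 60 independent samples miss the region
p_hit = 1 - p_miss_all    # at least one sample hits it (~0.954)
```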
There are two main benefits to using the random search method. The first is that a budget can be chosen independent of the number of parameters and possible values. Based on how much time and computing resources you have available, random search allows you to choose a sample size that conforms to a budget but still allows for a representative sample of the hyperparameter space. The second benefit is that adding parameters that do not influence performance does not decrease efficiency.
The idea behind Bayesian Optimization is fundamentally different from grid and random search. This process builds a probabilistic model for a given function and analyzes this model to make decisions about where to next evaluate the function. There are two main components under the Bayesian optimization framework.
- A prior function that captures the behavior of the unknown objective function and an observation model that describes the data generation mechanism.
- A loss function, or acquisition function, that describes how optimal a sequence of queries is, usually taking the form of regret.
The most common selection for a prior function in Bayesian Optimization is the Gaussian process (GP) prior. This is a particular kind of statistical model where observations occur in a continuous domain. In a Gaussian process, every point in the defined continuous input space is associated with a normally distributed random variable. Additionally, every finite linear combination of those random variables has a multivariate normal distribution.
There are a number of options when choosing an acquisition function. The choice requires a trade-off between exploration of the entire search space and exploitation of currently promising areas.
Probability of Improvement
One approach is to choose an improvement-based acquisition function, which favors points that are likely to improve upon an incumbent target. This strategy involves maximizing the probability of improving (PI) over the best current value. If using a Gaussian posterior distribution, this can be calculated as follows:
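The formula itself is not reproduced above; the standard expression of PI under a Gaussian process posterior, written here as a reconstruction in conventional notation, is:

```latex
\mathrm{PI}(x) \;=\; P\!\left(f(x) \ge f(x^{+})\right)
\;=\; \Phi\!\left(\frac{\mu(x) - f(x^{+})}{\sigma(x)}\right)
```

where $\mu(x)$ and $\sigma(x)$ are the posterior mean and standard deviation at $x$, $f(x^{+})$ is the best value observed so far, and $\Phi$ is the standard normal CDF.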
In each iteration, the probability of improving is maximized to select the next query point. Although the probability of improvement can perform very well when the target is known, using this method for an unknown target causes the PI to lose reliability.
Expected Improvement
Another strategy is to maximize the expected improvement (EI) over the current best. Unlike the probability of improvement function, the expected improvement also incorporates the amount of improvement. Assuming a Gaussian process, this can be calculated as follows:
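As above, the formula is not reproduced; the standard closed form of EI under a Gaussian process posterior, given here as a reconstruction in conventional notation, is:

```latex
\mathrm{EI}(x) \;=\; \left(\mu(x) - f(x^{+})\right)\Phi(Z) \;+\; \sigma(x)\,\phi(Z),
\qquad Z = \frac{\mu(x) - f(x^{+})}{\sigma(x)}
```

for $\sigma(x) > 0$ (EI is zero where $\sigma(x) = 0$), with $\phi$ the standard normal PDF. The first term rewards improving on the incumbent best; the second rewards uncertainty, which is what distinguishes EI from the pure probability of improvement.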
Gaussian Process Upper Confidence Bound
Another method takes the idea of exploiting lower confidence bounds (upper, when considering maximization) to construct acquisition functions that minimize regret over the course of the optimization. This requires the user to define an additional tuning value, commonly denoted κ, which balances exploration and exploitation. The lower confidence bound (LCB) for a Gaussian process is defined as follows:
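The definition itself is not reproduced above; the conventional form, writing the tuning value as $\kappa$, is:

```latex
\mathrm{LCB}(x) \;=\; \mu(x) - \kappa\,\sigma(x)
```

with the corresponding upper confidence bound $\mu(x) + \kappa\,\sigma(x)$ used when maximizing; larger values of $\kappa$ favor exploration over exploitation.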
There are a few limitations to consider when choosing Bayesian Optimization over other hyperparameter optimization methods. The power of the Gaussian process depends highly on the covariance function, and it is not always clear what the appropriate covariance function choice should be. Another factor to consider is that the function evaluation itself may involve a time-consuming optimization procedure. It’s important to find the best hyperparameters for your model, but in many cases, the complexity associated with finding the best hyperparameters using Bayesian Optimization may exceed the project’s established budget. If possible, one should always consider utilizing parallel computing when performing this technique to maximize computing resources and cut back on time.
Choosing an appropriate set of hyperparameters is crucial for machine learning model accuracy. We have discussed three different approaches for selecting hyperparameter values and the trade-offs associated with choosing one optimization method over another. Time, budget, and computing resources are all factors to consider when choosing a method. Small hyperparameter spaces and loose constraints on budget and computing resources may make grid search the best option. For larger hyperparameter spaces or tighter computing constraints, a simple random search with a sufficient sample size or a Bayesian optimization technique may be more appropriate.