Machine Learning Detects Model Validation Blind Spots
Machine learning represents the next frontier in model validation, particularly in credit and prepayment modeling. Financial institutions rely on numerous models to predict the performance of mortgage-backed securities (MBS). Validating these models by assessing their predictions is essential, but even models that appear to perform well on summary statistics can have subsets of the input space (input subspaces) for which they perform poorly. Isolating these “blind spots” is difficult with conventional model validation techniques, but recently developed machine learning algorithms make the job easier and the results more reliable.
High-Error Subspace Visualization
The modeling team at RiskSpan has developed a statistical algorithm that identifies high-error subspaces and flags model outputs whose inputs originate from them, indicating to model users that the results may be unreliable. We also address an extension of this problem: whether migration of data points toward more error-prone subspaces of the input space over time can indicate a macroeconomic regime shift and signal a need to re-estimate the model. Monitoring this migration helps keep model efficacy from declining unnoticed over time.
Due to the high-dimensional nature of the input spaces of many financial models, traditional statistical methods of partitioning data may prove inadequate. Using machine learning techniques, we have developed a more robust method of high-error subspace identification. We develop the algorithm using loan performance model data, but the method is adaptable to generic models.
Data Selection and Preparation
The dataset we use for our analysis is a random sample of the publicly available Freddie Mac Loan-Level Dataset. The entire dataset covers the monthly loan performance for loans originated from 1999 to 2016 (25.4 million fixed-rate mortgages). From this set, one million loans were randomly sampled. Features of this dataset include loan-to-value ratio, borrower debt-to-income ratio, borrower credit score, interest rate, and loan status, among others. We aggregate the monthly status vectors for each loan into a single vector which contains a loan status time series over the life of the loan within the historical period. This aggregated status vector is mapped to a value of 1 if the time series indicates the loan was ever 90 days delinquent within the first three years after its origination, representing a default, and 0 otherwise. This procedure results in 914,802 total records.
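The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `default_label`, the integer encoding of delinquency status (0 = current, 1 = 30 days, 2 = 60 days, 3 = 90 days), and the assumption that each list starts with the first month after origination are all hypothetical.

```python
from typing import Sequence


def default_label(
    monthly_status: Sequence[int],
    window_months: int = 36,
    dq_threshold: int = 3,
) -> int:
    """Return 1 if the loan was ever 90+ days delinquent in its first
    `window_months` months, else 0.

    Assumed (hypothetical) encoding: monthly_status[i] is the number of
    30-day delinquency buckets in month i + 1 of the loan's life, so a
    value of 3 means 90 days delinquent.
    """
    early_history = monthly_status[:window_months]
    return int(any(status >= dq_threshold for status in early_history))


# Illustrative, made-up status histories (not taken from the dataset).
loans = {
    "A": [0, 0, 1, 2, 3, 0],  # reaches 90 days delinquent in month 5
    "B": [0] * 40 + [3],      # 90 days delinquent only in month 41, outside the window
    "C": [0, 1, 0, 2, 1, 0],  # never worse than 60 days delinquent
}
labels = {loan_id: default_label(history) for loan_id, history in loans.items()}
```

Real status fields may also contain non-numeric codes, which would need to be cleaned or mapped before a rule like this is applied.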
Using the prepared loan dataset, we estimate a logistic regression loan performance model. The data are partitioned into training and test datasets for the clustering analysis. The model estimation and training data are taken from loans originated from 1999 to 2007, while loans originated from 2008 to 2016 are used for testing. Once the data have been partitioned, a clustering algorithm is run on the training data.
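A minimal sketch of this out-of-time partition follows. The record layout (a dictionary with an `orig_year` key) and the function name are assumptions made for illustration; the point is only that the split is by origination date rather than at random.

```python
from typing import Iterable, List, Tuple


def split_by_origination_year(
    loans: Iterable[dict],
    last_training_year: int = 2007,
) -> Tuple[List[dict], List[dict]]:
    """Partition loans into (train, test) by origination year."""
    train, test = [], []
    for loan in loans:
        if loan["orig_year"] <= last_training_year:
            train.append(loan)
        else:
            test.append(loan)
    return train, test


# Made-up records using the hypothetical layout described above.
sample = [
    {"loan_id": "A", "orig_year": 1999},
    {"loan_id": "B", "orig_year": 2007},
    {"loan_id": "C", "orig_year": 2008},
    {"loan_id": "D", "orig_year": 2016},
]
train_loans, test_loans = split_by_origination_year(sample)
```

Because both the model and the clusters are fit on the earlier period only, the later period acts as a genuine out-of-time test.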
Two-Dimensional Visualization of Select Clusters
The clustering is evaluated based upon its ability to stratify the loan data into clusters that meaningfully identify regions of the input for which the model performs poorly. This requires the average model performance error associated with certain clusters to be substantially higher than the mean. After the training data is assigned to clusters, cluster-level error is computed for each cluster using the logistic regression model. Clusters with high error are flagged based upon a scoring scheme. Each loan in the test set is assigned to a cluster based upon its proximity to the training cluster centers. Loans in the test set that are assigned to flagged clusters are flagged, indicating that the loan comes from a region for which loan performance model predictions exhibit lower accuracy.
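The steps in this paragraph can be sketched with NumPy as follows. Several pieces are assumptions, since the article does not specify them: k-means stands in for the clustering algorithm, misclassification rate stands in for cluster-level error, and the flagging rule (error more than `n_std` standard deviations above the mean cluster error) stands in for the scoring scheme. All function names are hypothetical.

```python
import numpy as np


def assign_to_nearest(X, centers):
    """Index of the closest center (squared Euclidean distance) for each row of X."""
    diffs = X[:, None, :] - centers[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means, standing in for the unspecified clustering step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign_to_nearest(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return centers, assign_to_nearest(X, centers)


def cluster_error_rates(labels, y_true, y_pred, k):
    """Misclassification rate of the model within each cluster (NaN if empty)."""
    labels = np.asarray(labels)
    wrong = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    return np.array(
        [wrong[labels == j].mean() if np.any(labels == j) else np.nan
         for j in range(k)]
    )


def flag_high_error_clusters(error_rates, n_std=1.0):
    """Assumed scoring rule: flag clusters whose error rate exceeds the mean
    cluster error rate by more than n_std standard deviations."""
    threshold = np.nanmean(error_rates) + n_std * np.nanstd(error_rates)
    return np.flatnonzero(error_rates > threshold)


def flag_test_loans(X_test, centers, flagged_clusters):
    """Boolean mask: True where a test loan's nearest training center is flagged."""
    return np.isin(assign_to_nearest(X_test, centers), flagged_clusters)
```

In use, `kmeans` would be run on the (standardized) training features, `cluster_error_rates` computed from the logistic regression's training predictions, and `flag_test_loans` applied to the later-period loans. The pairwise distance computation here is memory-hungry, so a library implementation would be preferable at the scale of the full dataset.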
Algorithm Performance Analysis
The clustering algorithm successfully flagged high-error regions of the input space, with flagged test clusters exhibiting accuracy more than one standard deviation below the mean. The high errors associated with clusters flagged during model training were persistent over time, with flagged clusters in the test set having a model accuracy of just 38.7%, compared to an accuracy of 92.1% for unflagged clusters. Failure to address observed high-error clusters in the training set and migration of data to high-error subspaces led to substantially diminished model accuracy, with overall model accuracy dropping from 93.9% in the earlier period to 84.1% in the later period.
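Both quantities discussed above, accuracy split by flag status and the migration of loans into flagged clusters over time, are simple to track once each loan carries a flag. The sketch below is one hypothetical way to compute them; the function names and the toy inputs are illustrative and are not the study's data.

```python
from collections import defaultdict


def flagged_share_by_period(periods, is_flagged):
    """Fraction of loans in flagged clusters for each period (e.g. origination year)."""
    counts = defaultdict(lambda: [0, 0])  # period -> [flagged count, total count]
    for period, flagged in zip(periods, is_flagged):
        counts[period][0] += int(flagged)
        counts[period][1] += 1
    return {p: n_flagged / total for p, (n_flagged, total) in sorted(counts.items())}


def accuracy_by_flag(y_true, y_pred, is_flagged):
    """Model accuracy computed separately for flagged and unflagged loans."""
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for truth, pred, flagged in zip(y_true, y_pred, is_flagged):
        key = bool(flagged)
        totals[key] += 1
        hits[key] += int(truth == pred)
    return {
        "flagged": hits[True] / totals[True] if totals[True] else None,
        "unflagged": hits[False] / totals[False] if totals[False] else None,
    }


# Made-up inputs for illustration only.
periods = [2008, 2008, 2009, 2009, 2009, 2009]
flags = [False, True, True, True, True, False]
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0]

shares = flagged_share_by_period(periods, flags)
accuracy = accuracy_by_flag(y_true, y_pred, flags)
```

A rising flagged share across periods would be the kind of signal that suggests re-estimating the model before overall accuracy deteriorates.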
Training/Test Cluster Error Comparison
We also determined the nature of the default misclassifications and the variables with the greatest impact on them. Cluster FICO score proved to be a strong indicator of cluster-level prediction accuracy. While a relatively large proportion of loans in low-FICO clusters defaulted, the logistic regression model substantially overpredicted defaults for these clusters, producing a large number of Type I errors (inaccurate default predictions). Type II errors (inaccurate non-default predictions) constituted a smaller proportion of overall model error, and their impact diminishes further when measured against the number of true negative predictions (accurate non-default predictions), which far outnumber true positive predictions (accurate default predictions).
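To make the error types concrete, the sketch below tallies true positives, Type I errors (false positives), true negatives, and Type II errors (false negatives) within each cluster, treating default (1) as the positive class. The function name and the toy inputs are hypothetical.

```python
from collections import defaultdict


def error_breakdown_by_cluster(cluster_ids, y_true, y_pred):
    """Count tp, fp (Type I), tn and fn (Type II) per cluster, with default = 1
    as the positive class."""
    outcome_names = {(1, 1): "tp", (0, 1): "fp", (0, 0): "tn", (1, 0): "fn"}
    breakdown = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for cluster, truth, pred in zip(cluster_ids, y_true, y_pred):
        breakdown[cluster][outcome_names[(truth, pred)]] += 1
    return dict(breakdown)


# Made-up example: cluster 1 plays the role of a low-FICO cluster in which
# the model overpredicts default.
clusters = [0, 0, 0, 1, 1, 1]
y_true = [0, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 1, 1, 1]
breakdown = error_breakdown_by_cluster(clusters, y_true, y_pred)
```

Comparing the `fp` and `fn` counts cluster by cluster shows whether a high-error cluster is driven by overpredicted or underpredicted defaults.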
FICO vs. Cluster Accuracy
Our application of the subspace error identification algorithm to a loan performance model illustrates two dangers: relying on high-level summary statistics as the sole measure of model efficacy, and failing to consistently monitor the statistical profile of model input data over time. More advanced statistical analysis is often required to understand model performance comprehensively. The algorithm identified sets of loans for which the model systematically misclassified default status. Such large-scale errors come at a high cost to the financial institutions that rely on these models.
As an extension of this research into high-error subspace detection, RiskSpan is currently developing machine learning analytics tools that can detect the root causes of systematic model errors and suggest ways to improve predictive model performance by reducing these errors.