
Blog Archives

The Surging Reverse Mortgage Market

Momentum continues to build around reverse mortgages and related products. Persistent growth in both home prices and the senior population has stoked renewed interest and discussion about the most appropriate uses of accumulated home equity in financial planning strategies. A common and superficial way to think of reverse mortgages is as a “last-resort” means of covering expenses when more conventional planning tools prove insufficient. But experts increasingly are not thinking of reverse mortgages in this way. Last week, the American College of Financial Services and the Bipartisan Policy Center hosted the 2018 Housing Wealth in Retirement Symposium.  Speakers represented policy research think tanks, institutional asset managers, large banks, and AARP.  Notwithstanding the diversity of viewpoints, virtually every speaker reiterated a position that financial planners have posited for years: financial products that leverage home equity should, in many cases, be integrated into comprehensive retirement planning strategies, rather than being reserved as a product of last resort.

Senior Home Equity Continues Trending Upward

The National Reverse Mortgage Lenders Association (NRMLA) and RiskSpan have published the Reverse Mortgage Market Index (RMMI) since the beginning of 2000. The RMMI provides a trending measure of home equity of U.S. homeowners age 62 and older. The RMMI defines senior home equity as the difference between the aggregate value of homes owned and occupied by seniors and the aggregate mortgage balance secured by those homes. This measure enables the RMMI to help gauge the potential market size of those who may be qualified for a reverse mortgage product. The chart below illustrates the steady increase in this index since the end of the 2008 recession. It reached its latest all-time high in the most recent quarter (Q4 2017). Increasing house prices drive this trend, mitigated to some extent by a corresponding modest increase in mortgage debt held by seniors. The most recent RMMI report is published on NRMLA’s website. As summarized below by the Urban Institute, home equity can be extracted through many mechanisms, primarily Federal Housing Administration (FHA)–insured Home Equity Conversion Mortgages (HECMs), closed-end home equity loans, home equity lines of credit (HELOCs), and cash-out refinancing.

Share of Homeowners Who Extracted Home Equity by Strategy

The Urban Institute research goes on to point out that although few seniors have extracted home equity to date, the market is potentially very large (as reflected by the RMMI index) and more extraction is likely in the years ahead as the senior population both grows and ages. The data in the following chart confirm what one might reasonably expect—that younger seniors are more likely to have existing mortgages than older seniors.

 

Reverse Mortgage as Retirement Planning Tool

Looking at senior home equity in the context of overall net worth lends support to financial planners’ view of products like reverse mortgages as more than something on which to fall back as a last resort. The first three rows of data in the table below contain the median net worth by age cohort in 2013 and 2016, respectively, from the Federal Reserve Board’s Survey of Consumer Finances. The bottom row, highlighted in yellow, is the estimated average senior home equity (total senior home equity as computed by the RMMI divided by senior population) for the same years. We acknowledge the imprecision inherent in this comparison due to the statistical method used (median vs. average) and certain data limitations on RMMI (addressed below). Additionally, the net worth figures may include non-homeowners. Nonetheless, home equity is an unignorably important component of senior net worth.

Following the release of the Federal Reserve’s 2016 Survey of Consumer Finances https://www.federalreserve.gov/econres/scfindex.htm, the Urban Institute published a summary research paper “What the 2016 Survey of Consumer Finances Tells Us about Senior Homeowners” https://www.urban.org/sites/default/files/publication/94526/what-the-2016-survey-of-consumer-finances-tells-us-about-senior-homeowners.pdf in November 2017.  The paper notes that “Worries about retirement security are rooted in several factors, such as Social Security changes that shrink the share of preretirement earnings replaced by the program (Munnell and Sundén 2005), rising medical and long-term care costs (Johnson and Mommaerts 2009, 2010), student loan burdens, and the shift from employer-sponsored defined-benefit pension plans that guarantee lifetime income to 401(k)-type defined-contribution plans whose account balances depend on employee contributions and uncertain investment returns (Munnell 2014; Munnell and Sundén 2005). In addition, increased life expectancies require retirement savings to last longer.”

The financial position of seniors is evolving.  Forty-one percent of homeowners age 65 and older now have a mortgage on their primary residence, compared with just 21 percent in 1989, and the median outstanding debt has risen from $16,793 to $72,000, according to the Urban Institute. As more households enter retirement with more debt, a growing number will likely tap into their home as a source of income. Hurdles and challenges remain, however, and education will play an important role in fostering responsible use of reverse mortgage products.

Note on the Limitations of RMMI

To calculate the RMMI, an econometric tool is developed to estimate senior housing value, senior mortgage level, and senior equity using data gathered from various public resources such as American Community Survey (ACS), Federal Reserve Flow of Funds (Z.1), and FHFA housing price indexes (HPI). The RMMI is simply the senior equity level at time of measure relative to that of the base quarter in 2000.[1]  The main limitation of RMMI is non-consecutive data, such as census population. We use a smoothing approach to estimate data in between the observable periods and continue to look for ways to improve our methodology and find more robust data to improve the precision of the results. Until then, the RMMI and its relative metrics (values, mortgages, home equities) are best analyzed at a trending macro level, rather than at more granular levels, such as MSA.
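The index arithmetic described above can be sketched in a few lines. This is only an illustration: the dollar figures are made-up placeholders rather than RMMI data, the function names are hypothetical, and scaling the base quarter to 100 is an assumption about the index convention.

```python
def senior_home_equity(housing_value, mortgage_balance):
    """Aggregate senior home equity: home value minus mortgage debt."""
    return housing_value - mortgage_balance


def index_level(equity_now, equity_base, base_level=100.0):
    """Equity at the time of measure relative to the base quarter.

    base_level=100 is an assumed scaling convention.
    """
    return base_level * equity_now / equity_base


# Hypothetical aggregates in trillions of dollars (illustrative only).
base_equity = senior_home_equity(housing_value=3.0, mortgage_balance=1.0)
current_equity = senior_home_equity(housing_value=8.0, mortgage_balance=2.0)

rmmi_like = index_level(current_equity, base_equity)
print(f"Index level: {rmmi_like:.1f}")
```

Because the index is a ratio to the base quarter, it tracks the trend in senior equity rather than its dollar level.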


[1] There was a change in RMMI methodology in Q3 2015 mainly to calibrate senior homeowner population and senior housing values observed in 2013 American Community Survey (ACS).


Machine Learning Detects Model Validation Blind Spots

Machine learning represents the next frontier in model validation—particularly in the credit and prepayment modeling arena. Financial institutions employ numerous models to make predictions relating to MBS performance. Validating these models by assessing their predictions is of paramount importance, but even models that appear to perform well based upon summary statistics can have subsets of input (input subspaces) for which they tend to perform poorly. Isolating these “blind spots” can be challenging using conventional model validation techniques, but recently developed machine learning algorithms are making the job easier and the results more reliable. 

High-Error Subspace Visualization

RiskSpan’s modeling team has developed a statistical algorithm which identifies high-error subspaces and flags model outputs corresponding to inputs originating from these subspaces, indicating to model users that the results might be unreliable. An extension to this problem that we also address is whether migration of data points to more error-prone subspaces of the input space over time can be indicative of macroeconomic regime shifts and signal a need to re-estimate the model. This will aid in the prevention of declining model efficacy over time.

Due to the high-dimensional nature of the input spaces of many financial models, traditional statistical methods of partitioning data may prove inadequate. Using machine learning techniques, we have developed a more robust method of high-error subspace identification. We develop the algorithm using loan performance model data, but the method is adaptable to generic models.

Data Selection and Preparation

The dataset we use for our analysis is a random sample of the publicly available Freddie Mac Loan-Level Dataset. The entire dataset covers the monthly loan performance for loans originated from 1999 to 2016 (25.4 million fixed-rate mortgages). From this set, one million loans were randomly sampled. Features of this dataset include loan-to-value ratio, borrower debt-to-income ratio, borrower credit score, interest rate, and loan status, among others. We aggregate the monthly status vectors for each loan into a single vector which contains a loan status time series over the life of the loan within the historical period. This aggregated status vector is mapped to a value of 1 if the time series indicates the loan was ever 90 days delinquent within the first three years after its origination, representing a default, and 0 otherwise. This procedure results in 914,802 total records.
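The labeling rule described above might look roughly like the sketch below. It assumes each loan's status time series has already been reduced to a list of days past due per month since origination; that encoding, the function name, and the sample loans are hypothetical and not part of the Freddie Mac file layout.

```python
def default_flag(monthly_dpd, window_months=36, threshold_days=90):
    """Return 1 if the loan was ever `threshold_days` delinquent within the
    first `window_months` after origination, and 0 otherwise.

    monthly_dpd: days past due for each month since origination, in order
    (a hypothetical encoding of the loan status time series).
    """
    early_history = monthly_dpd[:window_months]
    return int(any(dpd >= threshold_days for dpd in early_history))


loans = {
    "A": [0] * 40,                            # always current
    "B": [0] * 10 + [30, 60, 90] + [0] * 20,  # reaches 90 days in month 13
    "C": [0] * 36 + [30, 60, 90],             # reaches 90 days after year three
}
labels = {loan_id: default_flag(history) for loan_id, history in loans.items()}
print(labels)
```

Loan C shows why the window matters: it eventually becomes 90 days delinquent, but not within the first three years, so it is labeled 0.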

Algorithm Framework

Using the prepared loan dataset, we estimate a logistic regression loan performance model. The data is sampled and partitioned into training and test datasets for clustering analysis. The model estimation and training data is taken from loans originating in the period from 1999 to 2007, while loans originating in the period from 2008 to 2016 are used for testing. Once the data has been partitioned into training and test sets, a clustering algorithm is run on the training data.

Two-Dimensional Visualization of Select Clusters

The clustering is evaluated based upon its ability to stratify the loan data into clusters that meaningfully identify regions of the input for which the model performs poorly. This requires the average model performance error associated with certain clusters to be substantially higher than the mean. After the training data is assigned to clusters, cluster-level error is computed for each cluster using the logistic regression model. Clusters with high error are flagged based upon a scoring scheme. Each loan in the test set is assigned to a cluster based upon its proximity to the training cluster centers. Loans in the test set that are assigned to flagged clusters are flagged, indicating that the loan comes from a region for which loan performance model predictions exhibit lower accuracy.
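A minimal sketch of the flagging step follows. The post does not specify the clustering algorithm or the scoring scheme, so this example makes two assumptions: cluster centers are already available from some clustering run, and a cluster is flagged when its accuracy falls more than one standard deviation below the mean cluster accuracy. All names and data points are hypothetical.

```python
import math
from statistics import mean, pstdev


def nearest_center(point, centers):
    """Index of the cluster center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def flag_clusters(assignments, correct, n_clusters):
    """Flag clusters whose accuracy is more than one standard deviation
    below the mean cluster accuracy (an assumed scoring rule).

    Assumes every cluster contains at least one training loan.
    """
    accuracy = []
    for c in range(n_clusters):
        hits = [ok for a, ok in zip(assignments, correct) if a == c]
        accuracy.append(sum(hits) / len(hits))
    cutoff = mean(accuracy) - pstdev(accuracy)
    flagged = {c for c, acc in enumerate(accuracy) if acc < cutoff}
    return flagged, accuracy


# Hypothetical cluster centers and training loans in a 2-D input space.
centers = [(0, 0), (5, 5), (10, 0)]
train_points = [(0, 1), (1, 0), (-1, 0), (0, -1),
                (5, 6), (6, 5), (4, 5), (5, 4),
                (10, 1), (11, 0), (9, 0), (10, -1)]
# 1 = the performance model classified the loan correctly, 0 = it did not.
correct = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1]

assignments = [nearest_center(p, centers) for p in train_points]
flagged, accuracy = flag_clusters(assignments, correct, len(centers))

# Test loans inherit the flag of the nearest training cluster center.
test_points = [(5.5, 4.5), (0.5, 0.5)]
test_flags = [nearest_center(p, centers) in flagged for p in test_points]
print(flagged, accuracy, test_flags)
```

The first test loan lands in the low-accuracy cluster and is flagged, signaling that the model's prediction for it deserves less trust.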

Algorithm Performance Analysis

The clustering algorithm successfully flagged high-error regions of the input space, with flagged test clusters exhibiting accuracy more than one standard deviation below the mean. The high errors associated with clusters flagged during model training were persistent over time, with flagged clusters in the test set having a model accuracy of just 38.7%, compared to an accuracy of 92.1% for unflagged clusters. Failure to address observed high-error clusters in the training set and migration of data to high-error subspaces led to substantially diminished model accuracy, with overall model accuracy dropping from 93.9% in the earlier period to 84.1% in the later period.

Training/Test Cluster Error Comparison

The nature of default misclassifications and the variables with the greatest impact on misclassification were also determined. Cluster FICO scores proved to be a strong indicator of cluster model prediction accuracy. While a relatively large proportion of loans in low-FICO clusters defaulted, the logistic regression model substantially overpredicted the number of defaults for these clusters, leading to a large number of Type I errors (inaccurate default predictions). Type II errors (inaccurate non-default predictions) constituted a smaller proportion of overall model error, and their impact was diminished even further when considering their magnitude relative to the number of true negative predictions (accurate non-default predictions), which are far greater in number than true positive predictions (accurate default predictions).

FICO vs. Cluster Accuracy

Conclusion

Our application of the subspace error identification algorithm to a loan performance model illustrates the dangers of using high-level summary statistics as the sole determinant of model efficacy and failure to consistently monitor the statistical profile of model input data over time. Often, more advanced statistical analysis is required to comprehensively understand model performance. The algorithm identified sets of loans for which the model was systematically misclassifying default status. These large-scale errors come at a high cost to financial institutions employing such models.

As an extension to this research into high error subspace detection, RiskSpan is currently developing machine learning analytics tools that can detect the root cause of systematic model errors and suggest ways to enhance predictive model performance by alleviating these errors.


Hands-On Machine Learning–Predicting Loan Delinquency

The ability of machine learning models to predict loan performance makes them particularly interesting to lenders and fixed-income investors. This expanded post provides an example of applying the machine learning process to a loan-level dataset in order to predict delinquency. The process includes variable selection, model selection, model evaluation, and model tuning.

The data used in this example are from the first quarter of 2005 and come from the publicly available Fannie Mae performance dataset. The data are segmented into two different sets: acquisition and performance. The acquisition dataset contains 217,000 loans (rows) and 25 variables (columns) collected at origination (Q1 2005). The performance dataset contains the same set of 217,000 loans coupled with 31 variables that are updated each month over the life of the loan. Because there are multiple records for each loan, the performance dataset contains approximately 16 million rows.

For this exercise, the problem is to build a model capable of predicting which loans will become severely delinquent, defined as falling behind six or more months on payments. This delinquency variable was calculated from the performance dataset for all loans and merged with the acquisition data based on the loan’s unique identifier. This brings the total number of variables to 26. Plenty of other hypotheses can be tested, but this analysis focuses on just this one.

1          Variable Selection

An overview of the dataset can be found below, showing the name of each variable as well as the number of observations available:

                                            Count
LOAN_IDENTIFIER                             217088
CHANNEL                                     217088
SELLER_NAME                                 217088
ORIGINAL_INTEREST_RATE                      217088
ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB)     217088
ORIGINAL_LOAN_TERM                          217088
ORIGINATION_DATE                            217088
FIRST_PAYMENT_DATE                          217088
ORIGINAL_LOAN-TO-VALUE_(LTV)                217088
ORIGINAL_COMBINED_LOAN-TO-VALUE_(CLTV)      217074
NUMBER_OF_BORROWERS                         217082
DEBT-TO-INCOME_RATIO_(DTI)                  201580
BORROWER_CREDIT_SCORE                       215114
FIRST-TIME_HOME_BUYER_INDICATOR             217088
LOAN_PURPOSE                                217088
PROPERTY_TYPE                               217088
NUMBER_OF_UNITS                             217088
OCCUPANCY_STATUS                            217088
PROPERTY_STATE                              217088
ZIP_(3-DIGIT)                               217088
MORTGAGE_INSURANCE_PERCENTAGE                34432
PRODUCT_TYPE                                217088
CO-BORROWER_CREDIT_SCORE                    100734
MORTGAGE_INSURANCE_TYPE                      34432
RELOCATION_MORTGAGE_INDICATOR               217088

Most of the variables in the dataset are fully populated, with the exception of DTI, MI Percentage, MI Type, and Co-Borrower Credit Score. Many options exist for dealing with missing variables, including dropping the rows that are missing, eliminating the variable, substituting with a value such as 0 or the mean, or using a model to fill the most likely value.

The following chart plots the frequency of the 34,000 MI Percentage values.

The distribution suggests a decent amount of variability. Most loans that have mortgage insurance are covered at 25%, but there are sizeable populations both above and below. Mortgage insurance is not required for the majority of borrowers, so it makes sense that this value would be missing for most loans.  In this context, it makes the most sense to substitute the missing values with 0, since 0% mortgage insurance is an accurate representation of the state of the loan. An alternative that could be considered is to turn the variable into a binary yes/no variable indicating if the loan has mortgage insurance, though this would result in a loss of information.

The next variable with a large number of missing values is Mortgage Insurance Type. Querying the dataset reveals that of the 34,400 loans that have mortgage insurance, 33,000 have type 1 (borrower-paid) insurance and the remaining 1,400 have type 2 (lender-paid) insurance. As with the MI Percentage variable, the blank values can be filled. This will change the variable to indicate if the loan has no insurance, type 1, or type 2.

The remaining variable with a significant number of missing values is Co-Borrower Credit Score, with approximately half of its values missing. Unlike MI Percentage, the context does not allow us to substitute missing values with zeroes. The distribution of both borrower and co-borrower credit score as well as their relationship can be found below.

As the plot demonstrates, borrower and co-borrower credit scores are correlated. Because of this, the removal of co-borrower credit score would only result in a minimal loss of information (especially within the context of this example). Most of the variance captured by co-borrower credit score is also captured in borrower credit score. Turning the co-borrower credit score into a binary yes/no ‘has co-borrower’ variable would not be of much use in this scenario as it would not differ significantly from the Number of Borrowers variable. Alternate strategies such as averaging borrower/co-borrower credit score might work, but for this example we will simply drop the variable.
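The three cleaning decisions discussed so far can be sketched as follows. The records are invented for illustration, the function name is hypothetical, and using 0 as the code for "no insurance" is an assumption; the post only says the blank MI Type values are filled.

```python
def clean_loan(record):
    """Apply the three cleaning decisions to one loan record (a dict)."""
    cleaned = dict(record)
    # A missing MI percentage means no mortgage insurance, so 0% is accurate.
    if cleaned["MORTGAGE_INSURANCE_PERCENTAGE"] is None:
        cleaned["MORTGAGE_INSURANCE_PERCENTAGE"] = 0.0
    # Assumed coding: 0 = no insurance, 1 = borrower paid, 2 = lender paid.
    if cleaned["MORTGAGE_INSURANCE_TYPE"] is None:
        cleaned["MORTGAGE_INSURANCE_TYPE"] = 0
    # Co-borrower credit score is dropped rather than imputed.
    cleaned.pop("CO-BORROWER_CREDIT_SCORE", None)
    return cleaned


# Hypothetical records; None marks a missing value.
loans = [
    {"LOAN_IDENTIFIER": "L1", "MORTGAGE_INSURANCE_PERCENTAGE": 25.0,
     "MORTGAGE_INSURANCE_TYPE": 1, "CO-BORROWER_CREDIT_SCORE": 710},
    {"LOAN_IDENTIFIER": "L2", "MORTGAGE_INSURANCE_PERCENTAGE": None,
     "MORTGAGE_INSURANCE_TYPE": None, "CO-BORROWER_CREDIT_SCORE": None},
]
cleaned_loans = [clean_loan(loan) for loan in loans]
print(cleaned_loans)
```

On the full dataset the same logic would more likely be expressed with a dataframe library, but the decisions are identical.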

In summary, the dataset is now smaller: Co-Borrower Credit Score has been dropped. Additionally, missing values for MI Percentage and MI Type have been filled in. Now that the data have been cleaned up, the values and distributions of the remaining variables can be examined to determine what additional preprocessing steps are required before model building. Scatter matrices of pairs of variables and distribution plots of individual variables along the diagonal can be found below. The scatter plots are helpful for identifying multicollinearity between pairs of variables, and the distributions can show if a variable lacks enough variance that it won’t contribute to model performance.

The third row of scatterplots, above, reflects a lack of variability in the distribution of Original Loan Term. The variance of 3.01 (calculated separately) is very small, and as a result the variable can be removed; it will not contribute to any model as there is very little information to learn from. This process of inspecting scatterplots and distributions is repeated for the remaining pairs of variables. The Number of Units variable suffers from the same issue and can also be removed.

2          Heatmaps and Pairwise Grids

Matrices of scatterplots are useful for looking at the relationships between variables. Another useful plot is a heatmap and pairwise grid of correlation coefficients. In the plot below a very strong correlation between Original LTV and Original CLTV is identified.

This multicollinearity can be problematic for both the interpretation of the relationship between the variables and delinquency as well as the actual performance of some models.  To combat this problem, we remove Original CLTV because Original LTV is a more accurate representation of the loan at origination. Loans in this population that were not refinanced kept their original LTV value as CLTV. If CLTV were included in the model it would introduce information not available at origination to the model. The problem of allowing unexpected additional information in a dataset introduces an issue known as leakage, which will bias the model.
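To make the correlation check concrete, here is a small sketch that computes pairwise Pearson correlation coefficients and lists the pairs above a cutoff. The six loans and the 0.9 cutoff are invented for illustration; they are not taken from the Fannie Mae data or from the heatmap in this post.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for six loans.
columns = {
    "ORIGINAL_LTV": [60, 70, 80, 80, 90, 95],
    "ORIGINAL_CLTV": [60, 70, 80, 85, 90, 95],
    "ORIGINAL_INTEREST_RATE": [5.5, 5.9, 5.6, 6.1, 5.7, 6.0],
}

threshold = 0.9  # assumed cutoff for "very strong" correlation
high_pairs = [
    (a, b)
    for a, b in combinations(columns, 2)
    if abs(pearson(columns[a], columns[b])) > threshold
]
print(high_pairs)
```

Only the LTV/CLTV pair exceeds the cutoff in this toy data, which mirrors the pattern the heatmap is described as showing.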

Now that the numeric variables have been inspected, the remaining categorical variables must be analyzed to ensure that the classes are not significantly unbalanced. Count plots and simple descriptive statistics can be used to identify categorical variables that are problematic. Two examples below show the count of loans by state and by seller.

Inspecting the remaining variables uncovers that Relocation Indicator (indicating a mortgage issued when an employer moves an employee) and Product Type (fixed vs. adjustable rate) must be removed as they are extremely unbalanced and do not contain any information that will help the models learn. We also removed first payment date and origination date, which were largely redundant. The final cleanup results in a dataset that contains the following columns:

LOAN_IDENTIFIER 
CHANNEL 
SELLER_NAME
ORIGINAL_INTEREST_RATE
ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB) 
ORIGINAL_LOAN-TO-VALUE_(LTV) 
NUMBER_OF_BORROWERS
DEBT-TO-INCOME_RATIO_(DTI) 
BORROWER_CREDIT_SCORE
FIRST-TIME_HOME_BUYER_INDICATOR 
LOAN_PURPOSE
PROPERTY_TYPE 
OCCUPANCY_STATUS 
PROPERTY_STATE
MORTGAGE_INSURANCE_PERCENTAGE 
MORTGAGE_INSURANCE_TYPE 
ZIP_(3-DIGIT)

The final two steps before model building are to standardize each of the numeric variables and turn each categorical variable into a series of dummy or indicator variables. Numeric variables are scaled to mean 0 and standard deviation 1 so that it is easier to compare variables that have a different scale (e.g. interest rate vs. LTV). Standardizing is also a requirement for many algorithms (e.g. principal component analysis).

Categorical variables are transformed by turning each value of the variable into its own yes/no feature. For example, Property State originally has 50 possible values, so it will be turned into 50 variables (e.g. Alabama yes/no, Alaska yes/no).  For categorical variables with many values this transformation will significantly increase the number of variables in the model.
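Both transformations are simple enough to sketch by hand. The helper names and sample values below are hypothetical, and the use of the population standard deviation is an assumption; a preprocessing library would normally do this work.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to mean 0 and standard deviation 1.

    Uses the population standard deviation (an assumed convention).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Turn one categorical column into a yes/no indicator per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


interest_rates = [5.5, 5.75, 6.0, 6.25]  # hypothetical numeric column
states = ["AL", "AK", "AL", "CA"]        # hypothetical categorical column

scaled_rates = standardize(interest_rates)
state_dummies = one_hot(states)
print(scaled_rates)
print(state_dummies[0])
```

One categorical column with three distinct values becomes three indicator columns, which is why the column count grows quickly for variables such as state or seller.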

After scaling and transforming the dataset, the final shape is 199,716 rows and 106 columns. The target variable—loan delinquency—has 186,094 ‘no’ values and 13,622 ‘yes’ values. The data are now ready to be used to build, evaluate, and tune machine learning models.

3          Model Selection

Because the target variable loan delinquency is binary (yes/no) the methods available will be classification machine learning models. There are many classification models, including but not limited to: neural networks, logistic regression, support vector machines, decision trees and nearest neighbors. It is always beneficial to seek out domain expertise when tackling a problem to learn best practices and reduce the number of model builds. For this example, two approaches will be tried—nearest neighbors and decision tree.

The first step is to split the dataset into two segments: training and testing. For this example, 40% of the data will be partitioned into the test set, and 60% will remain as the training set. The resulting segmentations are as follows:

1. 60% of the observations (as training set): X_train
2. The associated target (loan delinquency) for each observation in X_train: y_train
3. 40% of the observations (as test set): X_test
4. The targets associated with the test set: y_test

Data should be randomly shuffled before they are split, as datasets are often in some type of meaningful order. Once the data are segmented the model will first be exposed to the training data to begin learning.
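A shuffled 60/40 split can be sketched as below. The function name, the seed, and the toy observations are assumptions made for illustration; machine learning libraries ship an equivalent utility.

```python
import random


def shuffled_split(X, y, test_fraction=0.4, seed=0):
    """Shuffle the row order, then split into training and test sets."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Ten hypothetical observations; the target is 1 for the last three rows,
# the kind of ordering that makes shuffling necessary.
X = [[float(i)] for i in range(10)]
y = [0] * 7 + [1] * 3

X_train, X_test, y_train, y_test = shuffled_split(X, y)
print(len(X_train), len(X_test))
```

Shuffling indices rather than the rows themselves keeps each observation paired with its own target.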

4          K-Nearest Neighbors Classifier

Training a K-neighbors model requires the fitting of the model on X_train (variables) and y_train (target) training observations. Once the model is fit, a summary of the model hyperparameters is returned. Hyperparameters are model parameters not learned automatically but rather are selected by the model creator.

 

The K-neighbors algorithm searches for the closest (i.e., most similar) training examples for each test observation using a metric that calculates the distance between observations in high-dimensional space.  Once the nearest neighbors are identified, a predicted class label is generated as the class that is most prevalent in the neighbors. The biggest challenge with a K-neighbors classifier is choosing the number of neighbors to use. Another significant consideration is the type of distance metric to use.

To see more clearly how this method works, the 6 nearest neighbors of two random observations from the training set were selected, one that is a non-default (0 label) observation and one that is not.

Random delinquent observation: 28919 
Random non delinquent observation: 59504

The indices and Minkowski distances to the 6 nearest neighbors of the two random observations are found below. Unsurprisingly, the first nearest neighbor is always the observation itself, and the first distance is 0.

Indices of 6 nearest neighbors for obs. 28919: [28919 112677 88645 103919 27218 15512]
Distances to 6 nearest neighbors for obs. 28919: [0 0.703 0.842 0.883 0.973 1.011]

Indices of 6 nearest neighbors for obs. 59504: [59504 87483 25903 22212 96220 118043]
Distances to 6 nearest neighbors for obs. 59504: [0 0.873 1.185 1.186 1.464 1.488]

Recall that in order to make a classification prediction, the K-neighbors algorithm finds the nearest neighbors of each observation. Each neighbor is given a ‘vote’ via its class label, and the majority vote wins. Below are the labels (or votes) of either 0 (non-delinquent) or 1 (delinquent) for the 6 nearest neighbors of the random observations. Based on the voting below, the delinquent observation would be classified correctly as 3 of the 5 nearest neighbors (excluding itself) are also delinquent. The non-delinquent observation would also be classified correctly, with 4 of 5 neighbors voting non-delinquent.

Delinquency label of nearest neighbors- non delinquent observation: [0 1 0 0 0 0]
Delinquency label of nearest neighbors- delinquent observation: [1 0 1 1 0 1]
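The voting logic can be reproduced in a few lines. The first part applies a majority vote to the neighbor labels listed above (with each observation's own label removed); the second part is a toy end-to-end classifier on made-up two-dimensional points. The function names are hypothetical, and the Minkowski exponent p=2 (Euclidean distance) is an assumption about the metric settings.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common class label among the neighbors."""
    return Counter(labels).most_common(1)[0][0]


def minkowski(a, b, p=2):
    """Minkowski distance between two points; p=2 is Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(query, X_train, y_train, k=5, p=2):
    """Classify `query` by majority vote of its k nearest training points."""
    ranked = sorted(range(len(X_train)),
                    key=lambda i: minkowski(query, X_train[i], p))
    return majority_vote([y_train[i] for i in ranked[:k]])


# Neighbor labels from the listing above, excluding each observation itself.
delinquent_vote = majority_vote([0, 1, 1, 0, 1])
non_delinquent_vote = majority_vote([1, 0, 0, 0, 0])

# Toy training data (hypothetical): two well-separated groups.
X_toy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
y_toy = [0, 0, 0, 1, 1, 1]
toy_prediction = knn_predict((1.0, 1.1), X_toy, y_toy, k=3)
print(delinquent_vote, non_delinquent_vote, toy_prediction)
```

Using an odd number of voting neighbors avoids ties in a two-class problem.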

 

5          Tree-Based Classifier

Tree-based classifiers learn by segmenting the variable space into a number of distinct regions or nodes. This is accomplished via a process called recursive binary splitting. During this process observations are continuously split into two groups by selecting the variable and cutoff value that results in the highest node purity, where purity is defined as the measure of variance across the two classes. The two most popular purity metrics are the Gini index and cross entropy. A low value for these metrics indicates that the resulting node is pure and contains predominantly observations from the same class. Just like the nearest neighbor classifier, the decision tree classifier makes classification decisions by ‘votes’ from observations within each final node (known as the leaf node).
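A single step of recursive binary splitting can be sketched as follows: compute the Gini index for the two child nodes produced by each candidate cutoff and keep the cutoff with the lowest weighted value. The data are invented, the function names are hypothetical, and candidate cutoffs are taken directly from observed values for simplicity (library implementations typically use midpoints between them).

```python
def gini(labels):
    """Gini index of a node with binary (0/1) labels; 0 means a pure node."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the cutoff on one variable that minimizes the weighted Gini
    index of the two resulting child nodes."""
    best_cutoff, best_score = None, float("inf")
    for cutoff in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= cutoff]
        right = [l for v, l in zip(values, labels) if v > cutoff]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_cutoff, best_score = cutoff, score
    return best_cutoff, best_score


# Hypothetical standardized credit scores and delinquency labels.
credit_score = [-1.5, -0.9, -0.5, 0.1, 0.6, 1.2]
delinquent = [1, 1, 1, 0, 0, 0]

cutoff, score = best_split(credit_score, delinquent)
print(cutoff, score)
```

A full tree repeats this search over every variable at every node until a stopping rule such as a maximum depth is reached.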

To illustrate how this works, a decision tree was created with the number of splitting rules (max depth) limited to 5. An excerpt of this tree can be found below. All 120,000 training examples start together in the top box. From top to bottom, each box shows the variable and splitting rule applied to the observations, the value of the Gini metric, the number of observations the rule was applied to, and the current segmentation of the target variable. The first box indicates that the 6th variable (represented by the 5th index ‘X[5]’) Borrower Credit Score was used to split the training examples. Observations where the value of Borrower Credit Score was below or equal to -0.4413 follow the line to the box on the left. This box shows that 40,262 samples met the criteria. This box also holds the next splitting rule, also applied to the Borrower Credit Score variable. This process continues with X[2] (Original LTV) and so on until the tree is finished growing to its depth of 5. The final segments at the bottom of the tree are the aforementioned leaf nodes which are used to make classification decisions. When making a prediction on new observations, the same splitting rules are applied and the observation receives the label of the most commonly occurring class in its leaf node.

A more advanced tree-based classifier is the Random Forest Classifier. The Random Forest works by generating many individual trees, often hundreds or thousands. However, for each tree, the number of variables considered at each split is limited to a random subset. This helps reduce model variance and de-correlate the trees (since each tree will have a different set of available splitting choices). In our example, we fit a random forest classifier on the training data. The resulting hyperparameters and model documentation indicate that by default the model generates 10 trees, considers a random subset of variables the size of the square root of all variables (approximately 10 in this case), has no depth limitation, and only requires each leaf node to have 1 observation.

Since the random forest contains many trees and does not have a depth limitation, it is incredibly difficult to visualize. In order to better understand the model, a plot showing which variables were selected and resulted in the largest drop in the purity metric (Gini index) can be useful. Below are the top 10 most important variables in the model, ranked by the total (normalized) reduction to the Gini index. Intuitively, this plot can be described as showing which variables can be used to best segment the observations into groups that are predominantly one class, either delinquent or non-delinquent.

 

6          Model Evaluation

Now that the models have been fitted, their performance must be evaluated. To do this, the fitted model will first be used to generate predictions on the test set (X_test). Next, the predicted class labels are compared to the actual observed class label (y_test). Three of the most popular classification metrics that can be used to compare the predicted and actual values are recall, precision, and the f1-score. These metrics are calculated for each class, delinquent and not-delinquent.

Recall is calculated for each class as the ratio of events that were correctly predicted. More precisely, it is defined as the number of true positive predictions divided by the number of true positive predictions plus false negative predictions. For example, if the data had 10 delinquent observations and 7 were correctly predicted, recall for delinquent observations would be 7/10 or 70%.

Precision is the number of true positives divided by the number of true positives plus false positives. Precision can be thought of as the ratio of events correctly predicted to the total number of events predicted. In the hypothetical example above, assume that the model made a total of 14 predictions for the label delinquent. If so, then the precision for delinquent predictions would be 7/14 or 50%.

The f1 score is calculated as the harmonic mean of recall and precision: F1 = 2 × (Precision × Recall) / (Precision + Recall).
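The hypothetical example used above (10 delinquent observations, 14 delinquent predictions, 7 of them correct) can be checked directly. The counts translate to 7 true positives, 7 false positives, and 3 false negatives; the function name is an illustrative choice.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and f1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 7 correct delinquent predictions, 7 incorrect ones, 3 missed delinquencies.
precision, recall, f1 = precision_recall_f1(tp=7, fp=7, fn=3)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
# precision=0.50 recall=0.70 f1=0.583
```

Because it is a harmonic mean, the f1 score sits closer to the weaker of the two metrics than a simple average would.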

The classification reports for the K-neighbors and decision tree below show the precision, recall, and f1 scores for label 0 (non-delinquent) and 1 (delinquent).

 

There is no silver bullet for choosing a model; often it comes down to the goals of implementation. In this situation, the tradeoff between identifying more delinquent loans and misclassifying more non-delinquent ones can be analyzed with a specific tool called an ROC (receiver operating characteristic) curve. When the model predicts a class label, a probability threshold is used to make the decision. This threshold is set by default at 50%, so that observations with more than a 50% predicted probability of belonging to a class are assigned to that class and all others are assigned to the other class.

The majority vote (of the neighbor observations or the leaf node observations) determines the predicted label. ROC curves allow us to see the impact of varying this voting threshold by plotting the true positive prediction rate against the false positive prediction rate for each threshold value between 0% and 100%.

The area under the ROC curve (AUC) quantifies the model’s ability to distinguish between delinquent and non-delinquent observations.  A completely useless model will have an AUC of .5 as the probability for each event is equal. A perfect model will have an AUC of 1 as it is able to perfectly predict each class.
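To make the threshold sweep concrete, the sketch below computes the ROC points and the AUC by hand in plain Python. The labels and predicted probabilities are invented for illustration, and the helper names are assumptions rather than functions from the original analysis.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, one point per candidate threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sweep thresholds from high to low; starting above every score makes
    # the curve begin at (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    fpr, tpr = [], []
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, y_true))
        fp = sum(p and y == 0 for p, y in zip(predicted, y_true))
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Hypothetical labels (1 = delinquent) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]
fpr, tpr = roc_points(y_true, scores)
print("AUC:", auc(fpr, tpr))
```

For these made-up scores, 14 of the 16 delinquent/non-delinquent pairs are ranked in the right order, which is why the area works out to 0.875.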

To better illustrate, the ROC curves plotting the true positive and false positive rate on the held-out test set as the threshold is changed are plotted below.

7 Model Tuning

Up to this point the models have been built and evaluated using a single train/test split of the data. In practice this is often insufficient because a single split does not always provide the most robust estimate of the error on the test set. Additionally, there are more steps required for model tuning. To solve both of these problems it is common to train multiple instances of a model using cross validation. In K-fold cross validation, a portion of the training data that was first created is held out as a third dataset called the validation set. The model is trained on the remaining training data and then evaluated on the validation set. This process is repeated K times, each time holding out a different portion of the training set to validate against. Once the model has been tuned using the train/validation splits, it is tested against the held out test set just as before. As a general rule, once data have been used to make a decision about the model they should never be used for evaluation.
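The splitting logic can be sketched in a few lines of plain Python. This is an illustration only: it assumes the rows are already shuffled, and the function name is hypothetical rather than the routine used in the original analysis.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each fold is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every observation in the training data lands in the validation set exactly once, so the K validation scores can be averaged into a single, more stable estimate.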

8 K-Nearest Neighbors Tuning

Below a grid search approach is used to tune the K-nearest neighbors model. The first step is to define all of the possible hyperparameters to try in the model. For the KNN model, the list nk = [10, 50, 100, 150, 200, 250] specifies the number of nearest neighbors to try in each model. The list is used by the function GridSearchCV to build a series of models, each using a different value from nk. By default, GridSearchCV uses 3-fold cross validation in the scikit-learn release used here (the default became 5-fold in version 0.22). This means that the model will evaluate 3 train/validate splits of the data for each value in nk. Also specified in GridSearchCV is the scoring parameter used to evaluate each model. In this instance it is set to the metric discussed earlier, the area under the ROC curve. GridSearchCV will return the best performing model by default, which can then be used to generate predictions on the test set as before. Many more values of K could be specified to search through, and the default Minkowski distance could be replaced with a series of metrics to try. However, this comes at a cost of computation time that increases significantly with each added hyperparameter.

 

In the plot below the mean training and validation scores of the 3 cross-validated splits are plotted for each value of K. The plot indicates that for the lower values of K the model was overfitting the training data, resulting in lower validation scores. As K increases, the training score falls but the validation score rises because the model gets better at generalizing to unseen data.

9 Random Forest Tuning

There are many hyperparameters that can be adjusted to tune the random forest model. We use three in our example: n_estimators, max_features, and min_samples_leaf. First, n_estimators refers to the number of trees to be created. This value can be increased substantially, so the search space is set to the list estimators. Random Forests are generally very robust to overfitting, and it is not uncommon to train a classifier with more than 1,000 trees. Second, the number of variables to be randomly considered at each split can be tuned via max_features. Having a smaller value for the number of random features is helpful for decorrelating the trees in the forest, which is especially useful when multicollinearity is present. We tried a number of different values for max_features, which can be found in the list features. Finally, the number of observations required in each leaf node is tuned via the min_samples_leaf parameter and the list samples.

 

The resulting plot, below, shows a subset of the grid search results. Specifically, it shows the mean test score for each number of trees and leaf size when the number of random features considered at each split is limited to 5. The plot demonstrates that the best performance occurs with 500 trees and a requirement of at least 5 observations per leaf. To see the best performing model from the entire grid space, the best_estimator_ attribute of the fitted grid search object can be used.

By default, parameters of the best estimator are assigned to the GridSearch object (cvknc and cvrfc). This object can now be used to generate future predictions or predicted probabilities. In our example, the tuned models are used to generate predicted probabilities on the held out test set. The resulting ROC curves show an improvement in the KNN model from an AUC of .62 to .75. Likewise, the tuned Random Forest AUC improves from .64 to .77.

Predicting loan delinquency using only origination data is not an easy task. Presumably, if significant signal existed in the data it would trigger a change in strategy by MBS investors and ultimately origination practices. Nevertheless, this exercise demonstrates the capability of a machine learning approach to deconstruct such an intricate problem and suggests the appropriateness of using machine learning models to tackle these and other risk management data challenges relating to mortgages and a potentially wide range of asset classes.



Big Data in Small Dimensions: Machine Learning Methods for Data Visualization

Analysts and data scientists are constantly seeking new ways to parse increasingly intricate datasets, many of which are deemed “high dimensional”, i.e., contain many (sometimes hundreds or more) individual variables. Machine learning has recently emerged as one such technique due to its exceptional ability to process massive quantities of data. A particularly useful machine learning method is t-distributed stochastic neighbor embedding (t-SNE), used to summarize very high-dimensional data using comparatively few variables. T-SNE visualizations allow analysts to identify hidden structures that may have otherwise been missed.

Traditional Data Visualization

The first step in tackling any analytical problem is to develop a solid understanding of the dataset in question. This process often begins with calculating descriptive statistics that summarize useful characteristics of each variable, such as the mean and variance. Also critical to this pursuit is the use of data visualizations that can illustrate the relationships between observations and variables and can identify issues that must be corrected. For example, the chart below shows a series of pairwise plots between a set of variables taken from a loan-level dataset. Along the diagonal axis the distribution of each individual variable is plotted.

The plot above is useful for identifying pairs of variables that are highly correlated as well as variables that lack variance, such as original loan term. When dealing with a larger number of variables, heatmaps like the one below can summarize the relationships between the data in a compact way that is also visually intuitive.

The statistics and visualizations described so far are helpful for summarizing and identifying issues, but they often fall short in telling the entire narrative of the data. One issue that remains is a lack of understanding of the underlying structure of the data. Gaining this understanding is often key to selecting the best approach for problem solving.

Enhanced Data Visualization with Machine Learning

Humans can visualize observations plotted with up to three variables (dimensions), but with the exponential rise in data collection it is now abnormal to only be dealing with a handful of variables. Thankfully, there are new machine learning methods that can help overcome our limited capacity and deliver new insights never seen before.

T-SNE is a type of non-linear dimensionality reduction algorithm. While this is a mouthful, the idea behind it is straightforward: t-SNE takes data that exists in very high dimensions and produces a plot in two or three dimensions that can be observed. The plot in low dimensions is created in such a way that observations close to each other in high dimensions remain close together in low dimensions. Additionally, t-SNE has proven to be good at preserving both the global and local structures present within the data [1], which is of critical importance.

The full technical details of t-SNE are beyond the scope of this blog, but a simplified version of the steps for t-SNE is as follows:

  1. Compute the Euclidean distance between each pair of observations in high-dimensional space.
  2. Using a Gaussian distribution, convert the distance between each pair of observations into a probability that represents similarity between the points.
  3. Randomly place the observations into low-dimensional space (usually 2 or 3).
  4. Compute the distance and similarity (as in steps 1 and 2) for each pair of observations in the low-dimensional space. Crucially, in this step a Student t-distribution is used instead of a normal Gaussian.
  5. Using gradient based optimization, iteratively nudge the observations in the low-dimensional space in such a way that the probabilities between pairs of observations are as close as possible to the probabilities in high dimensions.
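Steps 1 through 4 can be sketched in plain Python on a toy dataset. This is a deliberately simplified illustration, not a full t-SNE implementation: it assumes a single fixed Gaussian bandwidth (sigma) for every point, whereas real implementations choose a bandwidth per point, and all function names are hypothetical. The Kullback-Leibler divergence at the end is the usual measure of mismatch that the gradient step in step 5 tries to reduce.

```python
import math
import random

def pairwise_sq_dists(points):
    """Step 1: squared Euclidean distance between every pair of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]

def gaussian_similarities(sq_dists, sigma=1.0):
    """Step 2: Gaussian kernel, normalised over all pairs with i != j."""
    n = len(sq_dists)
    w = [[math.exp(-sq_dists[i][j] / (2 * sigma ** 2)) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def student_t_similarities(sq_dists):
    """Step 4: heavy-tailed Student t kernel with one degree of freedom."""
    n = len(sq_dists)
    w = [[1.0 / (1.0 + sq_dists[i][j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def kl_divergence(p, q):
    """Mismatch between the high- and low-dimensional similarities."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0)

random.seed(0)
# Toy data: 6 observations with 10 variables each.
high_dim = [[random.gauss(0, 1) for _ in range(10)] for _ in range(6)]
# Step 3: random starting positions in two dimensions.
low_dim = [[random.gauss(0, 1e-2) for _ in range(2)] for _ in high_dim]

P = gaussian_similarities(pairwise_sq_dists(high_dim))
Q = student_t_similarities(pairwise_sq_dists(low_dim))
print("KL(P || Q) before any optimisation:", kl_divergence(P, Q))
```

The divergence is zero only when the two sets of similarities agree exactly, so driving it down moves the low-dimensional points toward an arrangement that mirrors the high-dimensional neighborhoods.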

Two key considerations are the use of the Student t-distribution in step four as opposed to the Gaussian in step two, and the random initialization of the data points in low dimensional space. The t-distribution is critical to the success of the algorithm for multiple reasons, but perhaps most importantly in that it allows clusters that initially start far apart to re-converge [2]. Given the random initialization of the points in low dimensional space, it is common practice to run the algorithm multiple times with the same parameters to observe the best mapping and ensure that the gradient descent optimization does not get stuck in a local minimum.

We applied t-SNE to a loan-level dataset comprised of approximately 40 variables. The loans are a random sample of originations from every quarter dating back to 1999. T-SNE was used to map the data into just three dimensions and the resulting plot was color-coded based on the year of origination.

In the interactive visualization below many clusters emerge. Rotating the figure reveals that some clusters are comprised predominantly of loans within similar origination years (groups of same-colored data points). Other clusters are less well-defined or contain a mix of origination years. Using this same method, we could choose to color loans with other information that we may wish to explore. For example, a mapping showing clusters related to delinquencies, foreclosure, or other credit loss events could prove tremendously insightful. For a given problem, using information from a plot such as this can enhance the understanding of the problem separability and enhance the analytical approach.

Crucial to the t-SNE mapping is a parameter set by the analyst called perplexity, which should be roughly equal to the number of expected nearby neighbors for each data point. Therefore, as the value of perplexity increases, the number of resulting clusters should generally decrease and vice versa. When implementing t-SNE, various perplexity parameters should be tried as the appropriate value is generally not known beforehand. The plot below was produced using the same dataset as before but with a larger value of perplexity. In this plot four distinct clusters emerge, and within each cluster loans of similar origination years group closely together.


How Buyouts Drive Ginnie Mae Prepayment Speeds

Because Ginnie Mae mortgage-backed securities are backed by the full faith and credit of the U.S. government, investors are not subject to credit losses. However, the potential for non-performing loan buyouts creates an additional layer of prepayment risk. As with any prepayment, investors receive the unpaid principal balance of the loan that goes through buyout. However, for all 30-year pass-throughs with 3% and higher coupons trading above par, any prepayment (due to a buyout or otherwise) represents a loss to the investor.

So how much of a concern are buyouts for investors?

Prepayments

Prepayments for Ginnie Mae MBS are comprised of a voluntary component (the conditional repayment rate, CRR) along with an involuntary portion (the conditional buyout rate or CBR). Since FHA and VA loans, the primary collateral backing Ginnie Mae MBS, typically behave differently, we analyze their performance separately. The analysis that follows is based on all 30-year FHA and VA loans originated since 2014 that are included in Ginnie Mae pools. The chart below illustrates the dramatic convergence in speeds relative to the end of 2016 when VA loans were paying 7% to 8% faster than FHA loans.

delinquencies and buyouts.PNG

Deconstructing the overall prepayment rate reveals that the convergence is due to both a narrowing of the CRR difference along with a spike in the CBR for FHA loans beginning in June of this year.

Serious delinquencies are a leading indicator of future buyouts. Comparing 90-day (or more) delinquencies as a percentage of the outstanding balance indicates a fairly consistent difference (54 bps on average) between FHA and VA loans, with both trending upward.

delinquencies and buyouts.PNG

Aging Effects

If we further stratify the loans based on vintage and look at the patterns as the loans age, will there be any material differences?

The 2014 vintage FHA cohort has performed poorly based on the buyout rate relative to the newer vintages. The 2016 vintage appears to be aging in a similar manner to the 2015 vintage while the early results for the 2017 cohort place it somewhere between the 2014 and 2015 vintages. All of the VA vintages have experienced fewer buyouts than their FHA counterparts. The 2016 VA cohort is the standout thus far followed by the 2015 and 2014 vintages. With only a few months of data to go on, the 2017 VA loans are outperforming the 2014 and 2015 loans, but are not as stellar as the 2016s.

delinquencies and buyouts.PNG

The patterns largely carry over to the 90-day or more delinquencies. 2014 vintage FHA loans generally show the highest serious delinquency percentage at any given age. However, the 2015 cohort has experienced a sharp uptick beginning at 27 months and, at an age of 31 months, exceeds the 2014 level. VA loans do not exhibit a meaningful difference among the vintages.

delinquencies and buyouts.PNG

Conclusion

Buyouts should be a consideration for Ginnie Mae investors, particularly for FHA loans. The analysis has shown that buyout rates are significantly higher for FHA loans relative to VA loans. With the CBR for FHA loans averaging 3.2x higher than the VA CBR over the last twelve months, buyout risk needs to be factored into the investment equation.


Back-Testing: Using RS Edge to Validate a Prepayment Model

Most asset-liability management (ALM) models contain an embedded prepayment model for residential mortgage loans. To gauge their accuracy, prepayment modelers typically run a back-test comparing model projections to the actual prepayment rates observed. A standard test is to run a portfolio of loans as of a year ago using the actual interest rates experienced during this time as well as any additional economic factors used by the model such as home price appreciation or the unemployment rate. This methodology isolates the model’s ability to estimate voluntary payoffs from its ability to forecast the economic variables.

The graph below was produced from such a back-test. The residential mortgage loans in the bank’s portfolio as of 10/31/2016 were run through the ALM model (projections) and compared with the observed speeds (actuals). It is apparent that the model did not do a particularly good job forecasting the actual CPRs, as the mean absolute error is 5.0%. Prepayment model validators typically prefer to see mean absolute error rates no higher than 1 to 2%.
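For readers who want to reproduce the error metric, the sketch below shows how a mean absolute error between projected and observed CPRs can be computed. The monthly figures are invented for illustration and are not the bank's data; they were simply chosen so that the result lands on the 5.0% quoted above.

```python
def mean_absolute_error(projected, actual):
    """Average absolute gap between projected and observed values."""
    if len(projected) != len(actual):
        raise ValueError("series must be the same length")
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(actual)

# Hypothetical monthly CPRs, in percent.
projected_cpr = [12.0, 13.5, 15.0, 14.0, 12.5, 11.0]
actual_cpr = [18.0, 19.0, 17.5, 20.0, 16.0, 17.5]

mae = mean_absolute_error(projected_cpr, actual_cpr)
print(f"MAE = {mae:.1f} percentage points")
```

Because the errors are taken in absolute value, months where the model runs too fast do not cancel out months where it runs too slow.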

Does this mean there is something unique with the bank’s loan portfolio or servicing practices that would cause prepays to deviate from expectations, or does the prepayment model require calibration?

Dissecting the Problem

One strategy is to compare the bank’s prepayment experience to that of the market (see below). The “market” is the universe of comparable loans, in this case residential, conventional loans. This assessment should indicate whether the bank’s portfolio is unique or if it behaves similarly to the market. Although this comparison looks better, there are still some material differences, especially at the beginning and end of the time series.

Examining the portfolio composition reveals a number of differences which could be the source of the discrepancy. For example:

  • Larger-balance loans have a greater refinance incentive.
  • California loans historically prepay faster than the rest of the country, while New York loans are historically slower.
  • Broker and correspondent loans typically pay faster than retail originations.

To compensate, the next step is to adjust the market portfolio to more closely mirror the attributes of the bank’s portfolio. Fine-tuning the “market” so that it better aligns with the bank’s channel and geographic breakout, as well as its larger average loan size, results in the following adjusted prepayment speeds.

Conclusion

Prepayments for the bank’s mortgage portfolio track the market speeds reasonably well with no adjustments. Compensating for the differences in composition related to channel, geography, and loan size tracks even better and results in a mean absolute error of only 1.1%. This indicates that there is nothing unique or idiosyncratic with the bank’s portfolio that would cause projections from a market-based prepayment model to deviate significantly from the observed speeds. Consequently, the ALM prepayment model likely needs adjustments to its tuning parameters to better capture the current environment.


Non-Qualified Mortgage Securitization Market

Since 2015, a new tier of the private-label residential mortgage-backed securities (PLS) market has emerged, with securities collateralized by non-qualified mortgage (non-QM) loans. These securities enable mortgage lenders to serve borrowers with non-traditional credit profiles.

The financial crisis ushered in a sharp reduction in mortgage credit available to certain groups of borrowers. Funding sources, such as the PLS market, which once provided access for borrowers with credit blemishes, non-traditional income sources, or the desire for expanded product features were virtually eliminated.

The limited issuance of private-label RMBS since the financial crisis has generally consisted of new origination jumbo “prime” mortgage loans. These securities have included loans that meet the “qualified mortgage” (QM) standard with strong credit scores, pristine payment history, and fully documented income and assets. The non-QM market addresses a previously underserved market and reflects the expanding credit policies of many institutions.

What is a Non-Qualified Mortgage Loan?

Since the crisis, standards governing the majority of mortgage loan production have generally followed the restrictive credit criteria implemented by the GSEs. This has prompted some consumers and lenders to seek alternative products that may not meet the “qualified mortgage” requirements or the high-credit-quality standards of the GSEs. These tightened credit standards have restricted home ownership opportunities for certain groups of consumers. These groups include self-employed individuals and borrowers with weaker credit or a recent credit event, such as a foreclosure, short sale, or deed in lieu of foreclosure. While many of these potential borrowers can meet the criteria of the ‘ability-to-repay’ rule and have taken steps to improve their credit standing, they nevertheless are not able to meet the very high credit standards that have emerged since the financial crisis.

To meet the demand of these underserved borrowers, a number of lenders have begun to expand their credit parameters. As lenders have sought funding sources for these non-QM originations, a new tier of the PLS market has emerged. While it is difficult to create generic categories that define the origination practices of the various lenders, some high-level similarities can be observed in the following non-QM products and programs established to meet borrower demand:

  • Alternative Documentation – the borrower’s income is assessed through sources other than available tax returns, business earnings, or Appendix Q requirements. Many non-QM lenders offer variations of bank statement programs (e.g., 24-month review and 12-month review) to determine a self-employed borrower’s ability to repay through analysis of their monthly cash flow.
  • Borrowers with Non-Standard Credit Profile
    • Expanded Credit – borrowers with weaker FICO scores, a recent delinquency on a mortgage, a debt-to-income ratio slightly above the qualified mortgage requirements, or higher loan-to-value ratios.
    • Prior Credit Event – borrowers with recent foreclosure, bankruptcy, or other loss mitigation disposition that have not met the seasoning requirements established by GSE guidelines.
  • Investor Program – financing for investors purchasing 1-4 family rental properties that may not meet GSE guidelines.
  • Foreign National Program – financing for borrowers that are not permanent residents or do not have credit history in the United States.
  • Non-QM Product Features – financing for products that do not meet qualified mortgage guidelines, such as loans with interest-only or balloon features.

Each of these programs evaluates many aspects of the loan during the underwriting process but primarily relies on an evaluation of the borrower’s ability to repay the loan to predict loan performance. These mortgage loan products and programs attempt to meet the housing finance needs of underserved borrowers while assessing the increased risk associated with the expanded lending standards.

Non-QM securities are likely to experience more performance volatility and higher realized losses than their jumbo prime counterparts in negative economic scenarios. This is due to weaker credit profiles among non-QM borrowers, product features that do not meet “qualified mortgage” requirements (e.g., interest-only, balloon payments, prepayment penalties), and alternative methods to assess the borrower’s ability-to-repay. Investors in these securities are challenged to assess the magnitude of the increased risk of loss (net of protection provided by credit enhancement levels) versus the incremental yield provided by the securities.

Overview of Non-Prime Issuers

The non-QM sector has been created and led by non-bank financial institutions that have filled the void left by regulated banking entities that have reduced their footprint in the mortgage market. Most financial institutions that have entered the non-QM mortgage space during the past five years have received financial backing from asset managers, hedge funds or private equity firms. Securitization activity for this sector of the PLS market began in 2015 and has increased slowly since. The table below reflects the strong growth in issuance activity for non-QM securitizations between January 2015 and September 2017:

Next Market Phase

The push by mortgage lenders to expand their credit criteria and provide consumers with “affordability” products combined with investor demand for higher yielding investments set the stage for the financial crisis of 2007-2008. Bolstered by strong demand from investors for mortgage-backed securities, mortgage lenders expanded underwriting guidelines to allow borrowers with weaker credit profiles, smaller down-payment amounts, and limited or no verification of income or assets to qualify for mortgages. Weakened underwriting standards were combined with product features that slowed repayment of principal through interest-only, negative amortization and loan term extension features.

History has shown that the combination of these credit guideline expansions with weaker PLS processes resulted in historic losses. As a reaction to the abysmal credit performance of mortgage loans originated between 2005 and 2007, credit availability in the mortgage market contracted dramatically. The swing of the credit pendulum resulted in significant improvement in the credit performance of loans originated since 2008. This improved performance, however, came at the cost of shutting a large segment of the population out of the mortgage market. Now, almost a decade later, the pendulum appears to be swinging back in favor of expanding credit criteria to accommodate more non-QM borrowers. Time will tell whether the market has learned and will remember the lessons of the financial crisis.


Tuning Machine Learning Models

Tuning is the process of maximizing a model’s performance without overfitting or introducing too much variance. In machine learning, this is accomplished by selecting appropriate “hyperparameters.”

Hyperparameters can be thought of as the “dials” or “knobs” of a machine learning model. Choosing an appropriate set of hyperparameters is crucial for model accuracy, but can be computationally challenging. Hyperparameters differ from other model parameters in that they are not learned by the model automatically through training methods. Instead, these parameters must be set manually. Many methods exist for selecting appropriate hyperparameters. This post focuses on three:

  • Grid Search
  • Random Search
  • Bayesian Optimization

Grid Search

Grid Search, also known as parameter sweeping, is one of the most basic and traditional methods of hyperparametric optimization. This method involves manually defining a subset of the hyperparametric space and exhausting all combinations of the specified hyperparameter subsets. Each combination’s performance is then evaluated, typically using cross-validation, and the best performing hyperparametric combination is chosen.

For example, say you have two continuous parameters α and β, where manually selected values for the parameters are the following:

equations.PNG

Then the pairing of the selected hyperparametric values, H, can take on any of the following:

Grid search will examine each pairing of α and β to determine the best performing combination. The resulting pairs, H, are simply the elements of the Cartesian product of the candidate values for α and β. While straightforward, this “brute force” approach for hyperparameter optimization has some drawbacks. Higher-dimensional hyperparametric spaces are far more time consuming to test than the simple two-dimensional problem presented here. Also, because there will always be a fixed number of training samples for any given model, the model’s predictive power will decrease as the number of dimensions increases. This is known as the Hughes phenomenon.
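The exhaustive sweep can be sketched with the standard library's itertools.product. The candidate values and the scoring function below are hypothetical stand-ins chosen for illustration; in practice the score would come from cross-validating a model fitted with each pairing.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Score every combination in `grid`; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values for the two hyperparameters.
grid = {"alpha": [0.1, 0.5, 1.0], "beta": [1, 10, 100]}

# Stand-in for a cross-validated score; it peaks at alpha=0.5, beta=10.
def toy_score(alpha, beta):
    return -((alpha - 0.5) ** 2) - ((beta - 10) / 100) ** 2

best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

With three candidates per hyperparameter the sweep evaluates 3 x 3 = 9 pairings; adding a third hyperparameter with three candidates would triple that to 27, which is the growth in cost described above.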

Random Search

Random search methods resemble grid search methods but tend to be less expensive and time consuming because they do not examine every possible combination of parameters. Instead of testing on a predetermined subset of hyperparameters, random search, as its name implies, randomly selects a chosen number of hyperparametric pairs from a given domain and tests only those. This greatly simplifies the analysis without significantly sacrificing optimization. For example, if the region of hyperparameters that are near optimal occupies at least 5% of the grid, then random search with 60 trials will find that region with high probability (95%).

$1 - (1 - 0.05)^{60} \approx 0.954 \geq 0.95$

To illustrate, imagine a 15 x 30 grid of two hyperparameter values and their resulting scores ranging from 0-10, where 10 is the most optimal hyperparametric pairing (Table 1).

Table 1 – Grid of Hyperparameter Values and Scores

Highlighted in green are the 21 pairings with the highest scores out of the 450 total combinations. Let’s take these 21 pairings to be our desired target range. What if we were to sample points from this grid to see if any lands within the target? Each random draw has a 21/450 ≈ 4.67% chance of doing so. If we randomly select 60 points, all independent of one another, then the probability that none of them land in the target, or in other words all of them miss, is
$(1 - 21/450)^{60} \approx 0.057$

Therefore, the probability that at least one of them succeeds in hitting the desired interval is 1 minus that quantity: $1 - (1 - 21/450)^{60} \approx 0.943$.

In this particular example, sampling just 60 points from our hyperparameter space yields over a 94% chance of selecting a hyperparameter value within our desired interval near the maximum value.  In other words, in a scenario with a 5% desired interval around the true maximum, sampling just 60 points will yield a sufficient hyperparameter pairing 95% of the time.
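The two probabilities quoted above are easy to verify numerically. The sketch below computes them directly and then cross-checks the grid example with a small simulation; it assumes the 60 draws are independent and made with replacement, and the function name is illustrative.

```python
import random

def hit_probability(target_fraction, n_draws):
    """P(at least one of n independent draws lands in the target region)."""
    return 1 - (1 - target_fraction) ** n_draws

p_example = hit_probability(21 / 450, 60)  # 21 target cells out of 450
p_five_pct = hit_probability(0.05, 60)     # a 5% target region
print(f"closed form: {p_example:.3f} (grid example), {p_five_pct:.3f} (5% region)")

# Monte Carlo cross-check: treat cells 0..20 of the 450 as the target.
random.seed(1)
trials = 20_000
hits = sum(
    any(random.randrange(450) < 21 for _ in range(60))
    for _ in range(trials)
)
print(f"simulated:   {hits / trials:.3f}")
```

The same function also shows how the budget scales: halving the number of draws to 30 still gives roughly a three-in-four chance of hitting the 21-cell target.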

There are two main benefits to using the random search method. The first is that a budget can be chosen independent of the number of parameters and possible values. Based on how much time and computing resources you have available, random search allows you to choose a sample size that conforms to a budget but still allows for a representative sample of the hyperparameter space. The second benefit is that adding parameters that do not influence performance does not decrease efficiency.

Bayesian Optimization

The idea behind Bayesian Optimization is fundamentally different from grid and random search. This process builds a probabilistic model for a given function and analyzes this model to make decisions about where to next evaluate the function. There are two main components under the Bayesian optimization framework.

  • A prior function that captures the behavior of the unknown objective function and an observation model that describes the data generation mechanism.
  • A loss function, or an acquisition function, that describes how optimal a sequence of queries is, usually taking the form of regret.

The most common selection for a prior function in Bayesian Optimization is the Gaussian process (GP) prior. This is a particular kind of statistical model where observations occur in a continuous domain. In a Gaussian process, every point in the defined continuous input space is associated with a normally distributed random variable. Additionally, every finite linear combination of those random variables has a multivariate normal distribution.

There are a number of options when choosing an acquisition function. Choosing an acquisition function requires choosing a trade-off in exploration of the entire search space vs. exploitation of current promising areas.

Probability of Improvement

One approach is to choose an improvement-based acquisition function, which favors points that are likely to improve upon an incumbent target. This strategy involves maximizing the probability of improving (PI) over the best current value. If using a Gaussian posterior distribution, this can be calculated as follows:

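$$a_{\mathrm{PI}}(x) = \Phi(\gamma(x))$$

Where,

$$\gamma(x) = \frac{f(x_{\mathrm{best}}) - \mu(x)}{\sigma(x)}$$

Here Φ is the standard normal cumulative distribution function, μ(x) and σ(x) are the posterior mean and standard deviation at x, and x_best is the best point evaluated so far.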

In each iteration, the probability of improving is maximized to select the next query point. Although the probability of improvement can perform very well when the target is known, using this method for an unknown target causes the PI to lose reliability.
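As a rough illustration, the probability of improvement can be computed with nothing more than the standard normal CDF. This sketch assumes the minimization convention (lower objective values are better) and takes the posterior mean and standard deviation at each candidate as given; the candidate values and function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, best):
    """P(f(x) < best) when f(x) ~ Normal(mu, sigma^2); minimization convention."""
    if sigma <= 0.0:
        # Degenerate posterior: the value at x is known exactly.
        return 1.0 if mu < best else 0.0
    gamma = (best - mu) / sigma  # standardized improvement
    return normal_cdf(gamma)

# Hypothetical posterior (mean, standard deviation) at three candidate points.
candidates = {"a": (0.50, 0.05), "b": (0.45, 0.20), "c": (0.60, 0.30)}
best_so_far = 0.48

scores = {name: probability_of_improvement(m, s, best_so_far)
          for name, (m, s) in candidates.items()}
next_point = max(scores, key=scores.get)  # candidate with the highest PI
```

Candidates "a" and "c" receive nearly the same score even though "c" is far more uncertain, because PI measures only the chance of improving, not the size of the improvement.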

Expected Improvement

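Another strategy is to maximize the expected improvement (EI) over the current best. Unlike the probability of improvement, the expected improvement also accounts for the size of the improvement. Assuming a Gaussian process, this can be calculated as follows:

$$a_{\mathrm{EI}}(x) = \sigma(x)\left[\gamma(x)\,\Phi(\gamma(x)) + \phi(\gamma(x))\right]$$

where γ(x) = (f(x_best) − μ(x)) / σ(x) is the standardized improvement, and Φ and φ are the standard normal cumulative distribution and density functions.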
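A minimal sketch of expected improvement follows, again assuming the minimization convention and a Gaussian posterior whose mean and standard deviation at each candidate are already available. The candidate values are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    """Standard normal density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expected_improvement(mu, sigma, best):
    """E[max(best - f(x), 0)] when f(x) ~ Normal(mu, sigma^2); minimization convention."""
    if sigma <= 0.0:
        # Degenerate posterior: the improvement is known exactly.
        return max(best - mu, 0.0)
    gamma = (best - mu) / sigma
    return sigma * (gamma * normal_cdf(gamma) + normal_pdf(gamma))

# Two hypothetical candidates with the same standardized improvement (gamma = -0.4)
# but very different posterior uncertainty.
best_so_far = 0.48
ei_narrow = expected_improvement(0.50, 0.05, best_so_far)
ei_wide = expected_improvement(0.60, 0.30, best_so_far)
```

Both candidates have the same probability of improvement, but the more uncertain one has the larger expected improvement because a favorable draw there could beat the incumbent by more.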

Gaussian Process Upper Confidence Bound

Another method exploits lower confidence bounds (upper confidence bounds when maximizing) to construct acquisition functions that minimize regret over the course of the optimization. This requires the user to define an additional tuning value, κ, which balances exploitation against exploration. The lower confidence bound (LCB) for a Gaussian process is defined as follows:

$$a_{\mathrm{LCB}}(x) = \mu(x) - \kappa\,\sigma(x)$$
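The effect of the tuning value can be sketched in a few lines. The example assumes minimization, so the next query is the candidate with the smallest bound; the name `kappa` and the candidate values are illustrative assumptions.

```python
def lower_confidence_bound(mu, sigma, kappa):
    """Posterior mean minus kappa posterior standard deviations."""
    return mu - kappa * sigma

# Hypothetical posterior (mean, standard deviation) at three candidate points.
candidates = {"a": (0.50, 0.05), "b": (0.45, 0.20), "c": (0.60, 0.30)}

def pick_next(kappa):
    """Return the candidate with the smallest lower confidence bound."""
    return min(candidates,
               key=lambda name: lower_confidence_bound(*candidates[name], kappa))

exploit = pick_next(0.0)  # kappa = 0 ignores uncertainty and follows the posterior mean
explore = pick_next(2.0)  # a larger kappa rewards uncertain regions
```

With `kappa` set to zero the rule chooses the lowest posterior mean; raising it shifts the choice toward the candidate whose outcome is least certain.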

There are a few limitations to consider when choosing Bayesian Optimization over other hyperparameter optimization methods. The power of the Gaussian process depends highly on the covariance function, and it is not always clear what the appropriate covariance function choice should be. Another factor to consider is that the function evaluation itself may involve a time-consuming optimization procedure. It’s important to find the best hyperparameters for your model, but in many cases, the complexity associated with finding the best hyperparameters using Bayesian Optimization may exceed the project’s established budget. If possible, one should always consider utilizing parallel computing when performing this technique to maximize computing resources and cut back on time.

Conclusion

Choosing an appropriate set of hyperparameters is crucial for machine learning model accuracy. We have discussed three different approaches for selecting hyperparameter values and the trade-offs associated with choosing one optimization method over another. Time, budget, and computing resources are all factors to consider when choosing a method. A small hyperparameter space and loose constraints on budget and computing resources may make grid search the best option. For larger hyperparameter spaces or tighter computing constraints, a simple random search with a sufficient sample size or a Bayesian optimization technique may be more appropriate.



Machine Learning Model Selection

Machine learning model selection is the second step of the machine learning process, following variable selection and data cleansing. Selecting the right machine learning model is a critical step, as a model that does not appropriately fit the data will yield inaccurate results. Model selection largely depends on the goal of the model – is the purpose to explore the relationship between the variables or to maximize predictive power? In this blog, we cover a few key concepts of machine learning model selection, including parametric vs. non-parametric models, key metrics for managing the variance-bias tradeoff, and an introduction to a few standard machine learning models.

Parametric vs. Non-Parametric Tradeoffs

One of the first choices to be made in the model selection process pertains to our assumption about the shape of the functional relationship between our explanatory variables (our given, or input, variables) and our response variable (the output that we want to predict). When we choose to assume the shape of our model, we are constructing a parametric model, and our problem reduces to estimating a set of measurable factors, known as parameters.[1] One of the most common assumptions is that the relationship is linear. While we can relax the linear assumption when necessary, we sometimes do not want to assume the shape of the function at all. Non-parametric models help to avoid the case where we incorrectly assume a function that does not match the data. However, a much larger number of observations must be obtained to make non-parametric methods effective, which can be costly or even infeasible.[2]

In addition to the fact that non-parametric methods are often not practical, there are other tradeoffs to take into consideration. One important tradeoff is between interpretability and flexibility. Since non-parametric models follow the data closely, they often result in abnormally shaped plots, which can be difficult to interpret. If the goal is to make sense of and model the relationship between the explanatory variable and the response, we may be willing to trade some predictive power for a parametric curve that is more understandable. If, however, we are comfortable constructing a “black-box” in hopes of maximizing the predictive power of the model, then non-parametric models may be suitable.

Another important tradeoff is that of variance versus bias. Variance, in the context of statistical learning, refers to the amount by which our prediction would change if we had used a different training dataset for our estimation. Bias refers to the error resulting from approximating a complex relationship by using a simplified representation of it. In general, more flexible (non-parametric) methods tend to have higher variance and lower bias, with the opposite being true of less flexible (parametric) models. Ideally though, we want a model that has low variance and low bias. To find it, we most frequently rely on three important tools: R-squared, residual standard error, and diagnostic plots.

R-Squared, Residual Standard Error, and Plots

R-squared (formally, the “coefficient of determination”) measures the proportion of variance in the response variable that is explained by the explanatory variables. Constrained between 0 and 1, a very low R-squared can indicate problems with model fit, while a very high R-squared can sometimes indicate overfitting. Residual standard error (RSE) estimates the standard deviation of the error term, that is, how far the observed responses typically fall from the fitted model. RSE depends on three quantities: the residual sum of squares (the variation in the data left unexplained after the regression has been run), the number of observations, and the number of explanatory variables.
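Both metrics follow directly from the residuals. The sketch below fits a one-variable least squares line and then computes R-squared as 1 − RSS/TSS and RSE as the square root of RSS/(n − p − 1), where n is the number of observations and p the number of explanatory variables. The data points are made up for illustration.

```python
import math

def fit_simple_ols(x, y):
    """Least squares intercept and slope for a single explanatory variable."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def r_squared_and_rse(y, fitted, n_predictors):
    """Return (R-squared, residual standard error) for observed and fitted values."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
    tss = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    r2 = 1.0 - rss / tss
    rse = math.sqrt(rss / (n - n_predictors - 1))
    return r2, rse

# Made-up data that is close to, but not exactly, linear.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

intercept, slope = fit_simple_ols(x, y)
fitted = [intercept + slope * xi for xi in x]
r2, rse = r_squared_and_rse(y, fitted, n_predictors=1)
```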

Graphical plots complement R-squared and RSE. Plots can be as simple as plotting the response variable against a single explanatory variable or against a fitted linear model. This can be useful for detecting non-linearity, but other plots have broader application.

One such plot is the residual plot, which plots the residuals (the differences between the observed responses and the fitted values) against the fitted values themselves. Patterns in residual plots can suggest a lack of model fit, perhaps due to non-constant variance or non-linearity in the data. Outliers and leverage points[3] can also be detected through standardized residual plots, normal Q-Q plots, and leverage/Cook’s distance plots.

Observing these diagnostic plots enables us to make decisions as to what functional form our variables should take. For instance, by taking a logarithmic function (a curved function) of our response variable, we can help to account for non-constant variance in our model, or a non-linear relationship with the explanatory variables. We can also relax the additive assumption in a linear model by adding multiplicative combinations of variables—a technique that helps to model a synergistic relationship between variables.

Machine Learning Models: Shrinkage Methods, Splines, and Decision Trees

Our goal is to determine the model with the highest probability of having realistically generated the data, and we have summarized above the most important metrics that can help us identify such a model. However, it is also important to be aware of several standard models—to know ahead of time which are likely to be most useful.

Shrinkage methods are an alternative to the standard linear model and most notably include ridge and lasso regressions. While these models are similar to ordinary least squares, they include a shrinkage “penalty” which shrinks the coefficients, as an increasing function of their magnitude, toward zero. Through adding this constraint, the model can offer a sizeable reduction in variance in exchange for a slight increase in bias. A tuning parameter (a coefficient on this penalty) can help us fine-tune the amount of variance we want to eliminate, as well as the bias we are willing to accept.[4]
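The effect of the penalty is easiest to see with a single, centered explanatory variable and no intercept, where both estimators have closed forms. This is a simplified sketch under those assumptions: `lam` is the tuning parameter, the ridge objective is taken to be RSS + lam·β², the lasso objective RSS + lam·|β|, and the data are made up.

```python
import math

def _sums(x, y):
    """Return (sum of x*x, sum of x*y) for a centered predictor."""
    return sum(xi * xi for xi in x), sum(xi * yi for xi, yi in zip(x, y))

def ols_coef(x, y):
    sxx, sxy = _sums(x, y)
    return sxy / sxx

def ridge_coef(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2."""
    sxx, sxy = _sums(x, y)
    return sxy / (sxx + lam)

def lasso_coef(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b| (soft thresholding)."""
    sxx, sxy = _sums(x, y)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Made-up, already-centered data.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.7]

coefs = {lam: (ridge_coef(x, y, lam), lasso_coef(x, y, lam)) for lam in (0.0, 5.0, 20.0, 40.0)}
```

With `lam` at zero both estimators match ordinary least squares. As `lam` grows, the ridge coefficient shrinks smoothly toward zero, while the lasso coefficient reaches exactly zero once the penalty is large enough.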

If we are looking for a model with more flexibility and predictive power, splines may be an avenue to explore. Splines introduce several “knots” into the model, creating a smooth, continuous line with many different slopes. Unsurprisingly, since splines are much more flexible than linear regression or shrinkage methods, they have a lower bias due to following the data more closely. They also do a better job than polynomial regressions, as they provide more consistent estimates.[5]
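To make the role of knots concrete, here is a sketch of the simplest case, a linear spline, evaluated from a truncated-line basis. It is continuous everywhere but changes slope at each knot; the cubic splines more often used in practice add further constraints so the curve is also smooth at the knots. The knot locations and coefficients are hypothetical, not fitted to data.

```python
def linear_spline(x, intercept, slope, knots, slope_changes):
    """Evaluate a piecewise-linear spline: the slope changes by `slope_changes[k]` at `knots[k]`."""
    value = intercept + slope * x
    for knot, change in zip(knots, slope_changes):
        # Each basis term is zero to the left of its knot, so continuity is preserved.
        value += change * max(0.0, x - knot)
    return value

# Hypothetical spline with knots at 2 and 5: slopes are 0.5, then 2.0, then -1.0.
knots = [2.0, 5.0]
slope_changes = [1.5, -3.0]

def f(x):
    return linear_spline(x, intercept=1.0, slope=0.5, knots=knots, slope_changes=slope_changes)

segment_slopes = [f(1.0) - f(0.0), f(4.0) - f(3.0), f(7.0) - f(6.0)]
```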

A third option is decision trees, which provide more flexibility, but are also highly interpretable due to the way they segment the problem into a hierarchical structure. The idea is to segment the set of possible values for the explanatory variables into a number of distinct regions and make the same prediction for each observation in a particular region. This is generally done using an algorithm to select the most meaningful way to segment the observations, then the next most, and so on. Once this iterative algorithm is complete, we are left with what is usually a complex, hierarchical tree-like structure that can be readily mapped into a highly intuitive visualization. Decision trees can be very useful for their interpretability, ability to model non-linear data, and arguably more realistic approach to modeling human decision-making.
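The first step of such an algorithm can be sketched as a search for the single split that most reduces the residual sum of squares. This is a simplified, one-variable illustration with made-up data; a full regression tree would apply the same search recursively within each resulting region.

```python
def best_split(x, y):
    """Return (threshold, left_mean, right_mean, rss) for the best single split on x.

    Returns None if every x value is identical and no split is possible.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Each region predicts its own mean; score the split by the remaining error.
        rss = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or rss < best[3]:
            best = (threshold, left_mean, right_mean, rss)
    return best

# Made-up data with an obvious jump between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
threshold, left_mean, right_mean, rss = best_split(x, y)
```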

Application to Finance and Mortgage Data

We can use machine learning to answer a wide variety of questions related to finance and mortgage data, but it is crucial to understand the model selection process. Strong domain knowledge can help considerably in knowing what assumptions would be plausible, but a knowledge of diagnostic metrics, as well as the different types of models, their strengths, and weaknesses, can help unlock insights and uncover the logic behind processes—especially when answering questions that have yet to be answered. Whether your goal is to identify which customers are most likely to default on a loan, determine the elasticity of demand for a certain type of loan, or cut out some of the noise in the data, a solid grounding in approaches to model selection can help significantly.

 

[1] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning (New York: Springer, 2013), 21-22.
[2] James, Witten, Hastie, and Tibshirani, 23.
[3] Outliers are Y values that are unusual given the explanatory variables. Leverage points are observations whose X values are unusual relative to the X values of the other observations.
[4] James, Witten, Hastie, and Tibshirani, 218.
[5] James, Witten, Hastie, and Tibshirani, 276.


End-User Computing Controls – Building an EUC Inventory

An accounting manager at a mid-sized bank recently wondered aloud to us how to approach implementing end-user computing (EUC) controls. She had recently become responsible for identifying and overseeing her institution’s unknown number of EUC applications and had obviously given a lot of thought to the types of applications that needed to be identified and what the review process ought to look like. She recognized that a comprehensive inventory would need to be built, but, like so many others in her position, was uncertain of how to go about it.

We reasoned together that her options fell into two broad categories—each of which has benefits and drawbacks.

The first category of inventory-building options we classified as a top-down approach. This begins with identifying all data contained in financial statements or mission-critical management reports and then working backward from there to identify every model, database, spreadsheet, or other application that is used to generate these reports. The second category is a bottom-up approach, which first identifies every single spreadsheet in use at the bank and then determines which of those rise to the level of EUCs and need to be formally and independently reviewed.

 

Top-Down EUC Inventory Building

The primary advantage of a top-down approach is the comfort of knowing that everything important has been accounted for. An EUC inventory that is built systematically by tracing every figure on every balance sheet, income statement, and footnote back to every spreadsheet that contributed to it is not likely to miss much. Top-down approaches have the added benefit of placing the EUC inventory coordinator firmly in control of the exercise because she knows precisely what she is looking for. “We’re forecasting $23 million in retail deposit runoff next month,” she might observe. “Someone needs to show me the system that generated that figure. And if it’s a spreadsheet, then it needs an EUC review.”

The downside is that this exercise usually turns out to be more complicated than it sounds. One problem with requests that begin with “Somebody needs to show me…” is that “somebody” can often be hard to track down. Also, “somebody” many times is “somebodies.” Individual financial statement line items are often supported by multiple spreadsheets, and those spreadsheets may have data-feed issues of their own. What begins looking like it should be a straightforward exercise quickly evolves into one of those dreaded “spaghetti bowl” problems where attempting to extract a single strand leads to a tangled mess. A single required line item—say, cash required for loan originations in the next 90 days—would likely require input from a half-dozen or more EUCs tracking everything from economic forecasts to pipeline reports for any number of different loan types and origination channels. Before long, the person in charge of end-user computing controls can begin to feel like she’s been placed in charge of auditing not just EUCs, but the entire bank.

 

Bottom-Up EUC Inventory Building

A more common means of building an EUC inventory is a bottom-up approach that identifies every spreadsheet on the network and then relies on a combination of manual and automated methods to sort them into one of three bins:

  1. Models (which have hopefully already been tagged and classified during a separate model-inventory-building process)
  2. Non-computational/non-relevant spreadsheets (spreadsheets that either contain data only and do not perform calculations or spreadsheets that do not contribute to a quantitative business purpose—e.g., leave schedules, org charts, and fantasy football standings)
  3. EUCs (pretty much everything that does not get filtered into the first two bins)
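The three bins above can be expressed as a simple triage rule. This is only a sketch: the record fields (`is_registered_model`, `formula_count`, `business_relevant`) are hypothetical attributes that a manual review or scanning tool might supply, not fields from any particular product.

```python
def triage(record):
    """Assign a spreadsheet record to one of the three inventory bins."""
    if record.get("is_registered_model", False):
        return "model"  # already covered by the model inventory
    no_calculations = record.get("formula_count", 0) == 0
    not_relevant = not record.get("business_relevant", True)
    if no_calculations or not_relevant:
        return "non-computational/non-relevant"
    return "EUC"  # everything not filtered into the first two bins

# Hypothetical records for illustration.
records = [
    {"name": "prepayment_model.xlsx", "is_registered_model": True, "formula_count": 900},
    {"name": "leave_schedule.xlsx", "formula_count": 12, "business_relevant": False},
    {"name": "rate_sheet_export.xlsx", "formula_count": 0},
    {"name": "deposit_runoff_forecast.xlsx", "formula_count": 340},
]
bins = {r["name"]: triage(r) for r in records}
```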

Identifying all the spreadsheets can be done manually or using an automated “discovery” tool. Even in the very smallest institutions, manual discovery is too big a job for a single person. Typically, individual business unit heads will be tasked with identifying all of the EUCs in use within their various realms and reporting them to a central EUC oversight coordinator. The advantage of this approach is that it enables non-EUC spreadsheets to be filtered out before they get to the central EUC oversight coordinator, which makes that person’s job easier. The disadvantage is that it is unlikely to capture every EUC. Business unit heads are incentivized to apply a sub-optimal set of criteria when determining whether a spreadsheet should be classified as an EUC. They are likely to overlook files that an impartial EUC coordinator might wish to review.

An automated discovery tool avoids this problem by grabbing everything (every spreadsheet in a given shared drive or folder structure) and then scanning and evaluating each file for formulas and levels of complexity that contribute to an EUC’s risk rating. Automated scanning tools have the dual benefit of enabling central EUC coordinators to peer into how individual business units are using spreadsheets and of removing the need to rely on the judgment of business unit heads to determine what is worthy of review. The downside is that, even with all the automated filtering discovery tools are capable of, they are likely to result in the “discovery” of a lot of spreadsheets that ultimately do not need to go through an EUC review. Paradoxically, the more automated the discovery process is, the more manual the winnowing needs to be.
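The "grab everything" step can be approximated with a recursive file scan. The sketch below only finds candidate files by extension and records their size; it does not inspect formulas or complexity, which a real discovery tool would do. The list of extensions is an assumption.

```python
from pathlib import Path

# Assumed set of spreadsheet extensions; adjust to the file types in use.
SPREADSHEET_SUFFIXES = {".xls", ".xlsx", ".xlsm", ".xlsb"}

def discover_spreadsheets(root):
    """Recursively list spreadsheet files under `root` with their sizes in bytes."""
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in SPREADSHEET_SUFFIXES:
            found.append({
                "path": str(path.relative_to(root)),
                "size_bytes": path.stat().st_size,
            })
    return sorted(found, key=lambda record: record["path"])
```

Calling `discover_spreadsheets` on a shared-drive folder returns one record per spreadsheet, which can then feed the manual winnowing described above.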

 

A Hybrid Approach to End-User Computing Controls

As with many things, the best solution probably lies somewhere in the middle—drawing from the benefits of both top-down and bottom-up approaches.

While a pure top-down approach is usually too involved to be practical on its own, elements of a top-down approach can enlighten and facilitate a bottom-up process. For example, a bottom-up process may identify several spreadsheets whose complexity and perceived importance to the departments that use them make them appear to be high-risk EUCs in need of review. However, a top-down review may reveal that these spreadsheets ultimately do not contribute to financial or enterprise-wide management reporting. It could be that the importance of some spreadsheets does not extend far enough beyond the business unit that owns them to require an independent review. Furthermore, being able to connect the dots between spreadsheets that are identified using a bottom-up approach and individual financial statement/management report entries can help ensure that all important entries are accounted for.

A hybrid approach—one that is informed both by an understanding of critical reporting items and a series of comprehensive, automated discovery scans—introduces the virtues of both methods and is most likely to yield an EUC inventory that is both comprehensive and aligned with an institution’s risk profile.

