RiskSpan, Author at RiskSpan

Big Data in Small Dimensions: Machine Learning Methods for Data Visualization

Analysts and data scientists are constantly seeking new ways to parse increasingly intricate datasets, many of which are deemed “high dimensional”, i.e., contain many (sometimes hundreds or more) individual variables. Machine learning has recently emerged as one such technique due to its exceptional ability to process massive quantities of data. A particularly useful machine learning method is t-distributed stochastic neighbor embedding (t-SNE), used to summarize very high-dimensional data using comparatively few variables. T-SNE visualizations allow analysts to identify hidden structures that may have otherwise been missed.

Traditional Data Visualization

The first step in tackling any analytical problem is to develop a solid understanding of the dataset in question. This process often begins with calculating descriptive statistics that summarize useful characteristics of each variable, such as the mean and variance. Also critical to this pursuit is the use of data visualizations that can illustrate the relationships between observations and variables and can identify issues that must be corrected. For example, the chart below shows a series of pairwise plots between a set of variables taken from a loan-level dataset. Along the diagonal axis the distribution of each individual variable is plotted.

The plot above is useful for identifying pairs of variables that are highly correlated as well as variables that lack variance, such as original loan term. When dealing with a larger number of variables, heatmaps like the one below can summarize the relationships between the data in a compact way that is also visually intuitive.

The statistics and visualizations described so far are helpful for summarizing and identifying issues, but they often fall short in telling the entire narrative of the data. One issue that remains is a lack of understanding of the underlying structure of the data. Gaining this understanding is often key to selecting the best approach for problem solving.

Enhanced Data Visualization with Machine Learning

Humans can visualize observations plotted with up to three variables (dimensions), but with the exponential rise in data collection it is now abnormal to only be dealing with a handful of variables. Thankfully, there are new machine learning methods that can help overcome our limited capacity and deliver new insights never seen before.

T-SNE is a type of non-linear dimensionality reduction algorithm. While this is a mouthful, the idea behind it is straightforward: t-SNE takes data that exists in very high dimensions and produces a plot in two or three dimensions that can be observed. The plot in low dimensions is created in such a way that observations close to each other in high dimensions remain close together in low dimensions. Additionally, t-SNE has proven to be good at preserving both the global and local structures present within the data¹, which is of critical importance.

The full technical details of t-SNE are beyond the scope of this blog, but a simplified version of the steps is as follows:

1. Compute the Euclidean distance between each pair of observations in high-dimensional space.
2. Using a Gaussian distribution, convert the distance between each pair of observations into a probability that represents similarity between the points.
3. Randomly place the observations into low-dimensional space (usually 2 or 3 dimensions).
4. Compute the distance and similarity (as in steps 1 and 2) for each pair of observations in the low-dimensional space. Crucially, in this step a Student t-distribution is used instead of a Gaussian.
5. Using gradient-based optimization, iteratively nudge the observations in the low-dimensional space in such a way that the probabilities between pairs of observations are as close as possible to the probabilities in high dimensions.

Two key considerations are the use of the Student t-distribution in step four as opposed to the Gaussian in step two, and the random initialization of the data points in low-dimensional space. The t-distribution is critical to the success of the algorithm for multiple reasons, but perhaps most importantly in that it allows clusters that initially start far apart to re-converge². Given the random initialization of the points in low-dimensional space, it is common practice to run the algorithm multiple times with the same parameters, keep the best mapping, and so reduce the risk that the gradient descent optimization has become stuck in a poor local minimum.
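The simplified steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production implementation: it uses a single fixed Gaussian bandwidth (`sigma`) instead of calibrating one per point from the perplexity, plain gradient descent without momentum, and a tiny made-up dataset of two well-separated groups. All function names are hypothetical.

```python
import math
import random


def pairwise_sq_dists(points):
    """Step 1: squared Euclidean distance between every pair of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def similarities(points, kernel):
    """Apply a kernel to pairwise distances and normalize to sum to 1."""
    d2 = pairwise_sq_dists(points)
    n = len(points)
    w = [[0.0 if i == j else kernel(d2[i][j]) for j in range(n)]
         for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def gaussian(d2, sigma=1.0):
    """Step 2 kernel. The fixed bandwidth is a simplification of this sketch."""
    return math.exp(-d2 / (2 * sigma ** 2))


def student_t(d2):
    """Step 4 kernel: Student t with one degree of freedom (heavy tails)."""
    return 1.0 / (1.0 + d2)


def kl_divergence(p, q):
    """Mismatch between the high- and low-dimensional similarities."""
    return sum(pij * math.log(pij / qij)
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0)


def tsne_sketch(points, dims=2, steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    n = len(points)
    p = similarities(points, gaussian)
    # Step 3: random initial placement in the low-dimensional space.
    y = [[rng.gauss(0, 0.01) for _ in range(dims)] for _ in range(n)]
    losses = []
    for _ in range(steps):
        q = similarities(y, student_t)
        losses.append(kl_divergence(p, q))
        d2 = pairwise_sq_dists(y)
        # Step 5: move each point against the gradient of the KL divergence.
        grads = []
        for i in range(n):
            g = [0.0] * dims
            for j in range(n):
                if i != j:
                    c = 4 * (p[i][j] - q[i][j]) / (1 + d2[i][j])
                    for k in range(dims):
                        g[k] += c * (y[i][k] - y[j][k])
            grads.append(g)
        y = [[y[i][k] - lr * grads[i][k] for k in range(dims)]
             for i in range(n)]
    return y, losses


# Made-up data: two groups of three observations in four dimensions.
data = [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0],
        [5, 5, 5, 5], [5.5, 5, 5, 5], [5, 5.5, 5, 5]]
embedding, losses = tsne_sketch(data)
print(f"KL divergence: start {losses[0]:.3f}, end {losses[-1]:.3f}")
```

Changing `seed` changes the random starting layout, which is why repeated runs with the same parameters can produce different maps.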

We applied t-SNE to a loan-level dataset comprised of approximately 40 variables. The loans are a random sample of originations from every quarter dating back to 1999. T-SNE was used to map the data into just three dimensions and the resulting plot was color-coded based on the year of origination.

In the interactive visualization below many clusters emerge. Rotating the figure reveals that some clusters are comprised predominantly of loans within similar origination years (groups of same-colored data points). Other clusters are less well-defined or contain a mix of origination years. Using this same method, we could choose to color loans with other information that we may wish to explore. For example, a mapping showing clusters related to delinquencies, foreclosure, or other credit loss events could prove tremendously insightful. For a given problem, using information from a plot such as this can enhance the understanding of the problem separability and enhance the analytical approach.

Crucial to the t-SNE mapping is a parameter set by the analyst called perplexity, which should be roughly equal to the number of expected nearby neighbors for each data point. Therefore, as the value of perplexity increases, the number of resulting clusters should generally decrease and vice versa. When implementing t-SNE, various perplexity parameters should be tried as the appropriate value is generally not known beforehand. The plot below was produced using the same dataset as before but with a larger value of perplexity. In this plot four distinct clusters emerge, and within each cluster loans of similar origination years group closely together.
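To make the "number of nearby neighbors" reading of perplexity concrete, the sketch below computes the perplexity of the Gaussian neighbor distribution around a single point, defined as 2 raised to the Shannon entropy of that distribution. The distances and bandwidths are invented for illustration, and the function name is hypothetical.

```python
import math


def perplexity(sq_distances, sigma):
    """Perplexity (2 ** entropy) of the Gaussian neighbor distribution."""
    weights = [math.exp(-d2 / (2 * sigma ** 2)) for d2 in sq_distances]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy


# Hypothetical point with 5 close neighbors and 20 distant ones
# (values are squared distances).
sq_distances = [1.0] * 5 + [10000.0] * 20

tight = perplexity(sq_distances, sigma=1.0)     # only the close neighbors count
wide = perplexity(sq_distances, sigma=1000.0)   # nearly every point counts
print(f"narrow bandwidth: {tight:.2f}, wide bandwidth: {wide:.2f}")
```

With a narrow bandwidth the perplexity is about 5, the number of close neighbors; with a very wide one it approaches 25, the total number of other points. Setting a target perplexity therefore amounts to choosing how many neighbors each point effectively "sees."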

How Buyouts Drive Ginnie Mae Prepayment Speeds

Because Ginnie Mae mortgage-backed securities are backed by the full faith and credit of the U.S. government, investors are not subject to credit losses. However, the potential for non-performing loan buyouts creates an additional layer of prepayment risk. As with any prepayment, investors receive the unpaid principal balance of the loan that goes through buyout. However, for all 30-year pass-throughs with 3% and higher coupons trading above par, any prepayment (due to a buyout or otherwise) represents a loss to the investor.

So how much of a concern are buyouts for investors?

Prepayments

Prepayments for Ginnie Mae MBS are comprised of a voluntary component (the conditional repayment rate, CRR) along with an involuntary portion (the conditional buyout rate or CBR). Since FHA and VA loans, the primary collateral backing Ginnie Mae MBS, typically behave differently, we analyze their performance separately. The analysis that follows is based on all 30-year FHA and VA loans originated since 2014 that are included in Ginnie Mae pools. The chart below illustrates the dramatic convergence in speeds relative to the end of 2016 when VA loans were paying 7% to 8% faster than FHA loans.

Deconstructing the overall prepayment rate reveals that the convergence is due to both a narrowing of the CRR difference along with a spike in the CBR for FHA loans beginning in June of this year.

Serious delinquencies are a leading indicator of future buyouts. Comparing 90-day-or-more delinquencies as a percentage of the outstanding balance indicates a fairly consistent difference (54 bps on average) between FHA and VA loans, with both trending upward.

Aging Effects

If we further stratify the loans based on vintage and look at the patterns as the loans age, will there be any material differences?

The 2014 vintage FHA cohort has performed poorly based on the buyout rate relative to the newer vintages. The 2016 vintage appears to be aging in a similar manner to the 2015 vintage while the early results for the 2017 cohort place it somewhere between the 2014 and 2015 vintages. All of the VA vintages have experienced fewer buyouts than their FHA counterparts. The 2016 VA cohort is the standout thus far followed by the 2015 and 2014 vintages. With only a few months of data to go on, the 2017 VA loans are outperforming the 2014 and 2015 loans, but are not as stellar as the 2016s.

The patterns largely carry over to the 90-day or more delinquencies. 2014 vintage FHA loans generally show the highest serious delinquency percentage at any given age. However, the 2015 cohort has experienced a sharp uptick beginning at 27 months and, at an age of 31 months, exceeds the 2014 level. VA loans do not exhibit a meaningful difference among the vintages.

Conclusion

Buyouts should be a consideration for Ginnie Mae investors, particularly for FHA loans. The analysis has shown that buyout rates are significantly higher for FHA loans than for VA loans. With the CBR for FHA loans averaging 3.2x higher than the VA CBR over the last twelve months, buyout risk needs to be factored into the investment equation.

Back-Testing: Using RS Edge to Validate a Prepayment Model

Most asset-liability management (ALM) models contain an embedded prepayment model for residential mortgage loans. To gauge their accuracy, prepayment modelers typically run a back-test comparing model projections to the actual prepayment rates observed. A standard test is to run a portfolio of loans as of a year ago using the actual interest rates experienced during this time as well as any additional economic factors used by the model such as home price appreciation or the unemployment rate. This methodology isolates the model’s ability to estimate voluntary payoffs from its ability to forecast the economic variables.

The graph below was produced from such a back-test. The residential mortgage loans in the bank’s portfolio as of 10/31/2016 were run through the ALM model (projections) and compared with the observed speeds (actuals). It is apparent that the model did not do a particularly good job forecasting the actual CPRs, as the mean absolute error is 5.0%. Prepayment model validators typically prefer to see mean absolute error rates no higher than 1 to 2%.
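For reference, the mean absolute error used in this kind of back-test is simply the average of the absolute gaps between projected and actual speeds. The sketch below uses invented monthly CPRs (in percent), not the bank's data, and a hypothetical function name.

```python
def mean_absolute_error(projected, actual):
    """Average absolute gap between projected and observed values."""
    if len(projected) != len(actual):
        raise ValueError("series must have the same length")
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(projected)


# Hypothetical monthly CPRs in percent, for illustration only.
projected_cpr = [8.0, 9.0, 10.0, 11.0]
actual_cpr = [10.0, 12.0, 13.0, 12.0]

mae = mean_absolute_error(projected_cpr, actual_cpr)
print(f"MAE: {mae:.2f} CPR")
```

Because the errors are averaged in absolute value, months where the model runs too fast cannot offset months where it runs too slow.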

Does this mean there is something unique with the bank’s loan portfolio or servicing practices that would cause prepays to deviate from expectations, or does the prepayment model require calibration?

Dissecting the Problem

One strategy is to compare the bank’s prepayment experience to that of the market (see below). The “market” is the universe of comparable loans, in this case residential, conventional loans. This assessment should indicate whether the bank’s portfolio is unique or if it behaves similarly to the market. Although this comparison looks better, there are still some material differences, especially at the beginning and end of the time series.

Examining the portfolio composition reveals a number of differences which could be the source of the discrepancy. For example:

Larger-balance loans have a greater refinance incentive.
California loans historically prepay faster than the rest of the country, while New York loans are historically slower.
Broker and correspondent loans typically pay faster than retail originations.

To compensate, the next step is to adjust the market portfolio to more closely mirror the attributes of the bank’s portfolio. Fine-tuning the “market” so that it better aligns with the bank’s channel and geographic breakout, as well as its larger average loan size, results in the following adjusted prepayment speeds.

Conclusion

Prepayments for the bank’s mortgage portfolio track the market speeds reasonably well with no adjustments. After compensating for the differences in composition related to channel, geography, and loan size, the adjusted market speeds track even better, with a mean absolute error of only 1.1%. This indicates that there is nothing unique or idiosyncratic with the bank’s portfolio that would cause projections from a market-based prepayment model to deviate significantly from the observed speeds. Consequently, the ALM prepayment model likely needs adjustments to its tuning parameters to better capture the current environment.

Non-Qualified Mortgage Securitization Market

Since 2015, a new tier of the private-label residential mortgage-backed securities (PLS) market has emerged, with securities collateralized by non-qualified mortgage (non-QM) loans. These securities enable mortgage lenders to serve borrowers with non-traditional credit profiles.

The financial crisis ushered in a sharp reduction in mortgage credit available to certain groups of borrowers. Funding sources, such as the PLS market, which once provided access for borrowers with credit blemishes, non-traditional income sources, or the desire for expanded product features, were virtually eliminated.

The limited issuance of private-label RMBS since the financial crisis has generally consisted of new origination jumbo “prime” mortgage loans. These securities have included loans that meet the “qualified mortgage” (QM) standard with strong credit scores, pristine payment history, and fully documented income and assets. The non-QM market addresses a previously underserved market and reflects the expanding credit policies of many institutions.

What is a Non-Qualified Mortgage Loan?

Since the crisis, standards governing the majority of mortgage loan production have generally followed the restrictive credit criteria implemented by the GSEs. This has prompted some consumers and lenders to seek alternative products that may not meet the “qualified mortgage” requirements or the high-credit-quality standards of the GSEs. These tightened credit standards have restricted home ownership opportunities for certain groups of consumers. These groups include self-employed individuals and borrowers with weaker credit or a recent credit event, such as a foreclosure, short sale, or deed in lieu of foreclosure. While many of these potential borrowers can meet the criteria of the ‘ability-to-repay’ rule and have taken steps to improve their credit standing, they nevertheless are not able to meet the very high credit standards that have emerged since the financial crisis.

To meet the demand of these underserved borrowers, a number of lenders have begun to expand their credit parameters. As lenders have sought funding sources for these non-QM originations, a new tier of the PLS market has emerged. While it is difficult to create generic categories that define the origination practices of the various lenders, some high-level similarities can be observed in the following non-QM products and programs established to meet borrower demand:

Alternative Documentation – the borrower’s income is assessed through sources other than available tax returns, business earnings, or Appendix Q requirements. Many non-QM lenders offer variations of bank statement programs (e.g., 24-month review and 12-month review) to determine a self-employed borrower’s ability to repay through analysis of their monthly cash flow.
Borrowers with Non-Standard Credit Profile
- Expanded Credit – borrowers with weaker FICO scores, a recent delinquency on a mortgage, a debt-to-income ratio slightly above the qualified mortgage requirements, or higher loan-to-value ratios.
- Prior Credit Event – borrowers with recent foreclosure, bankruptcy, or other loss mitigation disposition that have not met the seasoning requirements established by GSE guidelines.
Investor Program – financing for investors purchasing 1-4 family rental properties that may not meet GSE guidelines.
Foreign National Program – financing for borrowers that are not permanent residents or do not have credit history in the United States.
Non-QM Product Features – financing for products that do not meet qualified mortgage guidelines, such as loans with interest-only or balloon features.

Each of these programs evaluates many aspects of the loan during the underwriting process but primarily relies on an evaluation of the borrower’s ability to repay the loan to predict loan performance. These mortgage loan products and programs attempt to meet the housing finance needs of underserved borrowers while assessing the increased risk associated with the expanded lending standards.

Non-QM securities are likely to experience more performance volatility and higher realized losses than their jumbo prime counterparts in negative economic scenarios. This is due to weaker credit profiles among non-QM borrowers, product features that do not meet “qualified mortgage” requirements (e.g., interest-only, balloon payments, prepayment penalties), and alternative methods to assess the borrower’s ability-to-repay. Investors in these securities are challenged to assess the magnitude of the increased risk of loss (net of protection provided by credit enhancement levels) versus the incremental yield provided by the securities.

Overview of Non-Prime Issuers

The non-QM sector has been created and led by non-bank financial institutions that have filled the void left by regulated banking entities that have reduced their footprint in the mortgage market. Most financial institutions that have entered the non-QM mortgage space during the past five years have received financial backing from asset managers, hedge funds or private equity firms. Securitization activity for this sector of the PLS market began in 2015 and has increased slowly since. The table below reflects the strong growth in issuance activity for non-QM securitizations between January 2015 and September 2017:

Next Market Phase

The push by mortgage lenders to expand their credit criteria and provide consumers with “affordability” products combined with investor demand for higher yielding investments set the stage for the financial crisis of 2007-2008. Bolstered by strong demand from investors for mortgage-backed securities, mortgage lenders expanded underwriting guidelines to allow borrowers with weaker credit profiles, smaller down-payment amounts, and limited or no verification of income or assets to qualify for mortgages. Weakened underwriting standards were combined with product features that slowed repayment of principal through interest-only, negative amortization and loan term extension features.

History has shown that the combination of these credit guideline expansions with weaker PLS processes resulted in historic losses. As a reaction to the abysmal credit performance of mortgage loans originated between 2005 and 2007, credit availability in the mortgage market contracted dramatically. The swing of the credit pendulum resulted in significant improvement in the credit performance of loans originated since 2008. This improved performance, however, came at the cost of shutting a large segment of the population out of the mortgage market. Now, almost a decade later, the pendulum appears to be swinging back in favor of expanding credit criteria to accommodate more non-QM borrowers. Time will tell whether the market has learned and will remember the lessons of the financial crisis.

Tuning Machine Learning Models

Tuning is the process of maximizing a model’s performance without overfitting or introducing too much variance. In machine learning, this is accomplished by selecting appropriate “hyperparameters.”

Hyperparameters can be thought of as the “dials” or “knobs” of a machine learning model. Choosing an appropriate set of hyperparameters is crucial for model accuracy, but can be computationally challenging. Hyperparameters differ from other model parameters in that they are not learned by the model automatically through training methods. Instead, these parameters must be set manually. Many methods exist for selecting appropriate hyperparameters. This post focuses on three:

Grid Search
Random Search
Bayesian Optimization

Grid Search

Grid Search, also known as parameter sweeping, is one of the most basic and traditional methods of hyperparametric optimization. This method involves manually defining a subset of the hyperparametric space and exhausting all combinations of the specified hyperparameter subsets. Each combination’s performance is then evaluated, typically using cross-validation, and the best performing hyperparametric combination is chosen.

For example, say you have two continuous parameters α and β, where manually selected values for the parameters are the following:

Then the pairing of the selected hyperparametric values, H, can take on any of the following:

Grid search will examine each pairing of α and β to determine the best performing combination. The resulting pairs, H, are simply each output that results from taking the Cartesian product of α and β. While straightforward, this “brute force” approach for hyperparameter optimization has some drawbacks. Higher-dimensional hyperparametric spaces are far more time consuming to test than the simple two-dimensional problem presented here. Also, because there will always be a fixed number of training samples for any given model, the model’s predictive power will decrease as the number of dimensions increases. This is known as Hughes phenomenon.
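As a small illustration of the Cartesian-product idea, the sketch below enumerates every pairing of two candidate lists and keeps the best one. The candidate values and the `score` function are invented stand-ins (in practice the score would come from cross-validation); they are not the values used in the original example.

```python
from itertools import product

# Hypothetical candidate values for the two hyperparameters.
alpha_values = [0.01, 0.1, 1.0]
beta_values = [1, 2]


def score(alpha, beta):
    """Stand-in for a cross-validated score; highest at alpha=0.1, beta=2."""
    return -((alpha - 0.1) ** 2) - (beta - 2) ** 2


# H is the Cartesian product of the two candidate lists.
grid = list(product(alpha_values, beta_values))

# Grid search evaluates every pairing and keeps the best performer.
best = max(grid, key=lambda pair: score(*pair))
print(f"{len(grid)} pairings tested, best: {best}")
```

The number of pairings is the product of the list lengths, which is why adding hyperparameters or candidate values makes grid search expensive so quickly.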

Random Search

Random search methods resemble grid search methods but tend to be less expensive and time consuming because they do not examine every possible combination of parameters. Instead of testing on a predetermined subset of hyperparameters, random search, as its name implies, randomly selects a chosen number of hyperparametric pairs from a given domain and tests only those. This greatly simplifies the analysis without significantly sacrificing optimization. For example, if the region of hyperparameters that are near optimal occupies at least 5% of the grid, then random search with 60 trials will find that region with high probability (95%).

To illustrate, imagine a 15 x 30 grid of two hyperparameter values and their resulting scores ranging from 0-10, where 10 is the most optimal hyperparametric pairing (Table 1).

Table 1 – Grid of Hyperparameter Values and Scores

Highlighted in green are the 21 pairings with the highest scores out of the 450 total combinations. Let’s take these 21 pairings to be our desired target range. What if we were to sample points from this grid to see if any lands within the target? Each random draw has a 21/450 ≈ 4.67% chance of doing so. If we randomly select 60 points, all independent of one another, then the probability that none of them lands in the target, or in other words that all of them miss, is

$$\left(1 - \frac{21}{450}\right)^{60} \approx 0.057$$

Therefore, the probability that at least one of them succeeds in hitting the desired interval is 1 minus that quantity:

$$1 - \left(1 - \frac{21}{450}\right)^{60} \approx 0.943$$

In this particular example, sampling just 60 points from our hyperparameter space yields over a 94% chance of selecting a hyperparameter value within our desired interval near the maximum value. In other words, in a scenario with a 5% desired interval around the true maximum, sampling just 60 points will yield a sufficient hyperparameter pairing 95% of the time.
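The same arithmetic can be checked, and reused for other budgets, with a short calculation. The function names below are hypothetical; the only assumption is that draws are independent and each hits the target region with the same probability.

```python
import math


def hit_probability(target_fraction, n_trials):
    """Chance that at least one of n independent draws lands in the target."""
    return 1 - (1 - target_fraction) ** n_trials


def trials_needed(target_fraction, confidence):
    """Smallest number of draws whose hit probability reaches `confidence`."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - target_fraction))


table_example = hit_probability(21 / 450, 60)   # the 21-of-450 target above
five_percent = hit_probability(0.05, 60)        # a 5% target region
min_trials = trials_needed(0.05, 0.95)

print(f"21/450 target, 60 draws: {table_example:.3f}")
print(f"5% target, 60 draws:     {five_percent:.3f}")
print(f"draws needed for 95% with a 5% target: {min_trials}")
```

Note that the result depends only on the size of the target region and the number of draws, not on how many hyperparameters define the space.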

There are two main benefits to using the random search method. The first is that a budget can be chosen independent of the number of parameters and possible values. Based on how much time and computing resources you have available, random search allows you to choose a sample size that conforms to a budget but still allows for a representative sample of the hyperparameter space. The second benefit is that adding parameters that do not influence performance does not decrease efficiency.
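Both benefits show up in a minimal sketch of random search. Here the budget is fixed at 60 evaluations regardless of the search space, and one of the two sampled parameters (`noise`) has no effect on the score at all. The search space, the `score` function, and all names are invented for illustration.

```python
import random


def random_search(score, sample_params, budget, seed=0):
    """Evaluate `budget` random draws and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(budget):
        params = sample_params(rng)
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def sample_params(rng):
    # Hypothetical search space: "alpha" matters, "noise" does not.
    return {"alpha": rng.uniform(0.0, 1.0), "noise": rng.uniform(0.0, 1.0)}


def score(params):
    # Stand-in for a cross-validated score; the best value is 0 at alpha = 0.3.
    return -(params["alpha"] - 0.3) ** 2


best_params, best_score = random_search(score, sample_params, budget=60)
print(best_params, best_score)
```

Every one of the 60 draws tries a fresh value of `alpha`, so the irrelevant parameter costs nothing. A grid with the same budget would spend many of its evaluations repeating the same `alpha` values across different `noise` values.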

Bayesian Optimization

The idea behind Bayesian Optimization is fundamentally different from grid and random search. This process builds a probabilistic model for a given function and analyzes this model to make decisions about where to next evaluate the function. There are two main components under the Bayesian optimization framework.

A prior function that captures the behavior of the unknown objective function and an observation model that describes the data generation mechanism.
A loss function, or an acquisition function, that describes how optimal a sequence of queries is, usually taking the form of regret.

The most common selection for a prior function in Bayesian Optimization is the Gaussian process (GP) prior. This is a particular kind of statistical model where observations occur in a continuous domain. In a Gaussian process, every point in the defined continuous input space is associated with a normally distributed random variable. Additionally, every finite collection of those random variables has a multivariate normal distribution, which is equivalent to saying that every finite linear combination of them is normally distributed.

There are a number of options when choosing an acquisition function. Choosing an acquisition function requires choosing a trade-off in exploration of the entire search space vs. exploitation of current promising areas.

Probability of Improvement

One approach is to choose an improvement-based acquisition function, which favors points that are likely to improve upon an incumbent target. This strategy involves maximizing the probability of improving (PI) over the best current value. If using a Gaussian posterior distribution, this can be calculated as follows:

Where,

In each iteration, the probability of improving is maximized to select the next query point. Although the probability of improvement can perform very well when the target is known, using this method for an unknown target causes the PI to lose reliability.

Expected Improvement

Another strategy involves the case of attempting to maximize the expected improvement (EI) over the current best. Unlike the probability of improvement function, the expected improvement also incorporates the amount of improvement. Assuming a Gaussian process, this can be calculated as follows:

Gaussian Process Upper Confidence Bound

Another method exploits lower confidence bounds (upper confidence bounds when maximizing) to construct acquisition functions that minimize regret over the course of their optimization. This requires the user to define an additional tuning value, commonly denoted κ, that balances exploitation against exploration. This lower confidence bound (LCB) for a Gaussian process is defined as follows:
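As a hedged illustration, one common formulation of the three acquisition functions discussed above is sketched below in Python. It assumes the minimization convention: `best` is the lowest objective value observed so far, and `mu` and `sigma` are the Gaussian process posterior mean and standard deviation at a candidate point. With γ = (best − μ)/σ, PI is Φ(γ), EI is σ(γΦ(γ) + φ(γ)), and the LCB is μ − κσ, where Φ and φ are the standard normal CDF and density. The function names, the example posterior values, and the default `kappa=2.0` are assumptions made for this example.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mu, sigma, best):
    """Chance that the candidate beats (falls below) the incumbent."""
    return norm_cdf((best - mu) / sigma)


def expected_improvement(mu, sigma, best):
    """Average amount by which the candidate improves on the incumbent."""
    g = (best - mu) / sigma
    return sigma * (g * norm_cdf(g) + norm_pdf(g))


def lower_confidence_bound(mu, sigma, kappa=2.0):
    """Optimistic estimate of the objective; smaller is more attractive."""
    return mu - kappa * sigma


# Hypothetical posterior (mean, standard deviation) at two candidate points.
candidates = {"near_best": (0.9, 0.1), "uncertain": (1.2, 0.8)}
best = 1.0

for name, (mu, sigma) in candidates.items():
    print(name,
          round(probability_of_improvement(mu, sigma, best), 3),
          round(expected_improvement(mu, sigma, best), 3),
          round(lower_confidence_bound(mu, sigma), 3))
```

In this made-up case, PI prefers the `near_best` candidate because it is very likely to be slightly better, while EI and the LCB prefer the `uncertain` candidate because its potential improvement is larger. This is the exploration versus exploitation trade-off in miniature.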

There are a few limitations to consider when choosing Bayesian Optimization over other hyperparameter optimization methods. The power of the Gaussian process depends highly on the covariance function, and it is not always clear what the appropriate covariance function choice should be. Another factor to consider is that the function evaluation itself may involve a time-consuming optimization procedure. It’s important to find the best hyperparameters for your model, but in many cases, the complexity associated with finding the best hyperparameters using Bayesian Optimization may exceed the project’s established budget. If possible, one should always consider utilizing parallel computing when performing this technique to maximize computing resources and cut back on time.

Conclusion

Choosing an appropriate set of hyperparameters is crucial for machine learning model accuracy. We have discussed three different approaches for selecting hyperparameter values and the trade-offs associated with choosing one optimization method over another. Time, budget, and computing abilities are all factors to consider when choosing a method. Small hyperparameter spaces and loose constraints on budget and computing resources may make Grid Search the best option. For larger hyperparameter spaces or tighter computing constraints, a simple random search with a sufficient sample size or a Bayesian optimization technique may be more appropriate.

Machine Learning Model Selection

Machine learning model selection is the second step of the machine learning process, following variable selection and data cleansing. Selecting the right machine learning model is a critical step, as a model which does not appropriately fit the data will yield inaccurate results. Model selection largely depends on the goal of the model – is the purpose to explore the relationship between the variables or to maximize predictive power? In this blog, we cover a few key concepts of machine learning model selection, including parametric vs. non-parametric models, key metrics for managing the variance-bias tradeoff, and an introduction to a few standard machine learning models.

Parametric vs. Non-Parametric Tradeoffs

One of the first choices to be made in the model selection process pertains to our assumption about the shape of the functional relationship between our explanatory variables (our given, or input, variables) and our response variable (the output that we want to predict). When we choose to assume the shape of our model, we are constructing a parametric model, and our problem reduces to estimating a set of measurable factors, known as parameters.¹ One of the most common assumptions is that the data is linear. While we can relax the linear assumption when necessary, we sometimes do not want to assume the shape of the function at all. Non-parametric models help to avoid the case where we incorrectly assume a function that does not match the data. However, a much larger number of observations must be obtained to make non-parametric methods effective, which can be costly or even infeasible.²

In addition to the fact that non-parametric methods are often not practical, there are other tradeoffs to take into consideration. One important tradeoff is between interpretability and flexibility. Since non-parametric models follow the data closely, they often result in abnormally shaped plots, which can be difficult to interpret. If the goal is to make sense of and model the relationship between the explanatory variable and the response, we may be willing to trade some predictive power for a parametric curve that is more understandable. If, however, we are comfortable constructing a “black-box” in hopes of maximizing the predictive power of the model, then non-parametric models may be suitable.

Another important tradeoff is that of variance versus bias. Variance, in the context of statistical learning, refers to the amount by which our prediction would change if we had used a different training dataset for our estimation. Bias refers to the error resulting from approximating a complex relationship by using a simplified representation of it. In general, more flexible (non-parametric) methods tend to have higher variance and lower bias, with the opposite being true of less flexible (parametric) models. Ideally though, we want a model that has low variance and low bias. To find it, we most frequently rely on three important tools: R-squared, residual standard error, and diagnostic plots.

R-Squared, Residual Standard Error, and Plots

R-squared, formally the “coefficient of determination,” measures the proportion of variance in the response variable that is explained by the explanatory variables. Constrained between 0 and 1, a very low R-squared can indicate problems with model fit, while a very high R-squared can sometimes indicate overfitting. Residual standard error (RSE) estimates the standard deviation of the model’s error term, that is, the typical size of a residual. RSE depends on the residual sum of squares (the variation in the data left unexplained after the regression has been run), the number of observations, and the number of explanatory variables.
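Both metrics follow directly from the residual sum of squares (RSS) and the total sum of squares (TSS): R-squared is 1 − RSS/TSS, and RSE is the square root of RSS/(n − p − 1) for n observations and p explanatory variables. The sketch below fits a one-variable least-squares line to invented data and computes both; the function names and the data are assumptions for illustration.

```python
import math


def fit_simple_linear(x, y):
    """Ordinary least squares for a single explanatory variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared_and_rse(x, y, n_predictors=1):
    """Return (R-squared, residual standard error) for the fitted line."""
    intercept, slope = fit_simple_linear(x, y)
    n = len(x)
    mean_y = sum(y) / n
    fitted = [intercept + slope * xi for xi in x]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained
    tss = sum((yi - mean_y) ** 2 for yi in y)                # total
    return 1 - rss / tss, math.sqrt(rss / (n - n_predictors - 1))


# Hypothetical, nearly linear data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2, rse = r_squared_and_rse(x, y)
print(f"R-squared: {r2:.4f}, RSE: {rse:.3f}")
```

R-squared is unitless, whereas RSE is expressed in the units of the response variable, which often makes it easier to judge whether the fit is good enough for the problem at hand.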

Graphical plots complement R-squared and RSE. Plots can be as simple as plotting the response variable against a single explanatory variable or against a fitted linear model. This can be useful for detecting non-linearity, but other plots have broader application.

One such plot is the residual plot, which plots the residuals (the differences between the observed responses and the fitted values) against the fitted values themselves. Patterns in residual plots can suggest a lack of model fit, perhaps due to non-constant variance or non-linearity in the data. Outliers and leverage points³ can also be detected through standardized residual plots, normal Q-Q plots, and leverage/Cook’s distance plots.

Observing these diagnostic plots enables us to make decisions as to what functional form our variables should take. For instance, by taking a logarithmic function (a curved function) of our response variable, we can help to account for non-constant variance in our model, or a non-linear relationship with the explanatory variables. We can also relax the additive assumption in a linear model by adding multiplicative combinations of variables—a technique that helps to model a synergistic relationship between variables.

Machine Learning Models: Shrinkage Methods, Splines, and Decision Trees

Our goal is to determine the model with the highest probability of having realistically generated the data, and we have summarized above the most important metrics that can help us identify such a model. However, it is also important to be aware of several standard models—to know ahead of time which are likely to be most useful.

Shrinkage methods are an alternative to the standard linear model and most notably include ridge and lasso regressions. While these models are similar to ordinary least squares, they include a shrinkage “penalty” which shrinks the coefficients, as an increasing function of their magnitude, toward zero. Through adding this constraint, the model can offer a sizeable reduction in variance in exchange for a slight increase in bias. A tuning parameter—a coefficient on this penalty—can help us fine-tune the amount of variance we want to eliminate, as well as bias we are willing to accept.⁴

If we are looking for a model with more flexibility and predictive power, splines may be an avenue to explore. Splines introduce several “knots” into the model, creating a smooth, continuous line with many different slopes. Unsurprisingly, since splines are much more flexible than linear regression or shrinkage methods, they have a lower bias due to following the data more closely. They also do a better job than polynomial regressions, as they provide more consistent estimates.⁵

A third option is decision trees, which provide more flexibility, but are also highly interpretable due to the way they segment the problem into a hierarchical structure. The idea is to segment the set of possible values for the random variables into a distinct number of regions and make the same prediction for each observation in a particular region. This is generally done using an algorithm to select the most meaningful way to segment the observations, then the next most, and so on. Once this iterative algorithm is complete, we are left with what is usually a complex, hierarchical tree-like structure that can be readily mapped into a highly intuitive visualization. Decision trees can be very useful for their interpretability, ability to model non-linear data, and arguably more realistic approach to modeling human decision-making.
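To make the idea of segmenting into regions concrete, here is a minimal sketch of the first step of a regression tree: searching for the single split on one variable that minimizes the residual sum of squares, then predicting the mean response in each region. The credit scores and default rates are hypothetical, and a full tree would repeat this search recursively within each region.

```python
def best_split(x, y):
    """Return (threshold, left_mean, right_mean, rss) for the best single split."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Every observation in a region receives that region's mean as its prediction
        rss = (sum((yi - left_mean) ** 2 for yi in left)
               + sum((yi - right_mean) ** 2 for yi in right))
        if best is None or rss < best[3]:
            best = (threshold, left_mean, right_mean, rss)
    return best

# Hypothetical credit scores and observed default rates
scores = [620, 640, 660, 700, 720, 740]
default_rates = [0.30, 0.28, 0.31, 0.10, 0.09, 0.11]
threshold, low_score_mean, high_score_mean, rss = best_split(scores, default_rates)
print(threshold, round(low_score_mean, 3), round(high_score_mean, 3))
```

On this toy data the search settles on a threshold of 680, separating the high-default group from the low-default group.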

Application to Finance and Mortgage Data

We can use machine learning to answer a wide variety of questions related to finance and mortgage data, but it is crucial to understand the model selection process. Strong domain knowledge can help considerably in knowing what assumptions would be plausible, but a knowledge of diagnostic metrics, as well as the different types of models, their strengths, and weaknesses, can help unlock insights and uncover the logic behind processes—especially when answering questions that have yet to be answered. Whether your goal is to identify which customers are most likely to default on a loan, determine the elasticity of demand for a certain type of loan, or cut out some of the noise in the data, a solid grounding in approaches to model selection can help significantly.

[1] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, Introduction to Statistical Learning (New York: Springer, 2013), 21-22.
[2] James, Witten, Hastie, and Tibshirani, 23.
[3] Outliers are observations whose Y values are unusual given their explanatory variables. Leverage points are observations whose X values are unusual relative to the rest of the data.
[4] James, Witten, Hastie, and Tibshirani, 218.
[5] James, Witten, Hastie, and Tibshirani, 276.

End-User Computing Controls – Building an EUC Inventory

An accounting manager at a mid-sized bank recently wondered aloud to us how to approach implementing end-user computing controls (EUC). She had recently become responsible for identifying and overseeing her institution’s unknown number of EUC applications and had obviously given a lot of thought to the types of applications that needed to be identified and what the review process ought to look like. She recognized that a comprehensive inventory would need to be built, but, like so many others in her position, was uncertain of how to go about it.

We reasoned together that her options fell into two broad categories—each of which has benefits and drawbacks.

The first category of inventory-building options we classified as a top-down approach. This begins with identifying all data contained in financial statements or mission-critical management reports and then working backward from there to identify every model, database, spreadsheet, or other application that is used to generate these reports. The second category is a bottom-up approach, which first identifies every single spreadsheet in use at the bank and then determines which of those rise to the level of EUCs and need to be formally and independently reviewed.

Top-Down EUC Inventory Building

The primary advantage of a top-down approach is the comfort of knowing that everything important has been accounted for. An EUC inventory that is built systematically by tracing every figure on every balance sheet, income statement, and footnote back to every spreadsheet that contributed to it is not likely to miss much. Top-down approaches have the added benefit of placing the EUC inventory coordinator firmly in control of the exercise because she knows precisely what she is looking for. “We’re forecasting $23 million in retail deposit runoff next month,” she might observe. “Someone needs to show me the system that generated that figure. And if it’s a spreadsheet, then it needs an EUC review.”

The downside is that this exercise usually turns out to be more complicated than it sounds. One problem with requests that begin with “Somebody needs to show me…” is that “somebody” can often be hard to track down. Also, “somebody” many times is “somebodies.” Individual financial statement line items are often supported by multiple spreadsheets, and those spreadsheets may have data-feed issues of their own. What begins looking like it should be a straightforward exercise quickly evolves into one of those dreaded “spaghetti bowl” problems where attempting to extract a single strand leads to a tangled mess. A single required line item—say, cash required for loan originations in the next 90 days—would likely require input from a half-dozen or more EUCs tracking everything from economic forecasts to pipeline reports for any number of different loan types and origination channels. Before long, the person in charge of end-user computing controls can begin to feel like she’s been placed in charge of auditing not just EUCs, but the entire bank.

Bottom-Up EUC Inventory Building

A more common means of building an EUC inventory is a bottom-up approach that identifies every spreadsheet on the network and then relies on a combination of manual and automated methods to sort them into one of three bins:

Models (which have hopefully already been tagged and classified during a separate model-inventory-building process)
Non-computational/non-relevant spreadsheets (spreadsheets that either contain data only and do not perform calculations or spreadsheets that do not contribute to a quantitative business purpose—e.g., leave schedules, org charts, and fantasy football standings)
EUCs (pretty much everything that does not get filtered into the first two bins)

Identifying all the spreadsheets can be done manually or using an automated “discovery” tool. Even in the very smallest institutions, manual discovery is too big a job for a single person. Typically, individual business unit heads will be tasked with identifying all of the EUCs in use within their various realms and reporting them to a central EUC oversight coordinator. The advantage of this approach is that it enables non-EUC spreadsheets to be filtered out before they get to the central EUC oversight coordinator, which makes that person’s job easier. The disadvantage is that it is unlikely to capture every EUC. Business unit heads are incentivized to apply a sub-optimal set of criteria when determining whether a spreadsheet should be classified as an EUC. They are likely to overlook files that an impartial EUC coordinator might wish to review.

An automated discovery tool avoids this problem by grabbing everything: it collects every spreadsheet in a given shared drive or folder structure and then scans and evaluates each one for formulas and levels of complexity that contribute to an EUC’s risk rating. Automated scanning tools enable central EUC coordinators to peer into how individual business units are using spreadsheets without having to rely on the judgment of business unit heads to determine what is worthy of review. The downside is that, even with all the automated filtering discovery tools are capable of, they are likely to result in the “discovery” of a lot of spreadsheets that ultimately do not need to go through an EUC review. Paradoxically, the more automated the discovery process is, the more manual the winnowing needs to be.

A Hybrid Approach to End-User Computing Controls

As with many things, the best solution probably lies somewhere in the middle—drawing from the benefits of both top-down and bottom-up approaches.

While a pure top-down approach is usually too involved to be practical on its own, elements of a top-down approach can enlighten and facilitate a bottom-up process. For example, a bottom-up process may identify several spreadsheets whose complexity and perceived importance to the departments that use them make them appear to be high-risk EUCs in need of review. However, a top-down review may reveal that these spreadsheets ultimately do not contribute to financial or enterprise-wide management reporting. It could be that the importance of some spreadsheets does not extend far enough beyond the business unit that owns them to require an independent review. Furthermore, being able to connect the dots between spreadsheets that are identified using a bottom-up approach and individual financial statement/management report entries can help ensure that all important entries are accounted for.

A hybrid approach, one that is informed both by an understanding of critical reporting items and by a series of comprehensive, automated discovery scans, combines the virtues of both methods and is most likely to yield an EUC inventory that is both comprehensive and aligned with an institution’s risk profile.

Evaluating Supervised and Unsupervised Learning Models

Model evaluation (including evaluating supervised and unsupervised learning models) is the process of objectively measuring how well machine learning models perform the specific tasks they were designed to do—such as predicting a stock price or appropriately flagging credit card transactions as fraud. Because each machine learning model is unique, optimal methods of evaluation vary depending on whether the model in question is “supervised” or “unsupervised.” Supervised machine learning models make specific predictions or classifications based on labeled training data, while unsupervised machine learning models seek to cluster or otherwise find patterns in unlabeled data.

Unsupervised Learning

Common unsupervised learning techniques include clustering, anomaly detection, and neural networks. Each technique calls for a different method of evaluating performance. We’ll focus on clustering models as an example. Clustering is the task of grouping a set of objects in such a way that objects in the same cluster are more like each other than they are to objects in other clusters. Various algorithms are capable of clustering, including k-means and hierarchical, which differ in their definitions of a cluster and how to find one.

Evaluating Unsupervised Learning Models

Let’s assume that we need to cluster banking customers together into groups based on the amount and magnitude of risk they pose. After the clustering algorithm has grouped the customers into distinct clusters, we need to evaluate how well those clusters were formed. The lack of labels on an unsupervised learning model’s training data makes evaluation problematic because there is nothing to which the model’s results can be meaningfully compared. If we were to manually group these customers, we could then compare our manual groupings with the algorithm’s, but often this is not an option due to time or labor constraints, so we need a more efficient way to determine how well the algorithm performed.

One way would be to determine 1) how close each customer within each cluster is to every other customer in its cluster (the “intra-cluster” distance) and 2) how close each cluster of customers is to other clusters (the “inter-cluster” distance), and then to compare the two distances. Models that produce relatively small intra-cluster distances and relatively large inter-cluster distances evaluate favorably because they appear to be doing a good job of grouping like customers into clusters with distinct characteristics.
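The comparison of intra-cluster and inter-cluster distances can be sketched in a few lines. The example below assumes customers are represented as points with two numeric risk attributes, uses Euclidean distance, and measures inter-cluster distance between cluster centroids; the cluster names and coordinates are hypothetical, and other distance definitions are equally valid.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(clusters):
    """Average pairwise distance between points that share a cluster."""
    distances = [
        math.dist(a, b)
        for points in clusters.values()
        for a, b in combinations(points, 2)
    ]
    return sum(distances) / len(distances)

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def mean_inter_cluster_distance(clusters):
    """Average distance between the centroids of different clusters."""
    centroids = [centroid(points) for points in clusters.values()]
    distances = [math.dist(a, b) for a, b in combinations(centroids, 2)]
    return sum(distances) / len(distances)

# Hypothetical customers, already assigned to clusters by some algorithm
clusters = {
    "low_risk": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "high_risk": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}
intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra: {intra:.3f}, inter: {inter:.3f}, ratio: {inter / intra:.1f}")
```

A large ratio of inter-cluster to intra-cluster distance, as in this toy case, suggests tight and well-separated clusters.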

Supervised Learning

Within supervised learning there are techniques for both regression and classification tasks. Some techniques are suited to only one of these tasks, while others can be used for both. For example, linear regression can only be used for regression, while support vector machines and random forests can be used for either. Although each of these is a different technique, models built for the same task are evaluated using the same metrics, so we can compare them to one another directly. In our examples, we’ll focus on flagging credit card purchases as fraud, a classification task, and predicting housing prices, a regression task.

Evaluating Supervised Models

The task of evaluating how well a supervised learning model performs is more straightforward. Because supervised learning models learn from labeled training data, once they have been fitted they can be tested against data that comes from the same population and therefore carries the same labels.

For example, let’s say we need to classify whether a credit card transaction is fraudulent and we have a dataset of transactions with labels of either “fraud” or “not fraud.” We can (and sometimes do¹) train our model on all the available data, but this prevents us from fairly evaluating it because no “independent” data remains for testing and overfitting² becomes difficult to detect. This problem can be avoided by splitting the available data into training and testing sets.

This can be accomplished in various ways. For simplicity, we’ll first talk about splitting our dataset into two sets: a training set (typically 70% of the whole dataset) from which the model learns and a test set (the other 30%). Because the test set is withheld from the model during training, it can contribute to an unbiased evaluation of how well a model performs on previously unseen data. This protects against overfitting and allows us to evaluate how our model would perform “in the wild” on new data as it emerges.
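A 70/30 split of this kind might look like the sketch below. The transaction records are hypothetical, and the fixed seed is only there to make the shuffle reproducible.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction of them as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled transactions
transactions = [
    {"id": i, "label": "fraud" if i % 20 == 0 else "not fraud"}
    for i in range(100)
]
train, test = train_test_split(transactions)
print(len(train), len(test))
```

The model would be fitted on `train` only, with `test` touched just once at the end to measure performance.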

Cross-validation is another antidote for overfitting. Cross-validation involves partitioning data into multiple groups and then training and testing models on different group combinations. For example, in a 5-fold cross-validation we would split our transaction data set into five partitions of equal sizes. We would then train our model on four of those five partitions and test our model on the remaining partition. We would then repeat the process, selecting a different partition to be the test group and training a new model on the remaining set of four partitions. We would repeat three more times, for a total of five rounds of cross-validation, one for each fold. We will then have five different models, each having been trained and tested on a different subset of data and each having its own weights and prediction accuracy. At the end, we average the five test scores to estimate how well the modeling approach performs on unseen data; the final predictive model is then typically refit on all the available data.
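The following is a minimal sketch of 5-fold cross-validation. To stay self-contained it uses a deliberately simple, hypothetical "model" (a single amount threshold placed midway between the average fraud and non-fraud amounts) and reports the average of the five fold accuracies; the data, function names, and seed are all assumptions made for illustration.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def fit_threshold(amounts, labels):
    """Toy model: threshold halfway between mean fraud and mean non-fraud amounts."""
    fraud = [a for a, label in zip(amounts, labels) if label == 1]
    legit = [a for a, label in zip(amounts, labels) if label == 0]
    if not fraud or not legit:
        return float("inf")  # degenerate training fold: never predict fraud
    return (sum(fraud) / len(fraud) + sum(legit) / len(legit)) / 2

def accuracy(threshold, amounts, labels):
    """Share of transactions whose predicted label matches the true label."""
    predictions = [1 if a > threshold else 0 for a in amounts]
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

def cross_validate(amounts, labels, k=5):
    """Train on k-1 folds, test on the held-out fold, and average the scores."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(amounts), k):
        threshold = fit_threshold([amounts[i] for i in train_idx],
                                  [labels[i] for i in train_idx])
        scores.append(accuracy(threshold,
                               [amounts[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / len(scores), scores

# Hypothetical data: larger transactions are labeled fraud (1)
amounts = [float(a) for a in range(0, 1000, 10)]
labels = [1 if a > 800 else 0 for a in amounts]
mean_score, fold_scores = cross_validate(amounts, labels)
print(f"mean accuracy across folds: {mean_score:.2f}")
```

Each observation lands in the test fold exactly once, so the averaged score uses all of the data for evaluation without ever testing a model on records it was trained on.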

Classification metrics are the measures against which models are evaluated. The simplest and most common such metric is accuracy. Accuracy is computed by dividing the number of correct predictions by the total number of predictions. In our supervised transaction classification model example, if we tested our model on one hundred transactions and correctly predicted their label (fraud/not fraud) for ninety-five of them, then the accuracy of our model is 95%.

Accuracy is the simplest, most understandable metric we can use, but we wouldn’t want to rely on accuracy alone because it doesn’t distinguish between false positives, transactions incorrectly classified as fraud, and false negatives, transactions incorrectly classified as non-fraud. For this we need a confusion matrix.

A confusion matrix is a 2-by-2 table that sorts predictions into one of four classifications: true positive, true negative, false positive, and false negative. Our transaction classification model might generate a confusion matrix like this one:

The confusion matrix indicates that, out of 100 total transactions, our model correctly predicted fraud four times and correctly predicted not fraud 91 times, yielding an overall accuracy of 95%. The confusion matrix, however, also enables us to see the number of times the model incorrectly predicted that a transaction was fraud—a false positive which occurred on two out of the 100 transactions. We can also see the number of times the model predicted a transaction was not fraud when it was—a false negative which occurred on three out of the 100 transactions.

While the model appears to boast a fairly strong “true negative” rate, the percentage of non-fraud transactions correctly classified as such (91/(91+2) = 97.8%), the model’s “true positive” rate, the percentage of fraudulent transactions correctly flagged as such (4/(4+3) = 57.1%), is far less attractive. Breaking down the model’s performance in this way paints a different and more complete picture than the 95% accuracy rate alone.
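The arithmetic above can be reproduced directly from the four confusion matrix counts given in the text (4 true positives, 91 true negatives, 2 false positives, 3 false negatives). The function name below is just an illustrative choice.

```python
def classification_rates(tp, tn, fp, fn):
    """Derive accuracy and the true positive/negative rates from the four counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),  # share of actual fraud that was flagged
        "true_negative_rate": tn / (tn + fp),  # share of non-fraud left unflagged
    }

rates = classification_rates(tp=4, tn=91, fp=2, fn=3)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

This yields the same 95% accuracy, 57.1% true positive rate, and 97.8% true negative rate discussed above.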

Evaluation methods apply to regression models as well. Let’s assume we have a regression model that’s been trained to predict housing prices. The model’s predicted prices can be compared with actual prices using the mean squared error, which is the average of the squared differences between the actual and predicted prices. The lower the mean squared error, the better the model.
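As a small worked example, the sketch below computes the mean squared error for three hypothetical home sales. The prices are invented for illustration.

```python
import math

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual and predicted sale prices, in dollars
actual_prices = [250_000, 310_000, 480_000]
predicted_prices = [260_000, 300_000, 500_000]

mse = mean_squared_error(actual_prices, predicted_prices)
rmse = math.sqrt(mse)  # back in dollar units, which is easier to interpret
print(mse, round(rmse))
```

Here the errors are 10,000, 10,000, and 20,000 dollars, giving a mean squared error of 200,000,000 and a root mean squared error of roughly 14,142 dollars.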

All models need to be subjected to evaluation—when they are built and throughout their lives. Supervised and unsupervised learning models pose different sorts of evaluation challenges, and selecting the right type of metrics is key.

[1] Many fraud detection models are also built using neural networks and other unsupervised learning techniques.

[2] Overfitting occurs when a model makes generalizations about coincidental data elements that in reality are not germane to the analysis. Continuing the example of fraud detection, overfitting may occur if model training detects a correlation between the length of a customer’s name (or whether the customer’s name begins with a vowel) and the likelihood that a transaction is fraudulent. Testing is likely to expose random, spurious correlations of this type for what they are, as they are not likely to be replicated in the test data set that has been held out from the training data. A model that has been “overfit” to its training data is likely to return a considerably lower accuracy ratio on the test data.

https://en.wikipedia.org/wiki/Cross-validation_(statistics)

http://www.oreilly.com/data/free/files/evaluating-machine-learning-models.pdf

https://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_and_assessment

http://www.mit.edu/~9.54/fall14/slides/Class13.pdf

https://stats.stackexchange.com/questions/79028/performance-metrics-to-evaluate-unsupervised-learning

Feature Selection – Machine Learning Methods

Feature selection in machine learning refers to the process of isolating only those variables (or “features”) in a dataset that are pertinent to the analysis. Failure to do this effectively has many drawbacks, including: 1) unnecessarily complex models with difficult-to-interpret outcomes, 2) longer computing time, and 3) collinearity and overfitting. Effective feature selection eliminates redundant variables and keeps only the best subset of predictors in the model, thus making it possible to represent the data in the simplest way.

This post begins by identifying steps that must be taken to prepare datasets for meaningful analysis, and how machine learning can help. We then introduce and discuss some commonly used machine learning techniques for variable selection.

Data Cleansing

Real world data contains a wide range of holes, noise, and inconsistencies. Before doing any statistical analysis, it is crucial to ensure that the data can be meaningfully analyzed. In practice, data cleansing is often the most time-consuming part of data analysis. This upfront investment is necessary, however, because the quality of data has a direct bearing on the reliability of model outputs.

Various machine learning projects require different sorts of data cleansing steps, but in general, when people speak of data cleansing, they are referring to the following specific tasks.

Cleaning Missing Values

Many machine learning techniques do not support data with missing values. To address this, we first need to understand why data are missing. Missing values usually occur simply because no information is provided, but other circumstances can lead to data holes as well. For instance, setting incorrect data types for attributes when data is extracted and integrated from multiple sources can cause data loss.

One way to investigate missing values is to identify patterns for missing data. For example, missing answers for certain questions from female respondents in a survey may indicate that those questions are only asked of male respondents. Another example might involve two loan records that share the same ID. If the second record contains blank values for every attribute except ‘Market Price,’ then the second record is likely simply updating the market price of the first record.

Once the early-stage evaluation of missing data is complete, we can set about determining how to address the problem. The easiest way to handle missing values is simply to ignore the records that contain them. However, this solution is not always practical. If a relatively large portion of the dataset contains missing values, then removing all of them could result in remaining data that may not be a good representation of the initial population. In that case, rather than filtering out relevant rows or attributes, a more proper approach is to impute missing values with sensible values.

A typical imputing method for categorical variables involves replacing the missing values with the most frequent value or with a newly created “unknown” category. For numeric variables, missing values might be replaced with mean or median values. Other methods for dealing with missing values exist as well, e.g., listwise deletion, which deletes rows with missing data, and multiple imputation, a more advanced method for substituting missing values.
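These simple imputation rules can be sketched as follows. The example assumes missing entries are represented as `None`; the column values and the helper names are hypothetical.

```python
from collections import Counter
from statistics import median

def impute_numeric(values, strategy="median"):
    """Replace None with the median (or mean) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

def impute_categorical(values, strategy="most_frequent"):
    """Replace None with the most frequent category or with an 'unknown' label."""
    observed = [v for v in values if v is not None]
    if strategy == "most_frequent":
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = "unknown"
    return [fill if v is None else v for v in values]

# Hypothetical loan attributes with holes
credit_scores = [620, None, 700, 780, None, 900]
occupancy = ["owner", None, "owner", "investor"]

print(impute_numeric(credit_scores))                   # median of observed values is 740
print(impute_numeric(credit_scores, strategy="mean"))  # mean of observed values is 750
print(impute_categorical(occupancy))
print(impute_categorical(occupancy, strategy="unknown"))
```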

Reducing Noise in Data

“Noise” in data refers to erroneous values and outliers. Noise is an unavoidable problem which can be caused by human mistakes in data entry, technical problems, and many other factors. Noisy data adversely influences model performance, so its detection and removal has a key role to play in the data cleaning process.

There are two major noise types in data: class noise and attribute noise. Class noise often occurs in categorical variables and can include: 1) non-standardized class labels, 2) duplicate records mapping to different class labels, and 3) mislabeled records. Attribute noise refers to corruptive values and outliers, such as percentages inappropriately greater than 100% and placeholders (e.g., 999,000).¹

There are many ways to deal with noisy data. Certain types of noise can be easily identified by sorting the data, thus isolating text input where numeric input is expected and other placeholders. Other noise can be addressed only using statistical methods. Clustering analysis groups the data by similarity and can help with detecting irrelevant objects and outliers. Data binning is used to reduce the impact of observation errors by combining ‘neighborhood’ data into a small number of bins. Advanced smoothing algorithms, including moving average and loess, fit the data into regression functions to eliminate the effect due to random variation and allow important patterns to stand out.
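Of the smoothing methods mentioned, the moving average is the simplest to illustrate. The sketch below applies a trailing three-point average to a hypothetical series containing one spike; the window length is an arbitrary choice.

```python
def moving_average(values, window=3):
    """Trailing simple moving average; the output is window - 1 points shorter."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical series with a single noisy spike
series = [10, 12, 50, 11, 13, 12]
smoothed = moving_average(series)
print([round(v, 2) for v in smoothed])
```

The spike of 50 is spread across three neighboring averages, so the smoothed series peaks well below the raw maximum.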

Data Normalization

Data normalization converts numerical values into specific ranges to meet the needs of a model. Performing data normalization makes it possible to aggregate data with different scales. Several algorithms require normalized data. For example, it is necessary to normalize data before feeding it into principal component analysis (PCA) so that all variables have zero mean and unit variance and therefore the same weight. This also applies when fitting support vector machines (SVMs), which generally perform best when the input data is scaled to a range such as [0,1] or [-1,1]. Unnormalized data can slow model convergence and skew results.

The most common way of normalizing data involves the Z-score. Also known as standard-score normalization, this approach normalizes each value by subtracting the mean and dividing the difference by the standard deviation. Z-score normalization is often used when the min and max are unknown. Another common method is min-max feature scaling, which brings all values into the range [0,1] by dividing the difference between each value and the min by the difference between the max and min. Other normalization methods include studentized residual, t-statistics, and coefficient of variation.
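Both normalization methods reduce to one-line formulas, sketched below. The example assumes the population standard deviation for the Z-score (the sample version would divide by n - 1 instead), and the balances are hypothetical.

```python
from statistics import mean, pstdev

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical loan balances
balances = [100.0, 200.0, 300.0, 400.0, 500.0]
standardized = z_score(balances)
scaled = min_max(balances)
print([round(v, 3) for v in standardized])
print(scaled)
```

After standardization the values have zero mean and unit variance; after min-max scaling they run from 0 to 1 in equal steps of 0.25.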

Feature Selection Methods²

Stepwise Procedures

A stepwise procedure adds or subtracts individual features from a model until the optimal mix is identified. Stepwise procedures take three forms: backward elimination, forward selection, and stepwise regression.

Backward elimination is the simplest method. It fits the model using all available features and then systematically removes features one at a time, beginning with the feature with the highest p-value (provided the p-value exceeds a given threshold, usually 5%). The model is refit after each elimination, and the process repeats until a model is identified in which each feature’s p-value falls below the threshold.

Forward selection is the opposite of backward elimination. It includes no variables in the model at first and then systematically adds features one at a time, beginning with the lowest p-value (provided the p-value falls below a threshold). The model is refit after each addition, and the process repeats until additional features do not help model performance.

Stepwise regression combines backward elimination and forward selection by allowing a feature to be added or dropped at each iteration. Using this method, a newly added variable in an early stage may be removed later, and vice versa.

Criterion-Based Procedures

A variable’s p-value is not the only statistic that can be used for feature selection. Penalized-likelihood criteria, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), are also valuable. Lower AICs and BICs indicate a better balance between goodness of fit and model complexity. For a linear model, both can be written (up to an additive constant) as n log(RSS/n) + kp, where RSS is the residual sum of squares (which decreases as model complexity increases), n is the sample size, p is the number of predictors, and k is 2 for AIC and log(n) for BIC. Both criteria penalize larger models as p goes up, and BIC penalizes model complexity more heavily whenever log(n) exceeds 2 (that is, for n of 8 or more), which explains why BIC tends to favor smaller models in comparison to AIC. Other criteria are 1) adjusted R², which increases only if a new feature improves model performance more than would be expected by chance, 2) PRESS, which sums the squares of the prediction residuals, and 3) Mallows’s C_p statistic, which estimates the average MSE of prediction.
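The formula n log(RSS/n) + kp is easy to evaluate directly. The sketch below compares a smaller and a larger model on hypothetical RSS values, chosen so that the two criteria disagree; the numbers are illustrative only.

```python
import math

def information_criterion(rss, n, p, criterion="aic"):
    """n * log(RSS / n) + k * p, with k = 2 for AIC and k = log(n) for BIC."""
    k = 2 if criterion == "aic" else math.log(n)
    return n * math.log(rss / n) + k * p

n = 100
small = {"rss": 120.0, "p": 3}  # hypothetical 3-predictor model
large = {"rss": 105.0, "p": 8}  # hypothetical 8-predictor model with a better fit

aic_small = information_criterion(small["rss"], n, small["p"], "aic")
aic_large = information_criterion(large["rss"], n, large["p"], "aic")
bic_small = information_criterion(small["rss"], n, small["p"], "bic")
bic_large = information_criterion(large["rss"], n, large["p"], "bic")

print(f"AIC  small: {aic_small:.2f}  large: {aic_large:.2f}")
print(f"BIC  small: {bic_small:.2f}  large: {bic_large:.2f}")
```

With these inputs AIC is lower for the larger model while BIC is lower for the smaller one, reflecting BIC’s heavier penalty on the five extra predictors.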

Lasso and Ridge Regression

Lasso and ridge regressions are powerful techniques for keeping feature coefficients from growing too large. Both approaches reduce overfitting by penalizing large coefficients while still minimizing the difference between predicted and observed values, but they differ in the penalty term they add. Lasso adds a penalty proportional to the sum of the absolute values of the coefficients, which can shrink some coefficients all the way to zero and thereby eliminate those features from the model. Ridge adds a penalty proportional to the sum of the squared coefficients. Even though it does not shrink coefficients to exactly zero, it regularizes and constrains them to control variance.
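The different behavior of the two penalties is easiest to see in the one-predictor case, where both have closed-form solutions. The sketch below assumes a single centered predictor with no intercept and the objective scalings noted in the docstrings; the data and penalty weights are hypothetical.

```python
def ols_coefficient(x, y):
    """Unpenalized least-squares slope (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def ridge_coefficient(x, y, lam):
    """Minimizes 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

def lasso_coefficient(x, y, lam):
    """Minimizes 0.5 * sum((y - b*x)^2) + lam * |b| via soft-thresholding."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    shrunk = max(abs(sxy) - lam, 0.0)  # pull toward zero, stopping at zero
    return (shrunk if sxy >= 0 else -shrunk) / sxx

# Hypothetical centered predictor with a weak relationship to the response
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-0.4, -0.1, 0.0, 0.3, 0.2]

print(ols_coefficient(x, y))             # unpenalized estimate
print(ridge_coefficient(x, y, lam=10))   # shrunk, but still non-zero
print(lasso_coefficient(x, y, lam=2))    # shrunk all the way to zero
```

Ridge halves the coefficient in this example but never eliminates it, whereas a sufficiently large lasso penalty sets it to exactly zero, which is what makes lasso usable for feature selection.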

Lasso and ridge regression models have been widely used in finance since their introduction. A recent example used both these methods in predicting corporate bankruptcy.³ In this study, the authors discovered that these regression methods are optimal as they handle multicollinearity and minimize the numerical instability that may occur due to overfitting.

Dimensionality Reduction

“Dimensionality reduction” is a process of transforming an extraordinarily complex, “high-dimensional” dataset (i.e., one with thousands of variables or more) into a dataset that can tell the story using a significantly smaller number of variables.

The most popular linear technique for dimensionality reduction is principal component analysis (PCA). It converts complex dataset features into a new set of coordinates named principal components (PCs). PCs are created in such a way that each succeeding PC preserves the largest possible variance under the condition that it is uncorrelated with the preceding PCs. Keeping only the first several PCs in the model reduces data dimensionality and eliminates multi-collinearity among features.
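To show what the first principal component captures, the sketch below estimates it by power iteration on the sample covariance matrix. This is a simple stand-in chosen for readability; statistical libraries typically compute PCA with an eigendecomposition or singular value decomposition instead. The rate changes are hypothetical, are assumed to be on comparable scales already, and the method assumes the starting vector is not orthogonal to the dominant direction.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def first_principal_component(rows, iterations=200):
    """Return (unit direction, share of total variance) of the first PC."""
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, variance / total_variance

# Hypothetical changes in three interest rates that tend to move together
rate_changes = [
    [0.10, 0.09, 0.08],
    [-0.05, -0.04, -0.05],
    [0.02, 0.03, 0.02],
    [-0.08, -0.07, -0.06],
    [0.04, 0.04, 0.05],
]
direction, share = first_principal_component(rate_changes)
print([round(c, 3) for c in direction], round(share, 3))
```

Because the three series move almost in parallel, the first component loads on all of them with the same sign and accounts for nearly all of the variance, so a single variable summarizes most of the dataset.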

PCA has a couple of potential pitfalls: 1) PCA is sensitive to the scale of the original variables (data normalization is required before performing PCA), and 2) applying PCA makes it harder to interpret the influence of individual features, since the PCs are no longer real variables. For these reasons, PCA is not a good choice for feature selection if interpretation of results is important.

Dimensionality reduction and specifically PCA have practical applications to fixed income analysis, particularly in explaining term-structure variation in interest rates. Dimensionality reduction has also been applied to portfolio construction and analytics. It is well known that the first eigenvector identified by PCA maximally captures the systematic risk (variation of returns) of a portfolio.⁴ Quantifying and understanding this risk is essential when balancing a portfolio.

[1] http://sci2s.ugr.es/noisydata
[2] http://www.biostat.jhsph.edu/~iruczins/teaching/jf/ch10.pdf
[3] Pereira, J. M., Basto, M., & da Silva, A. F. (2016). The Logistic Lasso and Ridge Regression in Predicting Corporate Failure. Procedia Economics and Finance, v.39, pp.634-641.
[4] Alexander, C. (2001). Market models: A guide to financial data analysis. John Wiley & Sons.

Fed MBS Runoff Portends More Negative Vega for the Broader Market

With much anticipation and fanfare, the Federal Reserve is finally on track to reduce its MBS holdings. Guidance from the September FOMC meeting reveals that the Fed will allow its MBS holdings to “run off,” reducing its position via prepayments as opposed to selling it off. What does this runoff mean for the market? In the long term, it means a large increase in the net supply of Agency MBS and, with it, an increase in overall implied and realized volatility.

MBS: The Largest Net Source of Options in the Fixed-Income Market

We start this analysis with some basic background on the U.S. MBS market. U.S. homeowners, by and large, finance home purchases using fixed-rate 30-year mortgages. These fixed-rate mortgages amortize over time, allowing the homeowner to pay principal and interest in even, monthly payments. A homeowner has the option to pay off this mortgage early for any reason, which homeowners tend to do either when they move, often referred to as turnover, or when prevailing mortgage rates drop significantly below their current mortgage rate, referred to as refinancing or “refis.” As a rough rule of thumb, turnover has varied between 6% and 10% per annum as economic conditions vary, whereas refis can drive prepayments to 40% per annum under current lending conditions.[1] Rate refis account for most of a mortgage’s cash flow volatility.

If the homeowner is long the option to refinance, the MBS holder is short that same option. Fixed-rate MBS shorten due to prepayments as rates drop and extend as rates rise, putting the MBS holder into a short convexity (gamma) and short vega position. Some MBS holders hedge this risk explicitly, buying short- and longer-dated options to cover their short gamma/short vega risk. Others hedge dynamically, including money managers and long-only funds that tend to target a duration bogey. One way or another, the short-volatility risk from MBS is transmitted into the larger fixed-income market. Hence, the rates market is net short vol risk.

While not all investors hedge their short-volatility position, the aggregate market tends to hedge a similar amount of the short-options position over time. That held until the Fed, the largest buyer of MBS, entered the market. From the start of Quantitative Easing, the Fed purchased progressively more of the MBS market, until by the end of 2014 it held just under 30% of the agency MBS market. Over the course of five years, the effective size of the MBS market ex-Fed shrank by more than a quarter. Since the Fed doesn’t hedge its position, either explicitly through options or implicitly through delta-hedging, the size of the market’s net-short volatility position dropped by a similar fraction.[2]
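The per-annum prepayment rates quoted in this section can be translated into monthly rates with the standard compounding convention, under which a constant monthly prepayment rate (often called SMM) compounds to the annual rate (CPR). The article itself does not use these terms; the function names below are illustrative.

```python
def cpr_to_smm(cpr):
    """Monthly prepayment rate that compounds to the annual rate `cpr`."""
    return 1.0 - (1.0 - cpr) ** (1.0 / 12.0)

def smm_to_cpr(smm):
    """Annual prepayment rate implied by a constant monthly rate `smm`."""
    return 1.0 - (1.0 - smm) ** 12

# Rule-of-thumb annual rates from the text: turnover of 6% to 10%, refis up to 40%.
for label, cpr in [("low turnover", 0.06), ("high turnover", 0.10), ("refi wave", 0.40)]:
    print(f"{label:13s} annual {cpr:.0%} -> monthly {cpr_to_smm(cpr):.3%}")
```

Because the conversion compounds, the monthly rate is slightly above one-twelfth of the annual rate, and the gap widens as prepayments rise.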

The Fed’s Balance Sheet

As of early October 2017, the Federal Reserve owned $1.77 trillion of agency MBS, or just under 30% of the outstanding agency MBS market. The Fed publishes its holdings weekly on the New York Fed’s website. In the chart below, we summarize the Fed’s 30yr MBS holdings, which make up roughly 90% of the Fed’s MBS holdings.[3]

Runoff from the Fed

Following its September meeting, the Fed announced it will reduce its balance sheet by not reinvesting runoff from its Treasury and MBS portfolios. If the Fed sticks to its plan, monthly runoff from MBS will reach $20B by 2018 Q1. Assuming no growth in the aggregate mortgage market, runoff from these MBS will be replaced with the same amount of new, at-the-money MBS passthroughs. Since the Fed is not reinvesting paydowns, these new passthroughs will re-enter the non-Fed-held MBS market, which does hedge volatility by either buying options or delta-hedging. Given the expected runoff rate of the Fed’s portfolio, we can now estimate the vega exposure of new mortgages entering the wider (non-Fed-held) market. When fully implemented, we estimate that $20B in new MBS represents roughly $34 million in vega hitting the market each month. To put that in perspective, that is roughly equivalent to $23 billion notional of 3yr->5yr ATM swaption straddles hitting the market each and every month.
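The estimates above imply a few simple ratios. The sketch below only rearranges the three figures quoted in the text ($20B monthly runoff, $34 million of vega, $23 billion of straddles); the variable names are illustrative, and because the article does not state the volatility unit behind its vega figure, the ratios are per whatever unit the authors used.

```python
# Figures quoted in the text.
monthly_runoff = 20e9          # new MBS entering the non-Fed market each month, USD
monthly_vega = 34e6            # estimated vega of that supply, USD
straddle_notional = 23e9       # equivalent 3yr->5yr ATM straddle notional, USD

# Implied vega per dollar of notional for each instrument.
mbs_vega_ratio = monthly_vega / monthly_runoff
straddle_vega_ratio = monthly_vega / straddle_notional

# Cumulative vega supplied over twelve months at the fully implemented pace.
annual_vega = monthly_vega * 12

print(f"MBS vega per $100 notional:      ${100 * mbs_vega_ratio:.3f}")
print(f"Straddle vega per $100 notional: ${100 * straddle_vega_ratio:.3f}")
print(f"Vega supplied over a year:       ${annual_vega / 1e6:.0f} million")
```

At the fully implemented pace, the monthly figure accumulates to roughly $408 million of vega over a year, which is why the effect builds slowly but persistently.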

Conclusion

While the Fed isn’t selling its MBS holdings, portfolio runoff will have a significant impact on rate volatility. Runoff implies significant net issuance ex-Fed. It’s reasonable to expect increased demand for options hedging, as well as increased delta-hedging, which should drive both implied and realized vol higher over time. This change will manifest itself slowly as monthly prepayments shrink the Fed’s position. But the reintroduction of negative vega into the wider market represents a paradigm shift that may lead to a more volatile rates market over time.

[1] In the early 2000s, prepayments hit their all-time highs with the aggregate market prepaying in excess of 60% per annum.
[2] This is not entirely accurate. The short-vol position in a mortgage passthrough is also a function of its note rate (GWAC) with respect to the prevailing market rate, and the mortgage market has a distribution of note rates. But the statement is broadly true.
[3] The remaining Fed holdings are primarily 15yr MBS pass-throughs.
