Linkedin    Twitter   Facebook

Get Started
Log In

Linkedin

Category: Article

A Brief Introduction to Agile Philosophy

Reducing time to delivery by developing in smaller incremental chunks and incorporating an ability to pivot is the cornerstone of Agile software development methodology.

“Agile” software development is a rarity among business buzz words in that it is actually a fitting description of what it seeks to accomplish. Optimally implemented, it is capable of delivering value and efficiency to business-IT partnerships by incorporating flexibility and an ability to pivot rapidly when necessary.

As a technology company with a longstanding management consulting pedigree, RiskSpan values the combination of discipline and flexibility inherent to Agile development and regularly makes use of the philosophy in executing client engagements. Dynamic economic environments contribute to business priorities that are seemingly in a near-constant state of flux. In response to these ever-evolving needs, clients seek to implement applications and application feature changes quickly and efficiently to realize business benefits early.

This growing need for speed and “agility” makes Agile software development methods an increasingly appealing alternative to traditional “waterfall” methodologies. Waterfall approaches move in discrete phases—treating analysis, design, coding, and testing as individual, stand-alone components of a software project. Historically, when the cost of changing plans was high, such a discrete approach worked best. Nowadays, however, technological advances have made changing the plan more cost-feasible. In an environment where changes can be made inexpensively, rigid waterfall methodologies become unnecessarily counterproductive for at least four reasons:

  1. When a project runs out of time (or money), individual critical phases—often testing—must be compressed, and overall project quality suffers.
  2. Because working software isn’t produced until the very end of the project, it is difficult to know whether the project is really on track prior to project completion.
  3. Not knowing whether established deadlines will be met until relatively late in the game can lead to schedule risks.
  4. Most important, discrete phase waterfalls simply do not respond well to the various ripple effects created by change.

 

Continuous Activities vs. Discrete Project Phases

Agile software development methodologies resolve these traditional shortcomings by applying techniques that focus on reducing overhead and time to delivery. Instead of treating fixed development stages as discrete phases, Agile treats them as continuous activities. Doing things simultaneously and continuously—for example, incorporating testing into the development process from day one—improves quality and visibility, while reducing risk. Visibility improves because being halfway through a project means that half of a project’s features have been built and tested, rather than having many partially built features with no way of knowing how they will perform in testing. Risk is reduced because feedback comes in from earliest stages of development and changes without paying exorbitant costs. This makes everybody happy.

 

Flexible but Controlled

Firms sometimes balk at Agile methods because of a tendency to equate “flexibility” and “agility” with a lack of organization and planning, weak governance and controls, and an abandonment of formal documentation. This, however, is a misconception. “Agile” does not mean uncontrolled—on the contrary, it is no more or less controlled than the existing organizational boundaries of standardized processes into which it is integrated. Most Agile methods do not advocate any particular methodology for project management or quality control. Rather, their intent is on simplifying the software development approach, embracing changing business needs, and producing working software as quickly as possible. Thus, Agile frameworks are more like a shell which users of the framework have full flexibility to customize as necessary.

 

Frameworks and Integrated Teams

Agile methodologies can be implemented using a variety of frameworks, including Scrum, Kanban, and XP. Scrum is the most popular of these and is characterized by producing a potentially shippable set of functionalities at the end of every iteration in two-week time boxes called sprints. Delivering high-quality software at the conclusion of such short sprints requires supplementing team activities with additional best practices, such as automated testing, code cleanup and other refactoring, continuous integration, and test-driven or behavior-driven development.

Agile teams are built around motivated individuals subscribing what is commonly referred to as a “lean Agile mindset.” Team members who embrace this mindset share a common vision and are motivated to contribute in ways beyond their defined roles to attain success. In this way, innovation and creativity is supported and encouraged. Perhaps most important, Agile promotes building relationships based on trust among team members and with the end-user customer in providing fast and high-quality delivery of software. When all is said and done, this is the aim of any worthwhile endeavor. When it comes to software development, Agile is showing itself to be an impressive means to this end.


CECL–Why Implement Now?

FASB permits early adoption of CECL for fiscal years beginning after December 15, 2018, including interim periods within the fiscal year. The decision of whether to pursue formal early adoption is a complex one hinging on specific factors that vary among institutions. We are finding, however, that early implementation of a CECL solution offers many potential benefits regardless of whether an institution opts for early adoption.

The benefits of implementing a CECL solution comfortably in advance of adoption are analogous to those of a flight simulator. Both enable professionals to gain insights into how a new system functions and reacts to various influences in a riskless environment. Like a flight simulator, early CECL system implementation enables institutions to move forward with this important standard in a controlled and well-considered manner. It benefits institutions by allowing time for analysis and supporting optimal decision-making.

It accomplishes this in the following ways.

 

Enhanced Data Readiness

Data readiness is a crucial factor in CECL implementation. Analysts and risk managers need to know whether their data will be sufficient to support effective functioning of a credit loss forecasting model specifically designed for their portfolio. Early implementation facilitates answering this question in advance by enabling institutions to assess the following:

  1. Internal data completeness
  2. Internal data reliability
  3. Internal data accessibility
  4. Externally sourced forecasting elements

Data preparation often requires a considerable investment of time and effort that can span months.  Obtaining data and developing interfaces to legacy systems and other sources requires planning and sufficient time to develop and test queries. Early implementation provides a safe environment for exploring data nuances and figuring out what works best.

 

Appropriate Cohort Segmentation

Loans and revolving credits are segmented into risk-related cohorts for credit loss forecasting purposes.  Clients frequently ask whether existing segments should be revisited or revised in a CECL implementation. The answer, like most, depends on many factors. However, when new segmentation is called for, it takes time to analyze the effect of risk factors on losses and to devise the new segmentation strategy. That work would ideally begin well in advance of formal CECL adoption, along with testing methodologies for each segment.

 

Early Operational and Internal Control Insights

Capturing the real-time data necessary to book lifetime losses at origination will pose new and daunting challenges. Traditional ALLL processes historically enabled banks to largely ignore originations occurring near the end of an accounting period. CECL changes all this by requiring day-one losses to be booked upon origination, thus potentially taxing data collection processes that likely will need to be streamlined in order to gather the data necessary to produce loss forecasts in time for the accounting close.

In addition to overcoming the obvious operational obstacles associated with this degree of data processing, early implementation gives banks an advance opportunity to assess the internal controls issues that will inevitably arise. Datasets and models that previously were not subject to SOX and financial reporting control testing may need to be reviewed in a new light. Controls around the databases and models supporting the life-of-loan calculation will almost certainly need to be enhanced to meet financial reporting expectations. Figuring out how to address all this in the relative safety of an early-implementation simulator is preferable to learning it on the fly.

Capital and Strategic Planning Benefits

Because CECL is generally expected to increase loan loss reserves, the effect on bank capital should be understood as soon as possible. Banks will want to understand the potential impact of CECL on capital to inform discussions with the board, auditors, regulators, and shareholders and to factor into capital planning in advance of formal adoption.

It is particularly important for banks that are active in mergers and acquisitions to understand the implications of CECL on earnings. This will inform efforts currently underway or planned for the next couple of years. Early CECL implementation also provides the ability to begin making lending and origination decisions informed by CECL.


Compared with the relatively complex decisions surrounding whether to pursue early CECL adoption, deciding whether to implement a CECL solution early is straightforward. Regardless of whether a bank opts for early adoption, early implementation promises a range of benefits that will support optimal risk management decisions going forward, regardless of when CECL is ultimately adopted.


Hands-On Machine Learning–Predicting Loan Delinquency

The ability of machine learning models to predict loan performance makes them particularly interesting to lenders and fixed-income investors. This expanded post provides an example of applying the machine learning process to a loan-level dataset in order to predict delinquency. The process includes variable selection, model selection, model evaluation, and model tuning.

The data used in this example are from the first quarter of 2005 and come from the publicly available Fannie Mae performance dataset. The data are segmented into two different sets: acquisition and performance. The acquisition dataset contains 217,000 loans (rows) and 25 variables (columns) collected at origination (Q1 2005). The performance dataset contains the same set of 217,000 loans coupled with 31 variables that are updated each month over the life of the loan. Because there are multiple records for each loan, the performance dataset contains approximately 16 million rows.

For this exercise, the problem is to build a model capable of predicting which loans will become severely delinquent, defined as falling behind six or more months on payments. This delinquency variable was calculated from the performance dataset for all loans and merged with the acquisition data based on the loan’s unique identifier. This brings the total number of variables to 26. Plenty of other hypotheses can be tested, but this analysis focuses on just this one.

1          Variable Selection

An overview of the dataset can be found below, showing the name of each variable as well as the number of observations available

                                            Count
LOAN_IDENTIFIER                             217088
CHANNEL                                     217088
SELLER_NAME                                 217088
ORIGINAL_INTEREST_RATE                      217088
ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB)     217088
ORIGINAL_LOAN_TERM                          217088
ORIGINATION_DATE                            217088
FIRST_PAYMENT_DATE                          217088
ORIGINAL_LOAN-TO-VALUE_(LTV)                217088
ORIGINAL_COMBINED_LOAN-TO-VALUE_(CLTV)      217074
NUMBER_OF_BORROWERS                         217082
DEBT-TO-INCOME_RATIO_(DTI)                  201580
BORROWER_CREDIT_SCORE                       215114
FIRST-TIME_HOME_BUYER_INDICATOR             217088
LOAN_PURPOSE                                217088
PROPERTY_TYPE                               217088
NUMBER_OF_UNITS                             217088
OCCUPANCY_STATUS                            217088
PROPERTY_STATE                              217088
ZIP_(3-DIGIT)                               217088
MORTGAGE_INSURANCE_PERCENTAGE                34432
PRODUCT_TYPE                                217088
CO-BORROWER_CREDIT_SCORE                    100734
MORTGAGE_INSURANCE_TYPE                      34432
RELOCATION_MORTGAGE_INDICATOR               217088

Most of the variables in the dataset are fully populated, with the exception of DTI, MI Percentage, MI Type, and Co-Borrower Credit Score. Many options exist for dealing with missing variables, including dropping the rows that are missing, eliminating the variable, substituting with a value such as 0 or the mean, or using a model to fill the most likely value.

The following chart plots the frequency of the 34,000 MI Percentage values.

The distribution suggests a decent amount of variability. Most loans that have mortgage insurance are covered at 25%, but there are sizeable populations both above and below. Mortgage insurance is not required for the majority of borrowers, so it makes sense that this value would be missing for most loans.  In this context, it makes the most sense to substitute the missing values with 0, since 0% mortgage insurance is an accurate representation of the state of the loan. An alternative that could be considered is to turn the variable into a binary yes/no variable indicating if the loan has mortgage insurance, though this would result in a loss of information.

The next variable with a large number of missing values is Mortgage Insurance Type. Querying the dataset reveals that that of the 34,400 loans that have mortgage insurance, 33,000 have type 1 borrower paid insurance and the remaining 1,400 have type 2 lender paid insurance. Like the mortgage insurance variable, the blank values can be filled. This will change the variable to indicate if the loan has no insurance, type 1, or type 2.

The remaining variable with a significant number of missing values is Co-Borrower Credit Score, with approximately half of its values missing. Unlike MI Percentage, the context does not allow us to substitute missing values with zeroes. The distribution of both borrower and co-borrower credit score as well as their relationship can be found below.

As the plot demonstrates, borrower and co-borrower credit scores are correlated. Because of this, the removal of co-borrower credit score would only result in a minimal loss of information (especially within the context of this example). Most of the variance captured by co-borrower credit score is also captured in borrower credit score. Turning the co-borrower credit score into a binary yes/no ‘has co-borrower’ variable would not be of much use in this scenario as it would not differ significantly from the Number of Borrowers variable. Alternate strategies such as averaging borrower/co-borrower credit score might work, but for this example we will simply drop the variable.

In summary, the dataset is now smaller—Co-Borrower Credit Score has been dropped. Additionally, missing values for MI Percentage and MI Type have been filled in. Now that the data have been cleaned up, the values and distributions of the remaining variables can be examined to determine what additional preprocessing steps are required before model building. Scatter matrices of pairs of variables and distribution plots of individual variables along the diagonal can be found below. The scatter plots are helpful for identifying multicollinearity between pairs of variables, and the distributions can show if a variable lacks enough variance that it won’t contribute to model performance.[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][vc_single_image image=”1089″][/vc_column][/vc_row][vc_row][vc_column][vc_column_text]The third row of scatterplots, above, reflects a lack of variability in the distribution of Original Loan Term. The variance of 3.01 (calculated separately) is very small, and as a result the variable can be removed—it will not contribute to any model as there is very little information to learn from. This process of inspecting scatterplots and distributions is repeated for the remaining pairs of variables. The Number of Units variable suffers from the same issue and can also be removed.

2          Heatmaps and Pairwise Grids

Matrices of scatterplots are useful for looking at the relationships between variables. Another useful plot is a heatmap and pairwise grid of correlation coefficients. In the plot below a very strong correlation between Original LTV and Original CLTV is identified.

This multicollinearity can be problematic for both the interpretation of the relationship between the variables and delinquency as well as the actual performance of some models.  To combat this problem, we remove Original CLTV because Original LTV is a more accurate representation of the loan at origination. Loans in this population that were not refinanced kept their original LTV value as CLTV. If CLTV were included in the model it would introduce information not available at origination to the model. The problem of allowing unexpected additional information in a dataset introduces an issue known as leakage, which will bias the model.

Now that the numeric variables have been inspected, the remaining categorical variables must be analyzed to ensure that the classes are not significantly unbalanced. Count plots and simple descriptive statistics can be used to identify categorical variables are problematic. Two examples below show the count of loans by state and by seller.

Inspecting the remaining variables uncovers that Relocation Indicator (indicating a mortgage issued when an employer moves an employee) and Product Type (fixed vs. adjustable rate) must be removed as they are extremely unbalanced and do not contain any information that will help the models learn. We also removed first payment date and origination date, which were largely redundant. The final cleanup results in a dataset that contains the following columns:

LOAN_IDENTIFIER 
CHANNEL 
SELLER_NAME
ORIGINAL_INTEREST_RATE
ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB) 
ORIGINAL_LOAN-TO-VALUE_(LTV) 
NUMBER_OF_BORROWERS
DEBT-TO-INCOME_RATIO_(DTI) 
BORROWER_CREDIT_SCORE
FIRST-TIME_HOME_BUYER_INDICATOR 
LOAN_PURPOSE
PROPERTY_TYPE 
OCCUPANCY_STATUS 
PROPERTY_STATE
MORTGAGE_INSURANCE_PERCENTAGE 
MORTGAGE_INSURANCE_TYPE 
ZIP_(3-DIGIT)

The final two steps before model building are to standardize each of the numeric variables and turn each categorical variable into a series of dummy or indicator variables. Numeric variables are scaled with mean 0 and standard deviation 1 so that it is easier to compare variables that have a different scale (e.g. interest rate vs. LTV). Additionally, standardizing is also a requirement for many algorithms (e.g. principal component analysis).

Categorical variables are transformed by turning each value of the variable into its own yes/no feature. For example, Property State originally has 50 possible values, so it will be turned into 50 variables (e.g. Alabama yes/no, Alaska yes/no).  For categorical variables with many values this transformation will significantly increase the number of variables in the model.

After scaling and transforming the dataset, the final shape is 199,716 rows and 106 columns. The target variable—loan delinquency—has 186,094 ‘no’ values and 13,622 ‘yes’ values. The data are now ready to be used to build, evaluate, and tune machine learning models.

3          Model Selection

Because the target variable loan delinquency is binary (yes/no) the methods available will be classification machine learning models. There are many classification models, including but not limited to: neural networks, logistic regression, support vector machines, decision trees and nearest neighbors. It is always beneficial to seek out domain expertise when tackling a problem to learn best practices and reduce the number of model builds. For this example, two approaches will be tried—nearest neighbors and decision tree.

The first step is to split the dataset into two segments: training and testing. For this example, 40% of the data will be partitioned into the test set, and 60% will remain as the training set. The resulting segmentations are as follows:

1.       60% of the observations (as training set)- X_train

2.       The associated target (loan delinquency) for each observation in X_train- y_train

3.       40% of the observations (as test set)- X_test

4.        The targets associated with the test set- y_test

Data should be randomly shuffled before they are split, as datasets are often in some type of meaningful order. Once the data are segmented the model will first be exposed to the training data to begin learning.

4          K-Nearest Neighbors Classifier

Training a K-neighbors model requires the fitting of the model on X_train (variables) and y_train (target) training observations. Once the model is fit, a summary of the model hyperparameters is returned. Hyperparameters are model parameters not learned automatically but rather are selected by the model creator.

 

The K-neighbors algorithm searches for the closest (i.e., most similar) training examples for each test observation using a metric that calculates the distance between observations in high-dimensional space.  Once the nearest neighbors are identified, a predicted class label is generated as the class that is most prevalent in the neighbors. The biggest challenge with a K-neighbors classifier is choosing the number of neighbors to use. Another significant consideration is the type of distance metric to use.

To see more clearly how this method works, the 6 nearest neighbors of two random observations from the training set were selected, one that is a non-default (0 label) observation and one that is not.

Random delinquent observation: 28919 
Random non delinquent observation: 59504

The indices and minkowski distances to the 6 nearest neighbors of the two random observations are found below. Unsurprisingly, the first nearest neighbor is always itself and the first distance is 0.

Indices of closest neighbors of obs. 28919 [28919 112677 88645 103919 27218 15512]
Distance of 5 closest neighbor for obs. 28919 [0 0.703 0.842 0.883 0.973 1.011]

Indices of 5 closest neighbors for obs. 59504 [59504 87483 25903 22212 96220 118043]
Distance of 5 closest neighbor for obs. 59504 [0 0.873 1.185 1.186 1.464 1.488]

Recall that in order to make a classification prediction, the kneighbors algorithm finds the nearest neighbors of each observation. Each neighbor is given a ‘vote’ via their class label, and the majority vote wins. Below are the labels (or votes) of either 0 (non-delinquent) or 1 (delinquent) for the 6 nearest neighbors of the random observations. Based on the voting below, the delinquent observation would be classified correctly as 3 of the 5 nearest neighbors (excluding itself) are also delinquent. The non-delinquent observation would also be classified correctly, with 4 of 5 neighbors voting non-delinquent.

Delinquency label of nearest neighbors- non delinquent observation: [0 1 0 0 0 0]
Delinquency label of nearest neighbors- delinquent observation: [1 0 1 1 0 1]

 

5          Tree-Based Classifier

Tree based classifiers learn by segmenting the variable space into a number of distinct regions or nodes. This is accomplished via a process called recursive binary splitting. During this process observations are continuously split into two groups by selecting the variable and cutoff value that results in the highest node purity where purity is defined as the measure of variance across the two classes. The two most popular purity metrics are the gini index and cross entropy. A low value for these metrics indicates that the resulting node is pure and contains predominantly observations from the same class. Just like the nearest neighbor classifier, the decision tree classifier makes classification decisions by ‘votes’ from observations within each final node (known as the leaf node).

To illustrate how this works, a decision tree was created with the number of splitting rules (max depth) limited to 5. An excerpt of this tree can be found below. All 120,000 training examples start together in the top box. From top to bottom, each box shows the variable and splitting rule applied to the observations, the value of the gini metric, the number of observations the rule was applied to, and the current segmentation of the target variable. The first box indicates that the 6th variable (represented by the 5th index ‘X[5]’) Borrower Credit Score was  used to  split  the  training  examples.  Observations where the value of Borrower Credit Score was below or equal to -0.4413 follow the line to the box on the left. This box shows that 40,262 samples met the criteria. This box also holds the next splitting rule, also applied to the Borrower Credit Score variable. This process continues with X[2] (Original LTV) and so on until the tree is finished growing to its depth of 5. The final segments at the bottom of the tree are the aforementioned leaf nodes which are used to make classification decisions.  When making a prediction on new observations, the same splitting rules are applied and the observation receives the label of the most commonly occurring class in its leaf node.

[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][vc_single_image image=”1086″][/vc_column][/vc_row][vc_row][vc_column][vc_column_text]A more advanced tree based classifier is the Random Forest Classifier. The Random Forest works by generating many individual trees, often hundreds or thousands. However, for each tree, number of variables considered at each split is limited to a random subset. This helps reduce model variance and de-correlate the trees (since each tree will have a different set of available splitting choices). In our example, we fit a random forest classifier on the training data. The resulting hyperparameters and model documentation indicate that by default the model generates 10 trees, considers a random subset of variables the size of the square root of all variables (approximately 10 in this case), has no depth limitation, and only requires each leaf node to have 1 observation.

Since the random forest contains many trees and does not have a depth limitation, it is incredibly difficult to visualize. In order to better understand the model, a plot showing which variables were selected and resulted in the largest drop in the purity metric (gini index) can be useful. Below are the top 10 most important variables in the model, ranked by the total (normalized) reduction to the gini index.  Intuitively, this plot can be described as showing which variables can be used to best segment the observations into groups that are predominantly one class, either delinquent and non-delinquent.

 

6          Model Evaluation

Now that the models have been fitted, their performance must be evaluated. To do this, the fitted model will first be used to generate predictions on the test set (X_test). Next, the predicted class labels are compared to the actual observed class label (y_test). Three of the most popular classification metrics that can be used to compare the predicted and actual values are recall, precision, and the f1-score. These metrics are calculated for each class, delinquent and not-delinquent.

Recall is calculated for each class as the ratio of events that were correctly predicted. More precisely, it is defined as the number of true positive predictions divided by the number of true positive predictions plus false negative predictions. For example, if the data had 10 delinquent observations and 7 were correctly predicted, recall for delinquent observations would be 7/10 or 70%.

Precision is the number of true positives divided by the number of true positives plus false positives. Precision can be thought of as the ratio of events correctly predicted to the total number of events predicted. In the hypothetical example above, assume that the model made a total of 14 predictions for the label delinquent. If so, then the precision for delinquent predictions would be 7/14 or 50%.

The f1 score is calculated as the harmonic mean of recall and precision: (2(Precision*Recall/Precision+Recall)).

The classification reports for the K-neighbors and decision tree below show the precision, recall, and f1 scores for label 0 (non-delinquent) and 1 (delinquent).

 

There is no silver bullet for choosing a model—often it comes down to the goals of implementation. In this situation, the tradeoff between identifying more delinquent loans at the cost of misclassification can be analyzed with a specific tool called a roc curve.  When the model predicts a class label, a probability threshold is used to make the decision. This threshold is set by default at 50% so that observations with more than a 50% chance of membership belong to one class and vice-versa.

The majority vote (of the neighbor observations or the leaf node observations) determines the predicted label. Roc curves allow us to see the impact of varying this voting threshold by plotting the true positive prediction rate against the false positive prediction rate for each threshold value between 0% and 100%.

The area under the ROC curve (AUC) quantifies the model’s ability to distinguish between delinquent and non-delinquent observations.  A completely useless model will have an AUC of .5 as the probability for each event is equal. A perfect model will have an AUC of 1 as it is able to perfectly predict each class.

To better illustrate, the ROC curves plotting the true positive and false positive rate on the held-out test set as the threshold is changed are plotted below.

7          Model Tuning

Up to this point the models have been built and evaluated using a single train/test split of the data. In practice this is often insufficient because a single split does not always provide the most robust estimate of the error on the test set. Additionally, there are more steps required for model tuning. To solve both of these problems it is common to train multiple instances of a model using cross validation. In K-fold cross validation, the training data that was first created gets split into a third dataset called the validation set. The model is trained on the training set and then evaluated on the validation set. This process is repeated times, each time holding out a different portion of the training set to validate against. Once the model has been tuned using the train/validation splits, it is tested against the held out test set just as before. As a general rule, once data have been used to make a decision about the model they should never be used for evaluation.

8          K-Nearest Neighbors Tuning

Below a grid search approach is used to tune the K-nearest neighbors model. The first step is to define all of the possible hyperparameters to try in the model. For the KNN model, the list nk = [10, 50, 100, 150, 200, 250] specifies the number of nearest neighbors to try in each model. The list is used by the function GridSearchCV to build a series of models, each using the different value of nk. By default, GridSearchCV uses 3-fold cross validation. This means that the model will evaluate 3 train/validate splits of the data for each value of nk. Also specified in GridSearchCV is the scoring parameter used to evaluate each model. In this instance it is set to the metric discussed earlier, the area under the roc curve. GridSearchCV will return the best performing model by default, which can then be used to generate predictions on the test set as before. Many more values of could be specified to search through, and the default minkowski distance could be set to a series of metrics to try. However, this comes at a cost of computation time that increases significantly with each added hyperparameter.

 

In the plot below the mean training and validation scores of the 3 cross-validated splits is plotted for each value of K. The plot indicates that for the lower values of the model was overfitting the training data and causing lower validation scores. As increases, the training score lowers but the validation score increases because the model gets better at generalizing to unseen data.

9               Random Forest Tuning

There are many hyperparameters that can be adjusted to tune the random forest model. We use three in our example: n_estimatorsmax_features, and min_samples_leafN_estimators refers to the number of trees to be created. This value can be increased substantially, so the search space is set to list estimators. Random Forests are generally very robust to overfitting, and it is not uncommon to train a classifier with more than 1,000 trees. Second, the number of variables to be randomly considered at each split can be tuned via max_features. Having a smaller value for the number of random features is helpful for decorrelating the trees in the forest, which is especially useful when multicollinearity is present. We tried a number of different values for max_features, which can be found in the list features. Finally, the number of observations required in each leaf node is tuned via the min_samples_leaf parameter and list samples.

 

The resulting plot, below, shows a subset of the grid search results. Specifically, it shows the mean test score for each number of trees and leaf size when the number of random features considered at each split is limited to 5. The plot demonstrates that the best performance occurs with 500 trees and a requirement of at least 5 observations per leaf. To see the best performing model from the entire grid space the best estimator method can be used.

By default, parameters of the best estimator are assigned to the GridSearch object (cvknc and cvrfc). This object can now be used generate future predictions or predicted probabilities. In our example, the tuned models are used to generate predicted probabilities on the held out test set. The resulting

ROC curves show an improvement in the KNN model from an AUC of .62 to .75. Likewise, the tuned Random Forest AUC improves from .64 to .77.

Predicting loan delinquency using only origination data is not an easy task. Presumably, if significant signal existed in the data it would trigger a change in strategy by MBS investors and ultimately origination practices. Nevertheless, this exercise demonstrates the capability of a machine learning approach to deconstruct such an intricate problem and suggests the appropriateness of using machine learning model to tackle these and other risk management data challenges relating to mortgages and a potentially wide range of asset classes.

Talk Scope


Big Data in Small Dimensions: Machine Learning Methods for Data Visualization

Analysts and data scientists are constantly seeking new ways to parse increasingly intricate datasets, many of which are deemed “high dimensional”, i.e., contain many (sometimes hundreds or more) individual variables. Machine learning has recently emerged as one such technique due to its exceptional ability to process massive quantities of data. A particularly useful machine learning method is t-distributed stochastic neighbor embedding (t-SNE), used to summarize very high-dimensional data using comparatively few variables. T-SNE visualizations allow analysts to identify hidden structures that may have otherwise been missed.

Traditional Data Visualization

The first step in tackling any analytical problem is to develop a solid understanding of the dataset in question. This process often begins with calculating descriptive statistics that summarize useful characteristics of each variable, such as the mean and variance. Also critical to this pursuit is the use of data visualizations that can illustrate the relationships between observations and variables and can identify issues that must be corrected. For example, the chart below shows a series of pairwise plots between a set of variables taken from a loan-level dataset. Along the diagonal axis the distribution of each individual variable is plotted.

The plot above is useful for identifying pairs of variables that are highly correlated as well as variables that lack variance, such as original loan term. When dealing with a larger number of variables, heatmaps like the one below can summarize the relationships between the data in a compact way that is also visually intuitive.

The statistics and visualizations described so far are helpful for summarizing and identifying issues, but they often fall short in telling the entire narrative of the data. One issue that remains is a lack of understanding of the underlying structure of the data. Gaining this understanding is often key to selecting the best approach for problem solving.

Enhanced Data Visualization with Machine Learning

Humans can visualize observations plotted with up to three variables (dimensions), but with the exponential rise in data collection it is now abnormal to only be dealing with a handful of variables. Thankfully, there are new machine learning methods that can help overcome our limited capacity and deliver new insights never seen before.

T-SNE is a type of non-linear dimensionality reduction algorithm. While this is a mouthful, the idea behind it is straightforward: t-SNE takes data that exists in very high dimensions and produces a plot in two or three dimensions that can be observed. The plot in low dimensions is created in such a way that observations close to each other in high dimensions remain close together in low dimensions. Additionally, t-SNE has proven to be good at preserving both the global and local structures present within the data1, which is of critical importance.

The full technical details of t-SNE are beyond the scope of this blog, but a simplified version of the steps for t-SNE are as follows:

  1. Compute the Euclidean distance between each pair of observations in high-dimensional space.
  2. Using a Gaussian distribution, convert the distance between each pair of observations into a probability that represents similarity between the points.
  3. Randomly place the observations into low-dimensional space (usually 2 or 3).
  4. Compute the distance and similarity (as in steps 1 and 2) for each pair of observations in the low-dimensional space. Crucially, in this step a Student t-distribution is used instead of a normal Gaussian.
  5. Using gradient based optimization, iteratively nudge the observations in the low-dimensional space in such a way that the probabilities between pairs of observations are as close as possible to the probabilities in high dimensions.

Two key consideration are the use of the Student t-distribution in step four as opposed to the Gaussian in step two, and the random initialization of the data points in low dimensional space. The t-distribution is critical to the success of the algorithm for multiple reasons, but perhaps most importantly in that it allows clusters that initially start far apart to re-converge2. Given the random initialization of the points in low dimensional space, it is common practice to run the algorithm multiple times with the same parameters to observe the best mapping and ensure that the gradient descent optimization does not get stuck in a local minima.

We applied t-SNE to a loan-level dataset comprised of approximately 40 variables. The loans are a random sample of originations from every quarter dating back to 1999. T-SNE was used to map the data into just three dimensions and the resulting plot was color-coded based on the year of origination.

In the interactive visualization below many clusters emerge. Rotating the figure reveals that some clusters are comprised predominantly of loans within similar origination years (groups of same-colored data points). Other clusters are less well-defined or contain a mix of origination years. Using this same method, we could choose to color loans with other information that we may wish to explore. For example, a mapping showing clusters related to delinquencies, foreclosure, or other credit loss events could prove tremendously insightful. For a given problem, using information from a plot such as this can enhance the understanding of the problem separability and enhance the analytical approach.

Crucial to the t-SNE mapping is a parameter set by the analyst called perplexity, which should be roughly equal to the number of expected nearby neighbors for each data point. Therefore, as the value of perplexity increases, the number of resulting clusters should generally decrease and vice versa. When implementing t-SNE, various perplexity parameters should be tried as the appropriate value is generally not known beforehand. The plot below was produced using the same dataset as before but with a larger value of perplexity. In this plot four distinct clusters emerge, and within each cluster loans of similar origination years group closely together.


Private-Label Securities – Technological Solutions to Information Asymmetry and Mistrust

At its heart, the failure of the private-label residential mortgage-backed securities (PLS) market to return to its pre-crisis volume is a failure of trust. Virtually every proposed remedy, in one way or another, seeks to create an environment in which deal participants can gain reasonable assurance that their counterparts are disclosing information that is both accurate and comprehensive. For better or worse, nine-figure transactions whose ultimate performance will be determined by the manner in which hundreds or thousands of anonymous people repay their mortgages cannot be negotiated on the basis of a handshake and reputation alone. The scale of these transactions makes manual verification both impractical and prohibitively expensive. Fortunately, the convergence of a stalled market with new technologies presents an ideal time for change and renewed hope to restore confidence in the system.

 

Trust in Agency-Backed Securities vs Private-Label Securities

Ginnie Mae guaranteed the world’s first mortgage-backed security nearly 50 years ago. The bankers who packaged, issued, and invested in this MBS could scarcely have imagined the technology that is available today. Trust, however, has never been an issue with Ginnie Mae securities, which are collateralized entirely by mortgages backed by the federal government—mortgages whose underwriting requirements are transparent, well understood, and consistently applied.

Further, the security itself is backed by the full faith and credit of the U.S. Government. This degree of “belt-and-suspenders” protection afforded to investors makes trust an afterthought and, as a result, Ginnie Mae securities are among the most liquid instruments in the world.

Contrast this with the private-label market. Private-label securities, by their nature, will always carry a higher degree of uncertainty than Ginnie Mae, Fannie Mae, and Freddie Mac (i.e., “Agency”) products, but uncertainty is not the problem. All lending and investment involves uncertainty. The problem is information asymmetry—where not all parties have equal access to the data necessary to assess risk. This asymmetry makes it challenging to price deals fairly and is a principal driver of illiquidity.

 

Using Technology to Garner Trust in the PLS Market

In many transactions, ten or more parties contribute in some manner to verifying and validating data, documents, or cash flow models. In order to overcome asymmetry and restore liquidity, the market will need to refine (and in some cases identify) technological solutions to, among other challenges, share loan-level data with investors, re-envision the due diligence process, and modernize document custody.

 

Loan-Level Data

During SFIG’s Residential Mortgage Finance symposium last month, RiskSpan moderated a panel that featured significant discussion around loan-level disclosures. At issue was whether the data required by the SEC’s Regulation AB provided investors with all the information necessary to make an investment decision. Specifically debated was the mortgaged property’s zip code, which provides investors valuable information on historical valuation trends for properties in a given geographic area.

Privacy advocates question the wisdom of disclosing full, five-digit zip codes. Particularly in sparsely populated areas where zip codes contain a relatively small number of addresses, knowing the zip code along with the home’s sale price and date (which are both publicly available) can enable unscrupulous data analysts to “triangulate” in on an individual borrower’s identity and link the borrower to other, more sensitive personal information in the loan-level disclosure package.

The SEC’s “compromise” is to require disclosing only the first two digits of the zip code, which provide a sense of a property’s geography without the risk of violating privacy. Investors counter that two-digit zip codes do not provide nearly enough granularity to make an informed judgment about home-price stability (and with good reason—some entire states are covered entirely by a single two-digit zip code).

The competing demands of disclosure and privacy can be satisfied in large measure by technology. Rather than attempting to determine which individual data fields should be included in a loan-level disclosure (and then publishing it on the SEC’s EDGAR site for all the world to see) the market ought to be able to develop a technology where a secure, encrypted, password-protected copy of the loan documents (including the loan application, tax documents, pay-stubs, bank statements, and other relevant income, employment, and asset verifications) is made available on a need-to-know basis to qualified PLS investors who share in the responsibility for safeguarding the information.

 

Due Diligence Review

Technologically improving the transparency of the due diligence process to investors may also increase investor trust, particularly in the representation and warranty review process. Providing investors with a secure view of the loan-level documentation used to underwrite and close the underlying mortgage loan, as described above, may reduce the scope of due diligence review as it exists in today’s market. Technology companies, which today support initiatives such as Fannie Mae’s “Day 1 Certainty” program, promise to further disrupt the due diligence process in the future. Through automation, the due diligence process becomes less burdensome and fosters confidence in the underwriting process while also reducing costs and bringing representation and warranty relief.

Today’s insistence on 100% file reviews in many cases is perhaps the most obvious evidence of the lack of trust across transactions. Investors will likely always require some degree of assurance that they are getting what they pay for in terms of collateral. However, an automated verification process for income, assets, and employment will launch the industry forward with investor confidence. Should any reconciliation of individual loan file documentation with data files be necessary, results of these reconciliations could be automated and added to a secure blockchain accessible only via private permissions. Over time, investors will become more comfortable with the reliability of the electronic data files describing the mortgage loans submitted to them.

The same technology could be implemented to allow investors to view supporting documents when reps and warrants are triggered and a review of the underlying loan documents needs to be completed.

 

Document Custody

Smart document technologies also have the potential to improve the transparency of the document custody process. At some point the industry is going to have to move beyond today’s humidity-controlled file cabinets and vaults, where documents are obtained and viewed only on an exception basis or when loans are paid off. Adding loan documents that have been reviewed and accepted by the securitization’s document custodian to a secure, permissioned blockchain will allow investors in the securities to view and verify collateral documents whenever questions arise without going to the time and expense of retrieving paper from the custodian’s vault.

——————————-

Securitization makes mortgages and other types of borrowing affordable for a greater population by leveraging the power of global capital markets. Few market participants view mortgage loan securitization dominated by government corporations and government-sponsored enterprises as a desirable permanent solution. Private markets, however, are going to continue to lag markets that benefit from implicit and explicit government guarantees until improved processes, supported by enhanced technologies, are successful in bridging gaps in trust and information asymmetry.

With trust restored, verified by technology, the PLS market will be better positioned to support housing financing needs not supported by the Agencies.

Get a Demo


Back-Testing: Using RS Edge to Validate a Prepayment Model

Most asset-liability management (ALM) models contain an embedded prepayment model for residential mortgage loans. To gauge their accuracy, prepayment modelers typically run a back-test comparing model projections to the actual prepayment rates observed. A standard test is to run a portfolio of loans as of a year ago using the actual interest rates experienced during this time as well as any additional economic factors used by the model such as home price appreciation or the unemployment rate. This methodology isolates the model’s ability to estimate voluntary payoffs from its ability to forecast the economic variables.

The graph below was produced from such a back-test. The residential mortgage loans in the bank’s portfolio as of 10/31/2016 were run through the ALM model (projections) and compared with the observed speeds (actuals). It is apparent that the model did not do a particularly good job forecasting the actual CPRs, as the mean absolute error is 5.0%. Prepayment model validators typically prefer to see mean absolute error rates no higher than 1 to 2%.

Does this mean there is something unique with the bank’s loan portfolio or servicing practices that would cause prepays to deviate from expectations, or does the prepayment model require calibration?

Dissecting the Problem

One strategy is to compare the bank’s prepayment experience to that of the market (see below). The “market” is the universe of comparable loans, in this case residential, conventional loans. This assessment should indicate whether the bank’s portfolio is unique or if it behaves similar to the market. Although this comparison looks better, there are still some material differences, especially at the beginning and end of the time series. 

Examining the portfolio composition reveals a number of differences which could be the source of the discrepancy. For example:

  • Larger-balance loans have a greater refinance incentive.
  • California loans historically prepay faster than the rest of the country, while New York loans are historically slower.
  • Broker and correspondent loans typically pay faster than retail originations.

To compensate, the next step is to adjust the market portfolio to more closely mirror the attributes of the bank’s portfolio. Fine-tuning the “market” so that it better aligns with the bank’s channel and geographic breakout, as well as its larger average loan size, results in the following adjusted prepayment speeds.

Conclusion

Prepayments for the bank’s mortgage portfolio track the market speeds reasonably well with no adjustments. Compensating for the differences in composition related to channel, geography, and loan size tracks even better and results in a mean absolute error of only 1.1%. This indicates that there is nothing unique or idiosyncratic with the bank’s portfolio that would cause projections from a market-based prepayment model to deviate significantly from the observed speeds. Consequently, the ALM prepayment model likely needs adjustments to its tuning parameters to better capture the current environment.


Why Model Validation Does Not Eliminate Spreadsheet Risk

Model risk managers invest considerable time in determining which spreadsheets qualify as models, which are end-user computing (EUC) applications, and which are neither. Seldom, however, do model risk managers consider the question of whether a spreadsheet is the appropriate tool for the task at hand.

Perhaps they should start.

Buried in the middle of page seven of the joint Federal Reserve/OCC supervisory guidance on model risk management is this frequently overlooked principle:

“Sound model risk management depends on substantial investment in supporting systems to ensure data and reporting integrity, together with controls and testing to ensure proper implementation of models, effective systems integration, and appropriate use.”

It brings to mind a fairly obvious question: What good is a “substantial investment” in data integrity surrounding the modeling process when the modeling itself is carried out in Excel? Spreadsheets are useful tools, to be sure, but they meet virtually none of the development standards to which traditional production systems are held. What percentage of “spreadsheet models” are subjected to the rigors of the software development life cycle (SDLC) before being put into use?

 

Model Validation vs. SDLC

More often than not, and usually without realizing it, banks use model validation as a substitute for SDLC when it comes to spreadsheet models. The main problem with this approach is that SDLC and model validation are complementary processes and are not designed to stand in for one another. SDLC is a primarily forward-looking process to ensure applications are implemented properly. Model validation is primarily backward looking and seeks to determine whether existing applications are working as they should.

SDLC includes robust planning, design, and implementation—developing business and technical requirements and then developing or selecting the right tool for the job. Model validation may perform a few cursory tests designed to determine whether some semblance of a selection process has taken place, but model validation is not designed to replicate (or actually perform) the selection process.

This presents a problem because spreadsheet models are seldom if ever built with SDLC principles in mind. Rather, they are more likely to evolve organically as analysts seek increasingly innovative ways of automating business tasks. A spreadsheet may begin as a simple calculator, but as analysts become more sophisticated, they gradually introduce increasingly complex functionality and coding into their spreadsheet. And then one day, the spreadsheet gets picked up by an operational risk discovery tool and the analyst suddenly becomes a model owner. Not every spreadsheet model evolves in such an unstructured way, of course, but more than a few do. And even spreadsheet-based applications that are designed to be models from the outset are seldom created according to a disciplined SDLC process.

I am confident that this is the primary reason spreadsheet models are often so poorly documented. They simply weren’t designed to be models. They weren’t really designed at all. A lot of intelligent, critical thought may have gone into their code and formulas, but little if any thought was likely given to the question of whether a spreadsheet is the best tool for what the spreadsheet has evolved to be able to do.
 

Challenging the Spreadsheets Themselves

Outside of banking, a growing number of firms are becoming wary of spreadsheets and attempting to move away from them. A Wall Street Journal article last week cited CFOs from companies as diverse as P.F. Chang’s China Bistro Inc., ABM Industries, and Wintrust Financial Corp. seeking to “reduce how much their finance teams use Excel for financial planning, analysis and reporting.”

Many of the reasons spreadsheets are falling out of favor have little to do with governance and risk management. But one core reason will resonate with anyone who has ever attempted to validate a spreadsheet model. Quoting from the article: “Errors can bloom because data in Excel is separated from other systems and isn’t automatically updated.”

It is precisely this “separation” of spreadsheet data from its sources that is so problematic for model validators. Even if a validator can determine that the input data in the spreadsheet is consistent with the source data at the time of validation, it is difficult to ascertain whether tomorrow’s input data will be. Even spreadsheets that pull input data in via dynamic linking or automated feeds can be problematic because the code governing the links and feeds can so easily become broken or corrupted.
 

An Expanded Way of Thinking About “Conceptual Soundness”

Typically, when model validators speak of evaluating conceptual soundness, they are referring to the model’s underlying theory, how its variables were selected, the reasonableness of its inputs and assumptions, and how well everything is documented. In diving into these details, it is easy to overlook the supervisory guidance’s opening sentence in the Evaluation of Conceptual Soundness section: “This element involves assessing the quality of the model design and construction.”

How often, in assessing a spreadsheet model’s design and construction, do validators ask, “Is Excel even the right application for this?” Not very often, I suspect. When an analyst is assigned to validate a model, the medium is simply a given. In a perfect world, model validators would be empowered to issue a finding along the lines of, “Excel is not an appropriate tool for a high-risk production model of this scope and importance.” Practically speaking, however, few departments will be willing to upend the way they work and analyze data in response to a model validation finding. (In the WSJ article, it took CFOs to affect that kind of change.)

Absent the ability to nudge model owners away from spreadsheets entirely, model validators would do well to incorporate certain additional “best practices” checks into their validation procedures when the model in question is a spreadsheet. These might include the following:

  • Incorporation of a cover sheet on the first tab of the workbook that includes the model’s name, the model’s version, a brief description of what the model does, and a table of contents defining and describing the purpose of each tab
  • Application of a consistent color key so that inputs, assumptions, macros, and formulas can be easily identified
  • Grouping of inputs by source, e.g., raw data versus transformed data versus calculations
  • Grouping of inputs, processing, and output tabs together by color
  • Separate instruction sheets for data import and transformation

Spreadsheets present unique challenges to model validators. By accounting for the additional risk posed by the nature of spreadsheets themselves, model risk managers can contribute value by identifying situations where the effectiveness of sound data, theory, and analysis is blunted by an inadequate tool.


Non-Qualified Mortgage Securitization Market

Since 2015, a new tier of the private-label residential mortgage-backed securities (PLS) market has emerged, with securities collateralized by non-qualified mortgage (non-QM) loans. These securities enable mortgage lenders to serve borrowers with non-traditional credit profiles.

The financial crisis ushered in a sharp reduction in mortgage credit available to certain groups of borrowers. Funding sources, such as the PLS market, which once provided access for borrowers with credit blemishes, non-traditional income sources, or the desire for expanded product features were virtually eliminated.

The limited issuance of private-label RMBS since the financial crisis has generally consisted of new origination jumbo “prime” mortgage loans. These securities have included loans that meet the “qualified mortgage” (QM) standard with strong credit scores, pristine payment history, and fully documented income and assets. The non-QM market addresses a previously underserved market and reflects the expanding credit policies of many institutions.

What is a Non-Qualified Mortgage Loan?

Since the crisis, standards governing the majority of mortgage loan production have generally followed the restrictive credit criteria implemented by the GSEs. This has prompted some consumers and lenders to seek alternative products that may not meet the “qualified mortgage” requirements or the high-credit-quality standards of the GSEs. These tightened credit standards have restricted home ownership opportunities for certain groups of consumers. These groups include self-employed individuals and borrowers with weaker credit or a recent credit event, such as a foreclosure, short sale, or deed in lieu of foreclosure. While many of these potential borrowers can meet the criteria of the ‘ability-to-repay’ rule and have taken steps to improve their credit standing, they nevertheless are not able to meet the very high credit standards that have emerged since the financial crisis.

To meet the demand of these underserved borrowers, a number of lenders have begun to expand their credit parameters. As lenders have sought funding sources for these non-QM originations, a new tier of the PLS market has emerged. While it is difficult to create generic categories that define the origination practices of the various lenders, some high-level similarities can be observed in the following non-QM products and programs established to meet borrower demand:

  • Alternative Documentation – the borrower’s income is assessed through sources other than available tax returns, business earnings, or Appendix Q requirements. Many non-QM lenders offer variations of bank statement programs (e.g., 24-month review and 12-month review) to determine a self-employed borrower’s ability to repay through analysis of their monthly cash flow.
  • Borrowers with Non-Standard Credit Profile
    • Expanded Credit – borrowers with weaker FICO scores, a recent delinquency on a mortgage, a debt-to-income ratio slightly above the qualified mortgage requirements, or higher loan-to-value ratios.
    • Prior Credit Event – borrowers with recent foreclosure, bankruptcy, or other loss mitigation disposition that have not met the seasoning requirements established by GSE guidelines.
  • Investor Program – financing for investors purchasing 1-4 family rental properties that may not meet GSE guidelines.
  • Foreign National Program – financing for borrowers that are not permanent residents or do not have credit history in the United States.
  • Non-QM Product Features – financing for products that do not meet qualified mortgage guidelines, such as loans with interest-only or balloon features.

Each of these programs evaluate many aspects of the loan during the underwriting process but primarily rely on an evaluation of the borrower’s ability to repay the loan to predict loan performance. These mortgage loan products and programs attempt to meet the housing finance needs of underserved borrowers while assessing the increased risk associated with the expanded lending standards.

Non-QM securities are likely to experience more performance volatility and higher realized losses than their jumbo prime counterparts in negative economic scenarios. This is due to weaker credit profiles among non-QM borrowers, product features that do not meet “qualified mortgage” requirements (e.g., interest-only, balloon payments, prepayment penalties), and alternative methods to assess the borrower’s ability-to-repay. Investors in these securities are challenged to assess the magnitude of the increased risk of loss (net of protection provided by credit enhancement levels) versus the incremental yield provided by the securities.

Overview of Non-Prime Issuers

The non-QM sector has been created and led by non-bank financial institutions that have filled the void left by regulated banking entities that have reduced their footprint in the mortgage market. Most financial institutions that have entered the non-QM mortgage space during the past five years have received financial backing from asset managers, hedge funds or private equity firms. Securitization activity for this sector of the PLS market began in 2015 and has increased slowly since. The table below reflects the strong growth in issuance activity for non-QM securitizations between January 2015 and September 2017:

Next Market Phase

The push by mortgage lenders to expand their credit criteria and provide consumers with “affordability” products combined with investor demand for higher yielding investments set the stage for the financial crisis of 2007-2008. Bolstered by strong demand from investors for mortgage-backed securities, mortgage lenders expanded underwriting guidelines to allow borrowers with weaker credit profiles, smaller down-payment amounts, and limited or no verification of income or assets to qualify for mortgages. Weakened underwriting standards were combined with product features that slowed repayment of principal through interest-only, negative amortization and loan term extension features.

History has shown that the combination of these credit guideline expansions with weaker PLS processes resulted in historic losses. As a reaction to the abysmal credit performance of mortgage loans originated between 2005 and 2007, credit availability in the mortgage market contracted dramatically. The swing of the credit pendulum resulted in significant improvement in the credit performance of loans originated since 2008. This improved performance, however, came at the cost of shutting a large segment of the population out of the mortgage market. Now almost a decade later, the pendulum appears to be swinging back in favor expanding credit criteria to accommodate more non-QM borrowers. Time will tell whether the market has learned and will remember the lessons of the financial crisis.


Tuning Machine Learning Models

Tuning is the process of maximizing a model’s performance without overfitting or creating too high of a variance. In machine learning, this is accomplished by selecting appropriate “hyperparameters.”

Hyperparameters can be thought of as the “dials” or “knobs” of a machine learning model. Choosing an appropriate set of hyperparameters is crucial for model accuracy, but can be computationally challenging. Hyperparameters differ from other model parameters in that they are not learned by the model automatically through training methods. Instead, these parameters must be set manually. Many methods exist for selecting appropriate hyperparameters. This post focuses on three:

  • Grid Search
  • Random Search
  • Bayesian Optimization

Grid Search

Grid Search, also known as parameter sweeping, is one of the most basic and traditional methods of hyperparametric optimization. This method involves manually defining a subset of the hyperparametric space and exhausting all combinations of the specified hyperparameter subsets. Each combination’s performance is then evaluated, typically using cross-validation, and the best performing hyperparametric combination is chosen.

For example, say you have two continuous parameters α and β, where manually selected values for the parameters are the following:

equations.PNG

Then the pairing of the selected hyperparametric values, H, can take on any of the following:

Grid search will examine each pairing of α and β to determine the best performing combination. The resulting pairs, H, are simply each output that results from taking the Cartesian product of α and β. While straightforward, this “brute force” approach for hyperparameter optimization has some drawbacks. Higher-dimensional hyperparametric spaces are far more time consuming to test than the simple two-dimensional problem presented here. Also, because there will always be a fixed number of training samples for any given model, the model’s predictive power will decrease as the number of dimensions increases. This is known as Hughes phenomenon.

Random Search

Random search methods resemble grid search methods but tend to be less expensive and time consuming because they do not examine every possible combination of parameters. Instead of testing on a predetermined subset of hyperparameters, random search, as its name implies, randomly selects a chosen number of hyperparametric pairs from a given domain and tests only those. This greatly simplifies the analysis without significantly sacrificing optimization. For example, if the region of hyperparameters that are near optimal occupies at least 5% of the grid, then random search with 60 trials will find that region with high probability (95%).

equation 2.PNG

To illustrate, imagine a 15 x 30 grid of two hyperparameter values and their resulting scores ranging from 0-10, where 10 is the most optimal hyperparametric pairing (Table 1).

Table 1 – Grid of Hyperparameter Values and Scores

Highlighted in green are the 21 pairings with the highest scores out of the 450 total combinations. Let’s take these 21 pairings to be our desired target range. What if we were to sample points from this grid to see if any lands within the target? Each random draw has a 21/450 ≈ 4.67% of doing so. If we randomly select 60 points, all independent of one another, then the probability that none of them land in the target, or in other words all of them miss, is
equation 3.PNG

Therefore, the probability that at least one of them succeeds in hitting the desired interval is 1 minus that quantity.

In this particular example, sampling just 60 points from our hyperparameter space yields over a 94% chance of selecting a hyperparameter value within our desired interval near the maximum value.  In other words, in a scenario with a 5% desired interval around the true maximum, sampling just 60 points will yield a sufficient hyperparameter pairing 95% of the time.

There are two main benefits to using the random search method. The first is that a budget can be chosen independent of the number of parameters and possible values. Based on how much time and computing resources you have available, random search allows you to choose a sample size that conforms to a budget but still allows for a representative sample of the hyperparameter space. The second benefit is that adding parameters that do not influence performance does not decrease efficiency.

Bayesian Optimization

The idea behind Bayesian Optimization is fundamentally different from grid and random search. This process builds a probabilistic model for a given function and analyzes this model to make decisions about where to next evaluate the function. There are two main components under the Bayesian optimization framework.

  • A prior function that captures the behavior of the unknown objective function and an observation model that describes the data generation mechanism.
  • A loss function, or an acquisition function, that describes how optimal a sequence of queries are, usually taking the form of regret.

The most common selection for a prior function in Bayesian Optimization is the Gaussian process (GP) prior. This is a particular kind of statistical model where observations occur in a continuous domain. In a Gaussian process, every point in the defined continuous input space is associated with a normally distributed random variable. Additionally, every finite linear combination of those random variables has a multivariate normal distribution.

There are a number of options when choosing an acquisition function. Choosing an acquisition function requires choosing a trade-off in exploration of the entire search space vs. exploitation of current promising areas.

Probability of Improvement

One approach is to choose an improvement-based acquisition function, which favors points that are likely to improve upon an incumbent target. This strategy involves maximizing the probability of improving (PI) over the best current value. If using a Gaussian posterior distribution, this can be calculated as follows:

equation 5.PNG

Where,

equation 6.PNG

In each iteration, the probability of improving is maximized to select the next query point. Although the probability of improvement can perform very well when the target is known, using this method for an unknown target causes the PI to lose reliability.

Expected Improvement

Another strategy involves the case of attempting to maximize the expected improvement (EI) over the current best. Unlike the probability of improvement function, the expected improvement also incorporates the amount of improvement. Assuming a Gaussian process, this can be calculated as follows:

equation 7.PNG

Gaussian Process Upper Confidence Bound

Another method takes the idea of exploiting lower confidence bounds (upper when considering the maximization) to construct acquisition functions that minimize regret over the course of their optimization. This requires the user to define an additional tuning value, . This lower confidence bound (LCB) for a Gaussian process is defined as follows:

equation 8.PNG

There are a few limitations to consider when choosing Bayesian Optimization over other hyperparameter optimization methods. The power of the Gaussian process depends highly on the covariance function, and it is not always clear what the appropriate covariance function choice should be. Another factor to consider is that the function evaluation itself may involve a time-consuming optimization procedure. It’s important to find the best hyperparameters for your model, but in many cases, the complexity associated with finding the best hyperparameters using Bayesian Optimization may exceed the project’s established budget. If possible, one should always consider utilizing parallel computing when performing this technique to maximize computing resources and cut back on time.

Conclusion

Choosing an appropriate set of hyperparameters is crucial for machine learning model accuracy. We have discussed three different approaches for selecting hyperparameter values and the trade-offs associated with choosing one optimization method over another. Time, budget, and computing abilities are all factors to consider when choosing a method. Small hyperparameter spaces and lax restraints for budget and computing resources may make Grid Search the best option. For larger hyperparameter spaces or more computing constraints, a simple random search with a sufficient sample size or a Bayesian optimization technique may be more appropriate.



AML Models: Applying Model Validation Principles to Non-Models

Anti-money-laundering (AML) solutions have no business being classified as models. To be sure, AML “models” are sophisticated, complex, and vitally important. But it requires a rather expansive interpretation of the OCC/Federal Reserve/FDIC1 definition of the term model to realistically apply the term to AML solutions.

Supervisory guidance defines model as “a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates.”

While AML compliance models are consistent with certain elements of that definition, it is a stretch to argue that these elaborate, business-rule engines are generating outputs that qualify as “quantitative estimates.” They flag transactions and the people who make them, but they do not estimate or predict anything quantitative.

We could spend a lot more time arguing that AML tools (including automated OFAC and other watch-list checks) are not technically models. But in the end, these arguments are moot if an examining regulator holds a differing view. If a bank’s regulator declares the bank’s AML applications to be models and orders that they be validated, then presenting a well-reasoned argument about how these tools don’t rise to the technical definition of a model is not the most prudent course of action (probably).

 

Tailoring Applicable Model Validation Principles to AML Models

What makes it challenging to validate AML “models” is not merely the additional level of effort, it’s that most model validation concepts are designed to evaluate systems that generate quantitative estimates. Consequently, in order to generate a model validation report that will withstand scrutiny, it is important to think of ways to adapt the three pillars of model validation—conceptual soundness review, benchmarking, and back-testing—to the unique characteristics of a non-model.

 

Conceptual Soundness of AML Solutions

The first pillar of model validation—conceptual soundness—is also its most universally applicable. Determining whether an application is well designed and constructed, whether its inputs and assumptions are reasonably sourced and defensible, whether it is sufficiently documented, and whether it meets the needs for which it was developed is every bit as applicable to AML solutions, EUCs and other non-predictive tools as it is to models.

For AML ”models,” a conceptual soundness review generally encompasses the following activities:

  • Documentation review: Are the rule and alert definitions and configurations identified? Are they sufficiently explained and justified? This requires detailed documentation not only from the application vendor, but also from the BSA/AML group within the bank that uses it.
  • Transaction verification: Verifying that all transactions and customers are covered and evaluated by the tool.
  • Risk assessment review: Evaluating the institution’s risk assessment methodology and whether the application’s configurations are consistent with it.
  • Data review: Are all data inputs mapped, extracted, transformed, and loaded correctly from their respective source systems into the AML engine?
  • Watchlist filtering: Are watchlist criteria configured correctly? Is the AML model receiving all the information it needs to generate alerts?

 

Benchmarking (and Process Verification) of AML Tools

Benchmarking is primarily geared toward comparing a model’s uncertain outputs against the uncertain outputs of a challenger model. AML outputs are not particularly well-suited to such a comparison. As such, benchmarking one AML tool against another is not usually feasible. Even in the unlikely event that a validator has access to a separate, “challenger” AML “model,” integrating it with all of a bank’s necessary customer and transaction systems and making sure it works is a months-long project. The nature of AML monitoring—looking at every customer and every single transaction—makes integrating a second, benchmarking engine highly impractical. And even if it were practical, the functionality of any AML system is primarily determined by its calibration and settings. Once the challenger system has been configured to match the system being tested, the objective of the benchmarking exercise is largely defeated.

So, now what? In a model validation context, benchmarking is typically performed and reported in the context of a broader “process verification” exercise—tests to determine whether the model is accomplishing what it purports to. Process verification has broad applicability to AML reviews and typically includes the following components:

  • Above-the-line testing: An evaluation of the alerts triggered by the application and identification of any “false positives” (Type I error).
  • Below-the-line testing: An evaluation of all bank activity to determine whether any transactions that should have been flagged as alerts were missed by the application. These would constitute “false negatives” (Type II error).
  • Documentation comparison: Determination of whether the application is calculating risk scores in a manner consistent with documented methodology.

 

Back-Testing (and Outcomes Analysis) of AML Applications

Because AML applications are not designed to predict the future, the notion of back-testing does not really apply to them. However, in the model validation context, back-testing is typically performed as part of a broader analysis of model outcomes. Here again, a number of AML tests apply, including the following:

  • Rule relevance: How many rules are never triggered? Are there any rules that, when triggered, are always overridden by manual review of the alert?
  • Schedule evaluation: Evaluation of the AML system’s performance testing schedule.
    Distribution analysis: Determining whether the distribution of alerts is logical in light of typical customer transaction activity and the bank’s view of its overall risk profile.
  • Management reporting: How do the AML system’s outputs, including the resulting Suspicious Activity Reports, flow into management reports? How are these reports reviewed for accuracy, presented, and archived?
  • Output maintenance: How are reports created and maintained? How is AML system output archived for reporting and ongoing monitoring purposes?

 

Testing AML Models: Balancing Thoroughness and Practicality

Generally speaking, model validators are given to being thorough. When presented with the task of validating an AML “model,” they are likely to look beyond the limitations associated with applying model validation principles to non-models and focus on devising tests designed to assess whether the AML solution is working as intended.

Left to their own devices, many model validation analysts will likely err on the side of doing more than is necessary to fulfill the requirements of an AML model validation. Devising an approach that aligns effective challenge testing with the three defined pillars of model validation has a dual benefit. It results in a model validation report that maps back to regulatory guidance and is therefore more likely to stand up to scrutiny. It also helps confine the universe of potential testing to only those areas that require testing. Restricting testing to only what is necessary and then thoroughly pursuing that narrowly defined set of tests is ultimately the key to maintaining the effectiveness and efficiency of AML testing in particular and of model risk management programs as a whole.

 


[1] On June 7, 2017, the FDIC formally adopted the Supervisory Guidance previously set forth jointly by the OCC (2011-12) and Federal Reserve (SR 11-7).


Get Started
Log in

Linkedin   

risktech2024