Model Validation Archives

Applying Machine Learning to Conventional Model Validations

In addition to transforming the way in which financial institutions approach predictive modeling, machine learning techniques are beginning to find their way into how model validators assess conventional, non-machine-learning predictive models. While the array of standard statistical techniques available for validating predictive models remains impressive, the advent of machine learning technology has opened new avenues of possibility for expanding the rigor and depth of insight that can be gained in the course of model validation. In this blog post, we explore how machine learning, in some circumstances, can supplement a model validator’s efforts related to:

Outlier detection on model estimation data
Clustering of data to better understand model accuracy
Feature selection methods to determine the appropriateness of independent variables
The use of machine learning algorithms for benchmarking
Machine learning techniques for sensitivity analysis and stress testing

Outlier Detection

Conventional model validations include, when practical, an assessment of the dataset from which the model is derived. (This is not always practical—or even possible—when it comes to proprietary, third-party vendor models.) Regardless of a model’s design and purpose, virtually every validation concerns itself with at least a cursory review of where these data are coming from, whether their source is reliable, how they are aggregated, and how they figure into the analysis.

Conventional model validation techniques sometimes overlook (or fail to look deeply enough at) the question of whether the data population used to estimate the model is problematic. Outliers—and the effect they may be having on model estimation—can be difficult to detect using conventional means. Developing descriptive statistics and identifying data points that are one, two, or three standard deviations from the mean (i.e., extreme value analysis) is a straightforward enough exercise, but this does not necessarily tell a modeler (or a model validator) which data points should be excluded.

Machine learning modelers use a variety of proximity and projection methods for filtering outliers from their training data. One proximity method employs the K-means algorithm, which groups data into clusters centered around defined “centroids,” and then identifies data points that do not appear to belong to any particular cluster. Common projection methods include multi-dimensional scaling, which allows analysts to view multi-dimensional relationships among multiple data points in just two or three dimensions. Sophisticated model validators can apply these techniques to identify dataset problems that modelers may have overlooked.
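As a rough illustration of the proximity approach, the sketch below runs a bare-bones K-means and flags records that sit unusually far from their nearest centroid. The data, the simplistic centroid initialization, and the cutoff of three times the median distance are all assumptions made for the example, not a prescribed method.

```python
import math
import statistics

def kmeans(points, centroids, iters=20):
    """Plain Lloyd's algorithm starting from the supplied initial centroids."""
    centroids = list(centroids)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of the points assigned to it.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def flag_outliers(points, centroids, multiple=3.0):
    """Indices of points unusually far from their nearest centroid (assumed rule)."""
    dists = [min(math.dist(p, c) for c in centroids) for p in points]
    cutoff = multiple * statistics.median(dists)
    return [i for i, d in enumerate(dists) if d > cutoff]

# Hypothetical two-feature estimation data: two tight groups plus one stray record.
data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2),
        (8.2, 7.9), (7.9, 8.1), (8.1, 8.2), (4.5, 15.0)]
centers = kmeans(data, centroids=data[:2])  # simplistic initialization for the demo
outliers = flag_outliers(data, centers)
print(outliers)
```

Note that the stray record still pulls its cluster's centroid toward itself, which is one reason a validator would treat a flag like this as a prompt for investigation rather than an automatic exclusion.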

Data Clustering

The tendency of data to cluster presents another opportunity for model validators. Machine learning techniques can be applied to determine the relative compactness of individual clusters and how distinct individual clusters are from one another. Clusters that do not appear well defined and blur into one another are evidence of a potentially problematic dataset—one that may result in non-existent patterns being identified in random data. Such clustering could be the basis of any number of model validation findings.
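One common way to put a number on compactness and separation is the silhouette coefficient, sketched below. The post does not name a specific measure, so treat this choice, and the toy data, as assumptions. Values near 1 suggest tight, well-separated clusters; values near 0 suggest clusters that blur into one another.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (requires at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

labels = [0, 0, 0, 1, 1, 1]
# Hypothetical data: one well-separated set and one where the clusters interleave.
tight = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (10.0, 10.0), (10.2, 9.9), (9.9, 10.1)]
blurred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (1.5, 1.5), (2.5, 2.5), (3.5, 3.5)]

print(round(silhouette(tight, labels), 3))
print(round(silhouette(blurred, labels), 3))
```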

Feature (Variable) Selection

What conventional predictive modelers typically refer to as variables are commonly referred to by machine learning modelers as features. Features and variables serve essentially the same function, but the way in which they are selected can differ. Conventional modelers tend to select variables using a combination of expert judgment and statistical techniques. Machine learning modelers tend to take a more systematic approach that includes stepwise procedures, criterion-based procedures, lasso and ridge regression, and dimensionality reduction. These methods are designed to ensure that machine learning models achieve their objectives in the simplest way possible, using the fewest possible number of features, and avoiding redundancy. Because model validators frequently encounter black-box applications, directly applying these techniques is not always possible. In some limited circumstances, however, model validators can add to the robustness of their validations by applying machine learning feature selection methods to determine whether conventionally selected model variables resemble those selected by these more advanced means (and if not, why not).
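To make the lasso idea concrete, here is a minimal coordinate-descent sketch on standardized inputs. The feature names, data, and penalty value are hypothetical; in practice a validator would more likely reach for an established library. The point is only that the L1 penalty drives the coefficient on an uninformative variable to exactly zero, which gives a list of "selected" features to compare against the modeler's choices.

```python
import statistics

def standardize(col):
    mu, sd = statistics.fmean(col), statistics.pstdev(col)
    return [(v - mu) / sd for v in col]

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso(columns, y, lam, sweeps=100):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1 on standardized columns."""
    n = len(y)
    X = [standardize(c) for c in columns]
    y_mean = statistics.fmean(y)
    resid = [v - y_mean for v in y]          # residual with all coefficients at zero
    beta = [0.0] * len(X)
    for _ in range(sweeps):
        for j, xj in enumerate(X):
            # Correlation of feature j with the residual that excludes its own effect.
            rho = sum(x * (r + x * beta[j]) for x, r in zip(xj, resid)) / n
            new = soft_threshold(rho, lam)
            resid = [r - x * (new - beta[j]) for x, r in zip(xj, resid)]
            beta[j] = new
    return beta

# Hypothetical data: the target depends only on the first feature.
names = ["ltv", "random_noise"]
ltv = [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5]
noise = [1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0]
target = [2.0 * v for v in ltv]

beta = lasso([ltv, noise], target, lam=0.1)
selected = [name for name, b in zip(names, beta) if b != 0.0]
print(selected)
```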

Benchmarking Applications

Identifying and applying an appropriate benchmarking model can be challenging for model validators. Commercially available alternatives are often difficult to obtain cost-effectively, and building challenger models from scratch can be time-consuming and problematic, particularly when all they do is replicate what the model in question is doing.

While not always feasible, building a machine learning model using the same data that was used to build a conventionally designed predictive model presents a “gold standard” benchmarking opportunity for assessing the conventionally developed model’s outputs. Where significant differences are noted, model validators can investigate the extent to which differences are driven by data/outlier omission, feature/variable selection, or other factors.

Sensitivity Analysis and Stress Testing

The sheer quantity of high-dimensional data very large banks need to process in order to develop their stress testing models makes conventional statistical analysis both computationally expensive and problematic. (This is sometimes referred to as the “curse of dimensionality.”) Machine learning feature selection techniques, described above, are frequently useful in determining whether variables selected for stress testing models are justifiable.

Similarly, machine learning techniques can be employed to isolate, in a systematic way, those variables to which any predictive model is most and least sensitive. Model validators can use this information to quickly ascertain whether these sensitivities are appropriate. A validator, for example, may want to take a closer look at a credit model that is revealed to be more sensitive to, say, zip code, than it is to credit score, debt-to-income ratio, loan-to-value ratio, or any other individual variable or combination of variables. Machine learning techniques make it possible for a model validator to assess a model’s relative sensitivity to virtually any combination of features and make appropriate judgments.
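A simple version of this idea is a one-at-a-time shock: bump each input, hold the others at baseline, and rank the inputs by how far the output moves. The scoring function, its coefficients, and the shock sizes below are invented for illustration and do not represent any actual credit model.

```python
import math

def toy_credit_model(x):
    """Hypothetical probability-of-default score (made-up coefficients)."""
    z = (-3.0
         - 0.01 * (x["credit_score"] - 700)
         + 0.03 * (x["dti"] - 35)
         + 0.02 * (x["ltv"] - 80))
    return 1.0 / (1.0 + math.exp(-z))

def sensitivities(model, baseline, bumps):
    """Change in model output when each input is shocked on its own."""
    base = model(baseline)
    result = {}
    for name, bump in bumps.items():
        shocked = dict(baseline)
        shocked[name] = baseline[name] + bump
        result[name] = model(shocked) - base
    return result

baseline = {"credit_score": 700, "dti": 35, "ltv": 80}
bumps = {"credit_score": 50, "dti": 10, "ltv": 10}   # assumed shock sizes

sens = sensitivities(toy_credit_model, baseline, bumps)
ranked = sorted(sens, key=lambda name: abs(sens[name]), reverse=True)
print(ranked)
```

A ranking that put a variable such as zip code at the top would be the kind of result that warrants the closer look described above.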

————————–

Model validators have many tools at their disposal for assessing the conceptual soundness, theory, and reliability of conventionally developed predictive models. Machine learning is not a substitute for these, but its techniques offer a variety of ways of supplementing traditional model validation approaches and can provide validators with additional tools for ensuring that models are adequately supported by the data that underlies them.

Applying Model Validation Principles to Machine Learning Models

Machine learning models pose a unique set of challenges to model validators. While exponential increases in the availability of data, computational power, and algorithmic sophistication in recent years have enabled banks and other firms to increasingly derive actionable insights from machine learning methods, the significant complexity of these systems introduces new dimensions of risk.

When appropriately implemented, machine learning models greatly improve the accuracy of predictions that are vital to the risk management decisions financial institutions make. The price of this accuracy, however, is complexity and, at times, a lack of transparency. Consequently, machine learning models must be particularly well maintained and their assumptions thoroughly understood and vetted in order to prevent wildly inaccurate predictions. While maintenance remains primarily the responsibility of the model owner and the first line of defense, second-line model validators increasingly must be able to understand machine learning principles well enough to devise effective challenge that includes:

Analysis of model estimation data to determine the suitability of the machine learning algorithm
Assessment of space and time complexity constraints that inform model training time and scalability
Review of model training/testing procedure
Determination of whether model hyperparameters are appropriate
Calculation of metrics for determining model accuracy and robustness

There is more than one way to organize these considerations along the three pillars of model validation. Here is how we have come to think about it.

Conceptual Soundness

Many of the concepts of reviewing model theory that govern conventional model validations apply equally well to machine learning models. The question of “business fit” and whether the variables the model lands on are reasonable is just as valid when the variables are selected by a machine as it is when they are selected by a human analyst. Assessing the variable selection process “qualitatively” (does it make sense?) as well as quantitatively (measuring goodness of fit by calculating residual errors, among other tests) takes on particular importance when it comes to machine learning models.

Machine learning does not relieve validators of their responsibility to assess the statistical soundness of a model’s data. Machine learning models are not immune to data issues. Validators protect against these by running routine distribution, collinearity, and related tests on model datasets. They must also ensure that the population has been appropriately and reasonably divided into training and holdout/test datasets.

Supplementing these statistical tests should be a thorough assessment of the modeler’s data preparation procedures. In addition to evaluating the ETL process—a common component of all model validations—effective validations of machine learning models take particular notice of variable “scaling” methods. Scaling is important to machine learning algorithms because they generally do not take units into account. Consequently, a machine learning model that relies on borrower income (generally ranging between tens of thousands and hundreds of thousands of dollars), borrower credit score (which generally falls within a range of a few hundred points) and loan-to-value ratio (expressed as a percentage), needs to apply scaling factors to normalize these ranges in order for the model to correctly process each variable’s relative importance. Validators should ensure that scaling and normalizations are reasonable.
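The effect of scaling is easy to see with a small example. The sketch below applies z-score standardization (one common choice among several) to made-up values for the three variables mentioned above, so that each ends up with mean 0 and standard deviation 1 regardless of its original units.

```python
import statistics

def zscore(values):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical loan records; note the very different raw ranges.
loans = {
    "income": [45_000.0, 82_000.0, 120_000.0, 260_000.0],
    "credit_score": [640.0, 700.0, 745.0, 790.0],
    "ltv_pct": [95.0, 80.0, 75.0, 60.0],
}

scaled = {name: zscore(col) for name, col in loans.items()}
for name, col in scaled.items():
    print(name, [round(v, 2) for v in col])
```

When reviewing a modeler's procedure, a validator would also want to confirm that scaling parameters were derived from the training data only and then reused for the holdout data.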

Model assumptions, when it comes to machine learning validation, are most frequently addressed by looking at the selection, optimization, and tuning of the model’s hyperparameters. Validators must determine whether the selection/identification process undertaken by the modeler (be it grid search, random search, Bayesian Optimization, or another method—see this blog post for a concise summary of these) is conceptually sound.

Process Verification

Machine learning models are no more immune to overfitting and underfitting (the bias-variance dilemma) than are conventionally developed predictive models. An overfitted model may perform well on the in-sample data, but predict poorly on the out-of-sample data. Complex nonparametric and nonlinear methods used in machine learning algorithms combined with high computing power are likely to contribute to an overfitted machine learning model. An underfitted model, on the other hand, performs poorly in general, mainly due to an overly simplified model algorithm that does a poor job at interpreting the information contained within data.

Cross-validation is a popular technique for detecting and preventing the fitting or “generalization capability” issues in machine learning. In K-fold cross-validation, the training data is partitioned into K subsets (folds). The model is trained K times, each time on all of the training data except one held-out fold, and the held-out fold is used to validate performance. The model’s generalization capability is low if the accuracy ratios are consistently low on both sets (underfitted) or high on the training set but markedly lower on the validation set (overfitted). Conventional models, such as regression analysis, can be used to benchmark performance.
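The mechanics can be sketched in a few lines. The example below uses a deliberately overfit-prone model (a 1-nearest-neighbor classifier that simply memorizes its training data) on made-up, slightly noisy labels. The model, the data, and the unshuffled interleaved folds are assumptions chosen to keep the example short and deterministic.

```python
def k_fold_splits(n, k):
    """Yield (train_idx, validation_idx) pairs; each record is held out exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds, no shuffling
    for held_out in folds:
        held = set(held_out)
        yield [i for i in range(n) if i not in held], held_out

def fit_1nn(xs, ys):
    """1-nearest-neighbor classifier: memorizes the training data."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[nearest]
    return predict

def accuracy(predict, xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, k):
    """Return one (training accuracy, validation accuracy) pair per fold."""
    results = []
    for train, val in k_fold_splits(len(xs), k):
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        vx, vy = [xs[i] for i in val], [ys[i] for i in val]
        model = fit_1nn(tx, ty)
        results.append((accuracy(model, tx, ty), accuracy(model, vx, vy)))
    return results

# Hypothetical one-feature data with two noisy labels.
xs = [float(i) for i in range(12)]
ys = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1]
results = cross_validate(xs, ys, k=3)
for train_acc, val_acc in results:
    print(f"train={train_acc:.2f}  validation={val_acc:.2f}")
```

Perfect training accuracy paired with noticeably weaker validation accuracy is the overfitting signature described above.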

Outcomes Analysis

Outcomes analysis enables validators to verify the appropriateness of the model’s performance measure methods. Performance measures (or “scoring methods”) are typically specialized to the algorithm type, such as classification and clustering. Validators can try different scoring methods to test and understand the model’s performance. Sensitivity analyses can be performed on the algorithms, hyperparameters, and seed parameters. Since there is no right or wrong answer, validators should focus on the dispersion of the sensitivity results.

Many statistical tactics commonly used to validate conventional models apply equally well to machine learning models. One notable omission is the ability to precisely replicate the model’s outputs. Unlike with an OLS or ARIMA model, for which a validator can reasonably expect to be able to match the model’s coefficients exactly if given the same data, machine learning models can be tested only indirectly—by testing the conceptual soundness of the selected features and assumptions (hyperparameters) and by evaluating the process and outputs. Applying model validation tactics specially tailored to machine learning models allows financial institutions to deploy these powerful tools with greater confidence by demonstrating that they are of sound conceptual design and perform as expected.

Machine Learning Detects Model Validation Blind Spots

Machine learning represents the next frontier in model validation—particularly in the credit and prepayment modeling arena. Financial institutions employ numerous models to make predictions relating to MBS performance. Validating these models by assessing their predictions is of paramount importance, but even models that appear to perform well based upon summary statistics can have subsets of input (input subspaces) for which they tend to perform poorly. Isolating these “blind spots” can be challenging using conventional model validation techniques, but recently developed machine learning algorithms are making the job easier and the results more reliable.

High-Error Subspace Visualization

RiskSpan’s modeling team has developed a statistical algorithm which identifies high-error subspaces and flags model outputs corresponding to inputs originating from these subspaces, indicating to model users that the results might be unreliable. An extension to this problem that we also address is whether migration of data points to more error-prone subspaces of the input space over time can be indicative of macroeconomic regime shifts and signal a need to re-estimate the model. This will aid in the prevention of declining model efficacy over time.

Due to the high-dimensional nature of the input spaces of many financial models, traditional statistical methods of partitioning data may prove inadequate. Using machine learning techniques, we have developed a more robust method of high-error subspace identification. We develop the algorithm using loan performance model data, but the method is adaptable to generic models.

Data Selection and Preparation

The dataset we use for our analysis is a random sample of the publicly available Freddie Mac Loan-Level Dataset. The entire dataset covers the monthly loan performance for loans originated from 1999 to 2016 (25.4 million fixed-rate mortgages). From this set, one million loans were randomly sampled. Features of this dataset include loan-to-value ratio, borrower debt-to-income ratio, borrower credit score, interest rate, and loan status, among others. We aggregate the monthly status vectors for each loan into a single vector which contains a loan status time series over the life of the loan within the historical period. This aggregated status vector is mapped to a value of 1 if the time series indicates the loan was ever 90 days delinquent within the first three years after its origination, representing a default, and 0 otherwise. This procedure results in 914,802 total records.
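The mapping from a status time series to a binary default flag can be sketched as follows. The encoding used here (days delinquent for each month since origination) is a simplifying assumption for illustration and is not the Freddie Mac file layout.

```python
def default_flag(days_delinquent_by_month, window_months=36, threshold_days=90):
    """Return 1 if the loan was ever at least threshold_days delinquent
    within the first window_months after origination, else 0."""
    window = days_delinquent_by_month[:window_months]
    return int(any(days >= threshold_days for days in window))

# Hypothetical status histories (one entry per month since origination).
went_90_days = [0, 0, 30, 60, 90, 120, 0]
cured_early = [0, 30, 60, 30, 0, 0]
late_default = [0] * 36 + [90]        # first 90-day delinquency falls outside the window

print(default_flag(went_90_days), default_flag(cured_early), default_flag(late_default))
```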

Algorithm Framework

Using the prepared loan dataset, we estimate a logistic regression loan performance model. The data is sampled and partitioned into training and test datasets for clustering analysis. The model estimation and training data is taken from loans originating in the period from 1999 to 2007, while loans originating in the period from 2008 to 2016 are used for testing. Once the data has been partitioned into training and test sets, a clustering algorithm is run on the training data.

Two-Dimensional Visualization of Select Clusters

The clustering is evaluated based upon its ability to stratify the loan data into clusters that meaningfully identify regions of the input for which the model performs poorly. This requires the average model performance error associated with certain clusters to be substantially higher than the mean. After the training data is assigned to clusters, cluster-level error is computed for each cluster using the logistic regression model. Clusters with high error are flagged based upon a scoring scheme. Each loan in the test set is assigned to a cluster based upon its proximity to the training cluster centers. Loans in the test set that are assigned to flagged clusters are flagged, indicating that the loan comes from a region for which loan performance model predictions exhibit lower accuracy.
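In outline, the flagging step might look like the sketch below. The scoring scheme is not spelled out here, so the rule used (flag clusters whose error rate exceeds the mean cluster error by more than one standard deviation), the centroids, and the data are all assumptions for illustration.

```python
import math
import statistics

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def flag_clusters(train_points, train_errors, centroids, n_std=1.0):
    """Flag clusters whose error rate is more than n_std above the mean cluster error."""
    per_cluster = {j: [] for j in range(len(centroids))}
    for p, err in zip(train_points, train_errors):
        per_cluster[nearest(p, centroids)].append(err)
    rates = {j: statistics.fmean(v) for j, v in per_cluster.items() if v}
    values = list(rates.values())
    cutoff = statistics.fmean(values) + n_std * statistics.pstdev(values)
    return {j for j, r in rates.items() if r > cutoff}, rates

# Hypothetical training cluster centers and loans (two scaled features each).
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
train_points = [(0.0, 0.5), (0.5, 0.0), (-0.5, 0.0), (0.0, -0.5),
                (5.0, 5.5), (5.5, 5.0), (4.5, 5.0), (5.0, 4.5),
                (10.0, 0.5), (10.5, 0.0), (9.5, 0.0), (10.0, -0.5)]
# 1 = the performance model misclassified the loan, 0 = it classified it correctly.
train_errors = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0]

flagged, rates = flag_clusters(train_points, train_errors, centroids)

# Test-period loans inherit the flag of the nearest training cluster center.
test_points = [(9.5, 0.5), (0.2, 0.1)]
test_flags = [nearest(p, centroids) in flagged for p in test_points]
print(flagged, test_flags)
```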

Algorithm Performance Analysis

The clustering algorithm successfully flagged high-error regions of the input space, with flagged test clusters exhibiting accuracy more than one standard deviation below the mean. The high errors associated with clusters flagged during model training were persistent over time, with flagged clusters in the test set having a model accuracy of just 38.7%, compared to an accuracy of 92.1% for unflagged clusters. Failure to address observed high-error clusters in the training set and migration of data to high-error subspaces led to substantially diminished model accuracy, with overall model accuracy dropping from 93.9% in the earlier period to 84.1% in the later period.

Training/Test Cluster Error Comparison

We also determined the nature of default misclassifications and the variables with the greatest impact on misclassification. Cluster FICO scores proved to be a strong indicator of cluster model prediction accuracy. While a relatively large proportion of loans in low-FICO clusters defaulted, the logistic regression model substantially overpredicted the number of defaults for these clusters, leading to a large number of Type I errors (inaccurate default predictions) for these clusters. Type II errors (inaccurate non-default predictions) constituted a smaller proportion of overall model error, and their impact was diminished even further when considering their magnitude relative to the number of true negative predictions (accurate non-default predictions), which are far greater in number than true positive predictions (accurate default predictions).

FICO vs. Cluster Accuracy

Conclusion

Our application of the subspace error identification algorithm to a loan performance model illustrates the dangers of using high-level summary statistics as the sole determinant of model efficacy and failure to consistently monitor the statistical profile of model input data over time. Often, more advanced statistical analysis is required to comprehensively understand model performance. The algorithm identified sets of loans for which the model was systematically misclassifying default status. These large-scale errors come at a high cost to financial institutions employing such models.

As an extension to this research into high error subspace detection, RiskSpan is currently developing machine learning analytics tools that can detect the root cause of systematic model errors and suggest ways to enhance predictive model performance by alleviating these errors.

Back-Testing: Using RS Edge to Validate a Prepayment Model

Most asset-liability management (ALM) models contain an embedded prepayment model for residential mortgage loans. To gauge their accuracy, prepayment modelers typically run a back-test comparing model projections to the actual prepayment rates observed. A standard test is to run a portfolio of loans as of a year ago using the actual interest rates experienced during this time as well as any additional economic factors used by the model such as home price appreciation or the unemployment rate. This methodology isolates the model’s ability to estimate voluntary payoffs from its ability to forecast the economic variables.

The graph below was produced from such a back-test. The residential mortgage loans in the bank’s portfolio as of 10/31/2016 were run through the ALM model (projections) and compared with the observed speeds (actuals). It is apparent that the model did not do a particularly good job forecasting the actual CPRs, as the mean absolute error is 5.0%. Prepayment model validators typically prefer to see mean absolute error rates no higher than 1 to 2%.
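For reference, the mean absolute error in a back-test like this is simply the average absolute gap between projected and actual CPR across the months tested. The monthly figures below are made up to illustrate the calculation and are not the bank's data.

```python
def mean_absolute_error(projected, actual):
    """Average absolute difference between two equal-length series."""
    if len(projected) != len(actual):
        raise ValueError("series must be the same length")
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(projected)

# Hypothetical monthly CPRs, in percent.
projected_cpr = [10.0, 12.0, 14.0, 11.0]
actual_cpr = [15.0, 17.0, 19.0, 16.0]

mae = mean_absolute_error(projected_cpr, actual_cpr)
print(f"MAE = {mae:.1f}%")
```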

Does this mean there is something unique with the bank’s loan portfolio or servicing practices that would cause prepays to deviate from expectations, or does the prepayment model require calibration?

Dissecting the Problem

One strategy is to compare the bank’s prepayment experience to that of the market (see below). The “market” is the universe of comparable loans, in this case residential, conventional loans. This assessment should indicate whether the bank’s portfolio is unique or if it behaves similarly to the market. Although this comparison looks better, there are still some material differences, especially at the beginning and end of the time series.

Examining the portfolio composition reveals a number of differences which could be the source of the discrepancy. For example:

Larger-balance loans have a greater refinance incentive.
California loans historically prepay faster than the rest of the country, while New York loans are historically slower.
Broker and correspondent loans typically pay faster than retail originations.

To compensate, the next step is to adjust the market portfolio to more closely mirror the attributes of the bank’s portfolio. Fine-tuning the “market” so that it better aligns with the bank’s channel and geographic breakout, as well as its larger average loan size, results in the following adjusted prepayment speeds.
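One simple way to think about this adjustment is as a reweighting: take the market's observed speed for each cohort and weight it by the bank's share of balance in that cohort rather than the market's. The cohorts, speeds, and weights below are hypothetical, and an actual adjustment would cut across channel, geography, and loan size together.

```python
def weighted_speed(cohort_cpr, weights):
    """Balance-weighted average CPR across cohorts (weights must sum to 1)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(cohort_cpr[cohort] * w for cohort, w in weights.items())

# Hypothetical market CPRs by origination channel, in percent.
market_cpr = {"retail": 10.0, "broker": 14.0, "correspondent": 13.0}

market_weights = {"retail": 0.6, "broker": 0.2, "correspondent": 0.2}
bank_weights = {"retail": 0.3, "broker": 0.4, "correspondent": 0.3}

unadjusted = weighted_speed(market_cpr, market_weights)
adjusted = weighted_speed(market_cpr, bank_weights)
print(round(unadjusted, 1), round(adjusted, 1))
```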

Conclusion

Prepayments for the bank’s mortgage portfolio track the market speeds reasonably well with no adjustments. Compensating for the differences in composition related to channel, geography, and loan size tracks even better and results in a mean absolute error of only 1.1%. This indicates that there is nothing unique or idiosyncratic with the bank’s portfolio that would cause projections from a market-based prepayment model to deviate significantly from the observed speeds. Consequently, the ALM prepayment model likely needs adjustments to its tuning parameters to better capture the current environment.

Why Model Validation Does Not Eliminate Spreadsheet Risk

Model risk managers invest considerable time in determining which spreadsheets qualify as models, which are end-user computing (EUC) applications, and which are neither. Seldom, however, do model risk managers consider the question of whether a spreadsheet is the appropriate tool for the task at hand.

Perhaps they should start.

Buried in the middle of page seven of the joint Federal Reserve/OCC supervisory guidance on model risk management is this frequently overlooked principle:

“Sound model risk management depends on substantial investment in supporting systems to ensure data and reporting integrity, together with controls and testing to ensure proper implementation of models, effective systems integration, and appropriate use.”

It brings to mind a fairly obvious question: What good is a “substantial investment” in data integrity surrounding the modeling process when the modeling itself is carried out in Excel? Spreadsheets are useful tools, to be sure, but they meet virtually none of the development standards to which traditional production systems are held. What percentage of “spreadsheet models” are subjected to the rigors of the software development life cycle (SDLC) before being put into use?

Model Validation vs. SDLC

More often than not, and usually without realizing it, banks use model validation as a substitute for SDLC when it comes to spreadsheet models. The main problem with this approach is that SDLC and model validation are complementary processes and are not designed to stand in for one another. SDLC is a primarily forward-looking process to ensure applications are implemented properly. Model validation is primarily backward looking and seeks to determine whether existing applications are working as they should.

SDLC includes robust planning, design, and implementation—developing business and technical requirements and then developing or selecting the right tool for the job. Model validation may perform a few cursory tests designed to determine whether some semblance of a selection process has taken place, but model validation is not designed to replicate (or actually perform) the selection process.

This presents a problem because spreadsheet models are seldom if ever built with SDLC principles in mind. Rather, they are more likely to evolve organically as analysts seek increasingly innovative ways of automating business tasks. A spreadsheet may begin as a simple calculator, but as analysts become more sophisticated, they gradually introduce increasingly complex functionality and coding into their spreadsheet. And then one day, the spreadsheet gets picked up by an operational risk discovery tool and the analyst suddenly becomes a model owner. Not every spreadsheet model evolves in such an unstructured way, of course, but more than a few do. And even spreadsheet-based applications that are designed to be models from the outset are seldom created according to a disciplined SDLC process.

I am confident that this is the primary reason spreadsheet models are often so poorly documented. They simply weren’t designed to be models. They weren’t really designed at all. A lot of intelligent, critical thought may have gone into their code and formulas, but little if any thought was likely given to the question of whether a spreadsheet is the best tool for what the spreadsheet has evolved to be able to do.

Challenging the Spreadsheets Themselves

Outside of banking, a growing number of firms are becoming wary of spreadsheets and attempting to move away from them. A Wall Street Journal article last week cited CFOs from companies as diverse as P.F. Chang’s China Bistro Inc., ABM Industries, and Wintrust Financial Corp. seeking to “reduce how much their finance teams use Excel for financial planning, analysis and reporting.”

Many of the reasons spreadsheets are falling out of favor have little to do with governance and risk management. But one core reason will resonate with anyone who has ever attempted to validate a spreadsheet model. Quoting from the article: “Errors can bloom because data in Excel is separated from other systems and isn’t automatically updated.”

It is precisely this “separation” of spreadsheet data from its sources that is so problematic for model validators. Even if a validator can determine that the input data in the spreadsheet is consistent with the source data at the time of validation, it is difficult to ascertain whether tomorrow’s input data will be. Even spreadsheets that pull input data in via dynamic linking or automated feeds can be problematic because the code governing the links and feeds can so easily become broken or corrupted.

An Expanded Way of Thinking About “Conceptual Soundness”

Typically, when model validators speak of evaluating conceptual soundness, they are referring to the model’s underlying theory, how its variables were selected, the reasonableness of its inputs and assumptions, and how well everything is documented. In diving into these details, it is easy to overlook the supervisory guidance’s opening sentence in the Evaluation of Conceptual Soundness section: “This element involves assessing the quality of the model design and construction.”

How often, in assessing a spreadsheet model’s design and construction, do validators ask, “Is Excel even the right application for this?” Not very often, I suspect. When an analyst is assigned to validate a model, the medium is simply a given. In a perfect world, model validators would be empowered to issue a finding along the lines of, “Excel is not an appropriate tool for a high-risk production model of this scope and importance.” Practically speaking, however, few departments will be willing to upend the way they work and analyze data in response to a model validation finding. (In the WSJ article, it took CFOs to effect that kind of change.)

Absent the ability to nudge model owners away from spreadsheets entirely, model validators would do well to incorporate certain additional “best practices” checks into their validation procedures when the model in question is a spreadsheet. These might include the following:

Incorporation of a cover sheet on the first tab of the workbook that includes the model’s name, the model’s version, a brief description of what the model does, and a table of contents defining and describing the purpose of each tab
Application of a consistent color key so that inputs, assumptions, macros, and formulas can be easily identified
Grouping of inputs by source, e.g., raw data versus transformed data versus calculations
Grouping of inputs, processing, and output tabs together by color
Separate instruction sheets for data import and transformation

Spreadsheets present unique challenges to model validators. By accounting for the additional risk posed by the nature of spreadsheets themselves, model risk managers can contribute value by identifying situations where the effectiveness of sound data, theory, and analysis is blunted by an inadequate tool.

AML Models: Applying Model Validation Principles to Non-Models

Anti-money-laundering (AML) solutions have no business being classified as models. To be sure, AML “models” are sophisticated, complex, and vitally important. But it requires a rather expansive interpretation of the OCC/Federal Reserve/FDIC definition of the term model to realistically apply the term to AML solutions.

Supervisory guidance defines model as “a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates.”

While AML compliance models are consistent with certain elements of that definition, it is a stretch to argue that these elaborate, business-rule engines are generating outputs that qualify as “quantitative estimates.” They flag transactions and the people who make them, but they do not estimate or predict anything quantitative.

We could spend a lot more time arguing that AML tools (including automated OFAC and other watch-list checks) are not technically models. But in the end, these arguments are moot if an examining regulator holds a differing view. If a bank’s regulator declares the bank’s AML applications to be models and orders that they be validated, then presenting a well-reasoned argument about how these tools don’t rise to the technical definition of a model is not the most prudent course of action (probably).

Tailoring Applicable Model Validation Principles to AML Models

What makes it challenging to validate AML “models” is not merely the additional level of effort; it is that most model validation concepts are designed to evaluate systems that generate quantitative estimates. Consequently, in order to generate a model validation report that will withstand scrutiny, it is important to think of ways to adapt the three pillars of model validation (conceptual soundness review, benchmarking, and back-testing) to the unique characteristics of a non-model.

Conceptual Soundness of AML Solutions

The first pillar of model validation—conceptual soundness—is also its most universally applicable. Determining whether an application is well designed and constructed, whether its inputs and assumptions are reasonably sourced and defensible, whether it is sufficiently documented, and whether it meets the needs for which it was developed is every bit as applicable to AML solutions, EUCs and other non-predictive tools as it is to models.

For AML “models,” a conceptual soundness review generally encompasses the following activities:

Documentation review: Are the rule and alert definitions and configurations identified? Are they sufficiently explained and justified? This requires detailed documentation not only from the application vendor, but also from the BSA/AML group within the bank that uses it.
Transaction verification: Verifying that all transactions and customers are covered and evaluated by the tool.
Risk assessment review: Evaluating the institution’s risk assessment methodology and whether the application’s configurations are consistent with it.
Data review: Are all data inputs mapped, extracted, transformed, and loaded correctly from their respective source systems into the AML engine?
Watchlist filtering: Are watchlist criteria configured correctly? Is the AML model receiving all the information it needs to generate alerts?

Benchmarking (and Process Verification) of AML Tools

Benchmarking is primarily geared toward comparing a model’s uncertain outputs against the uncertain outputs of a challenger model. AML outputs are not particularly well-suited to such a comparison. As such, benchmarking one AML tool against another is not usually feasible. Even in the unlikely event that a validator has access to a separate, “challenger” AML “model,” integrating it with all of a bank’s necessary customer and transaction systems and making sure it works is a months-long project. The nature of AML monitoring—looking at every customer and every single transaction—makes integrating a second, benchmarking engine highly impractical. And even if it were practical, the functionality of any AML system is primarily determined by its calibration and settings. Once the challenger system has been configured to match the system being tested, the objective of the benchmarking exercise is largely defeated.

So, now what? In a model validation context, benchmarking is typically performed and reported in the context of a broader “process verification” exercise—tests to determine whether the model is accomplishing what it purports to. Process verification has broad applicability to AML reviews and typically includes the following components:

Above-the-line testing: An evaluation of the alerts triggered by the application and identification of any “false positives” (Type I error).
Below-the-line testing: An evaluation of all bank activity to determine whether any transactions that should have been flagged as alerts were missed by the application. These would constitute “false negatives” (Type II error).
Documentation comparison: Determination of whether the application is calculating risk scores in a manner consistent with documented methodology.
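As a rough illustration, above-the-line and below-the-line testing can be framed as a replication exercise: the validator independently re-applies a documented rule to the full transaction population and compares the result with the alerts the system actually produced. The sketch below assumes a single hypothetical rule (alert on cash transactions at or above a threshold). The rule, the threshold, the field names, and the sample records are all invented for illustration; a real transaction monitoring system applies many rules of far greater complexity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    txn_id: str
    amount: float
    is_cash: bool


# Hypothetical rule for illustration only: alert on cash transactions
# at or above a configured threshold.
CASH_THRESHOLD = 10_000.0


def should_alert(txn, threshold=CASH_THRESHOLD):
    """Independent re-implementation of the (assumed) documented rule."""
    return txn.is_cash and txn.amount >= threshold


def verify_alerts(transactions, system_alert_ids, threshold=CASH_THRESHOLD):
    """Compare the system's alerts with an independent replication of the rule."""
    expected = {t.txn_id for t in transactions if should_alert(t, threshold)}
    actual = set(system_alert_ids)
    return {
        # Alerted by the system but not by the replicated rule (above the line).
        "false_positives": sorted(actual - expected),
        # Flagged by the replicated rule but missed by the system (below the line).
        "false_negatives": sorted(expected - actual),
    }


transactions = [
    Transaction("T1", 12_000.0, True),
    Transaction("T2", 10_000.0, True),
    Transaction("T3", 50_000.0, False),
    Transaction("T4", 500.0, True),
]
system_alerts = ["T1", "T3"]  # made-up system output

result = verify_alerts(transactions, system_alerts)
print(result)
```

In practice, each discrepancy would be investigated rather than accepted at face value, since a mismatch may reflect a gap in the documentation as easily as a defect in the system.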

Back-Testing (and Outcomes Analysis) of AML Applications

Because AML applications are not designed to predict the future, the notion of back-testing does not really apply to them. However, in the model validation context, back-testing is typically performed as part of a broader analysis of model outcomes. Here again, a number of AML tests apply, including the following:

Rule relevance: How many rules are never triggered? Are there any rules that, when triggered, are always overridden by manual review of the alert?
Schedule evaluation: Evaluation of the AML system’s performance testing schedule.
Distribution analysis: Determining whether the distribution of alerts is logical in light of typical customer transaction activity and the bank’s view of its overall risk profile.
Management reporting: How do the AML system’s outputs, including the resulting Suspicious Activity Reports, flow into management reports? How are these reports reviewed for accuracy, presented, and archived?
Output maintenance: How are reports created and maintained? How is AML system output archived for reporting and ongoing monitoring purposes?
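The rule relevance questions above lend themselves to a simple tabulation. The following is a minimal sketch that assumes the validator can obtain a list of configured rule identifiers and an extract of alerts with their final dispositions. The rule names and the disposition labels ("overridden", "escalated") are hypothetical.

```python
from collections import defaultdict


def rule_relevance(configured_rules, alerts):
    """Summarize which rules never fire and which are always overridden.

    alerts is an iterable of (rule_id, disposition) pairs.
    """
    dispositions = defaultdict(list)
    for rule_id, disposition in alerts:
        dispositions[rule_id].append(disposition)

    never_triggered = sorted(r for r in configured_rules if r not in dispositions)
    always_overridden = sorted(
        rule_id
        for rule_id, outcomes in dispositions.items()
        if all(outcome == "overridden" for outcome in outcomes)
    )
    return never_triggered, always_overridden


configured = ["R1", "R2", "R3", "R4"]
alert_extract = [
    ("R1", "escalated"),
    ("R1", "overridden"),
    ("R2", "overridden"),
    ("R2", "overridden"),
]

never_triggered, always_overridden = rule_relevance(configured, alert_extract)
print("Never triggered:", never_triggered)
print("Always overridden:", always_overridden)
```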

Testing AML Models: Balancing Thoroughness and Practicality

Generally speaking, model validators are given to being thorough. When presented with the task of validating an AML “model,” they are likely to look beyond the limitations associated with applying model validation principles to non-models and focus on devising tests designed to assess whether the AML solution is working as intended.

Left to their own devices, many model validation analysts will likely err on the side of doing more than is necessary to fulfill the requirements of an AML model validation. Devising an approach that aligns effective challenge testing with the three defined pillars of model validation has a dual benefit. It results in a model validation report that maps back to regulatory guidance and is therefore more likely to stand up to scrutiny. It also helps confine the universe of potential testing to only those areas that require testing. Restricting testing to only what is necessary and then thoroughly pursuing that narrowly defined set of tests is ultimately the key to maintaining the effectiveness and efficiency of AML testing in particular and of model risk management programs as a whole.

[1] On June 7, 2017, the FDIC formally adopted the Supervisory Guidance previously set forth jointly by the OCC (2011-12) and Federal Reserve (SR 11-7).

Machine Learning Model Selection

Machine learning model selection is the second step of the machine learning process, following variable selection and data cleansing. Selecting the right machine learning model is a critical step, as a model that does not appropriately fit the data will yield inaccurate results. Model selection largely depends on the goal of the model: is the purpose to explore the relationship between the variables or to maximize predictive power? In this blog, we cover a few key concepts of machine learning model selection, including parametric vs. non-parametric models, key metrics for managing the variance-bias tradeoff, and an introduction to a few standard machine learning models.

Parametric vs. Non-Parametric Tradeoffs

One of the first choices to be made in the model selection process pertains to our assumption about the shape of the functional relationship between our explanatory variables (our given, or input, variables) and our response variable (the output that we want to predict). When we choose to assume the shape of our model, we are constructing a parametric model, and our problem reduces to estimating a set of measurable factors, known as parameters.¹ One of the most common assumptions is that the relationship between the explanatory variables and the response is linear. While we can relax the linear assumption when necessary, we sometimes do not want to assume the shape of the function at all. Non-parametric models help to avoid the case where we incorrectly assume a function that does not match the data. However, a much larger number of observations must be obtained to make non-parametric methods effective, which can be costly or even infeasible.²

In addition to the fact that non-parametric methods are often not practical, there are other tradeoffs to take into consideration. One important tradeoff is between interpretability and flexibility. Since non-parametric models follow the data closely, they often result in abnormally shaped plots, which can be difficult to interpret. If the goal is to make sense of and model the relationship between the explanatory variable and the response, we may be willing to trade some predictive power for a parametric curve that is more understandable. If, however, we are comfortable constructing a “black-box” in hopes of maximizing the predictive power of the model, then non-parametric models may be suitable.

Another important tradeoff is that of variance versus bias. Variance, in the context of statistical learning, refers to the amount by which our prediction would change if we had used a different training dataset for our estimation. Bias refers to the error resulting from approximating a complex relationship by using a simplified representation of it. In general, more flexible (non-parametric) methods tend to have higher variance and lower bias, with the opposite being true of less flexible (parametric) models. Ideally though, we want a model that has low variance and low bias. To find it, we most frequently rely on three important tools: R-squared, residual standard error, and diagnostic plots.
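The variance-bias tradeoff can be made concrete with a small simulation. The sketch below repeatedly draws noisy training sets from an assumed true relationship, fits a rigid model (a straight line) and a more flexible one (a degree-7 polynomial), and measures how far the average prediction sits from the truth (squared bias) and how much predictions move from one training set to the next (variance). The true function, noise level, sample size, and polynomial degrees are all arbitrary choices made for illustration, and the higher-degree polynomial merely stands in for a "more flexible" method; it is still a parametric model.

```python
import numpy as np


def simulate_fits(degree, n_trials=500, n_obs=30, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of polynomial fits of a given degree."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n_obs)        # fixed design points
    x_eval = np.linspace(-0.9, 0.9, 25)      # points at which predictions are compared
    truth = np.sin(3.0 * x_eval)             # assumed true relationship

    predictions = np.empty((n_trials, x_eval.size))
    for i in range(n_trials):
        # A fresh training set: same x values, new random noise.
        y = np.sin(3.0 * x) + rng.normal(0.0, noise_sd, n_obs)
        coefficients = np.polyfit(x, y, degree)
        predictions[i] = np.polyval(coefficients, x_eval)

    bias_squared = float(np.mean((predictions.mean(axis=0) - truth) ** 2))
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_squared, variance


rigid_bias, rigid_variance = simulate_fits(degree=1)
flexible_bias, flexible_variance = simulate_fits(degree=7)

print(f"degree 1: bias^2 = {rigid_bias:.4f}, variance = {rigid_variance:.4f}")
print(f"degree 7: bias^2 = {flexible_bias:.4f}, variance = {flexible_variance:.4f}")
```

Under these assumptions, the straight line should show the larger squared bias and the degree-7 polynomial the larger variance, which is the pattern described above.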

R-Squared, Residual Standard Error, and Plots

R-squared (formally, the “coefficient of determination”) measures the proportion of variance in the response variable that is explained by the explanatory variables. It is constrained between 0 and 1; a very low R-squared can indicate problems with model fit, while a very high R-squared can sometimes indicate overfitting. Residual standard error (RSE) estimates the standard deviation of the error term, that is, the typical distance between the observed responses and the fitted regression. RSE depends on the residual sum of squares (the variation in the data left unexplained after the regression has been run), the number of observations, and the number of explanatory variables.
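Both statistics can be computed directly from a least-squares fit. The sketch below uses the standard definitions R-squared = 1 - RSS/TSS and RSE = sqrt(RSS / (n - p - 1)), where RSS is the residual sum of squares, TSS is the total sum of squares, n is the number of observations, and p is the number of explanatory variables. The function name and the four-point dataset are illustrative only.

```python
import numpy as np


def fit_statistics(X, y):
    """Return (R-squared, RSE) for an ordinary least squares fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    design = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)  # least squares coefficients
    residuals = y - design @ beta

    rss = float(residuals @ residuals)                 # unexplained variation
    tss = float(((y - y.mean()) ** 2).sum())           # total variation
    r_squared = 1.0 - rss / tss
    rse = float(np.sqrt(rss / (n - p - 1)))
    return r_squared, rse


X_demo = [[1.0], [2.0], [3.0], [4.0]]
y_demo = [1.0, 3.0, 2.0, 4.0]
r_squared, rse = fit_statistics(X_demo, y_demo)
print(f"R-squared = {r_squared:.3f}, RSE = {rse:.3f}")
```

For this toy dataset the fit explains 64 percent of the variance (R-squared of 0.64), and the RSE is roughly 0.95.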

Graphical plots complement R-squared and RSE. Plots can be as simple as plotting the response variable against a single explanatory variable or against a fitted linear model. This can be useful for detecting non-linearity, but other plots have broader application.

One such plot is the residual plot, which plots the residuals (the differences between the observed response values and the fitted values) against the fitted values themselves. Patterns in residual plots can suggest a lack of model fit, perhaps due to non-constant variance or non-linearity in the data. Outliers and leverage points³ can also be detected through standardized residual plots, normal Q-Q plots, and leverage/Cook’s distance plots.

Observing these diagnostic plots enables us to make decisions as to what functional form our variables should take. For instance, by taking a logarithmic function (a curved function) of our response variable, we can help to account for non-constant variance in our model, or a non-linear relationship with the explanatory variables. We can also relax the additive assumption in a linear model by adding multiplicative combinations of variables—a technique that helps to model a synergistic relationship between variables.

Machine Learning Models: Shrinkage Methods, Splines, and Decision Trees

Our goal is to determine the model with the highest probability of having realistically generated the data, and we have summarized above the most important metrics that can help us identify such a model. However, it is also important to be aware of several standard models—to know ahead of time which are likely to be most useful.

Shrinkage methods are an alternative to the standard linear model and most notably include ridge and lasso regressions. While these models are similar to ordinary least squares, they include a shrinkage “penalty” which shrinks the coefficients, as an increasing function of their magnitude, toward zero. Through adding this constraint, the model can offer a sizeable reduction in variance in exchange for a slight increase in bias. A tuning parameter—a coefficient on this penalty—can help us fine-tune the amount of variance we want to eliminate, as well as bias we are willing to accept.⁴
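To show the shrinkage effect in miniature, the sketch below computes ridge coefficients from the closed-form solution (X'X + λI)^(-1) X'y on standardized predictors and a centered response, then prints the size of the coefficient vector for several values of the tuning parameter λ. The simulated data, the true coefficients, and the λ values are arbitrary assumptions; the lasso has no comparable closed form and is not shown.

```python
import numpy as np


def ridge_coefficients(X, y, lam):
    """Closed-form ridge estimate on standardized predictors and centered response."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # put predictors on a common scale
    y_centered = y - y.mean()                      # removes the need for an intercept
    p = X_std.shape[1]
    return np.linalg.solve(X_std.T @ X_std + lam * np.eye(p), X_std.T @ y_centered)


# Simulated data for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)

for lam in (0.0, 10.0, 100.0):
    beta = ridge_coefficients(X, y, lam)
    print(f"lambda = {lam:6.1f}  coefficients = {np.round(beta, 3)}  "
          f"norm = {np.linalg.norm(beta):.3f}")
```

With λ = 0 the estimate coincides with ordinary least squares on the standardized data; as λ grows, the coefficient vector is pulled toward zero.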

If we are looking for a model with more flexibility and predictive power, splines may be an avenue to explore. Splines introduce several “knots” into the model, creating a smooth, continuous line with many different slopes. Unsurprisingly, since splines are much more flexible than linear regression or shrinkage methods, they have a lower bias due to following the data more closely. They also do a better job than polynomial regressions, as they provide more consistent estimates.⁵

A third option is decision trees, which provide more flexibility, but are also highly interpretable due to the way they segment the problem into a hierarchical structure. The idea is to segment the set of possible values for the explanatory variables into a number of distinct regions and make the same prediction for each observation in a particular region. This is generally done using an algorithm to select the most meaningful way to segment the observations, then the next most, and so on. Once this iterative algorithm is complete, we are left with what is usually a complex, hierarchical tree-like structure that can be readily mapped into a highly intuitive visualization. Decision trees can be very useful for their interpretability, ability to model non-linear data, and arguably more realistic approach to modeling human decision-making.
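The first step of that iterative algorithm can be sketched for a single explanatory variable: try every candidate threshold, predict the mean response on each side, and keep the threshold that leaves the smallest residual sum of squares. A full tree would repeat this search within each resulting region. The function name and the toy data are hypothetical, and residual sum of squares is one common splitting criterion rather than the only one.

```python
import numpy as np


def best_split(x, y):
    """Find the single threshold on x that minimizes the two-region RSS."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    values = np.unique(x)
    if values.size < 2:
        raise ValueError("need at least two distinct x values to split")

    # Candidate thresholds are midpoints between consecutive distinct x values.
    candidates = (values[:-1] + values[1:]) / 2.0
    best = None
    for threshold in candidates:
        left, right = y[x < threshold], y[x >= threshold]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or rss < best["rss"]:
            best = {
                "threshold": float(threshold),
                "left_mean": float(left.mean()),    # prediction for x < threshold
                "right_mean": float(right.mean()),  # prediction for x >= threshold
                "rss": float(rss),
            }
    return best


x_demo = [1, 2, 3, 10, 11, 12]
y_demo = [5, 6, 5, 20, 21, 19]
split = best_split(x_demo, y_demo)
print(split)
```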

Application to Finance and Mortgage Data

We can use machine learning to answer a wide variety of questions related to finance and mortgage data, but it is crucial to understand the model selection process. Strong domain knowledge can help considerably in knowing what assumptions would be plausible, but a knowledge of diagnostic metrics, as well as the different types of models, their strengths, and weaknesses, can help unlock insights and uncover the logic behind processes—especially when answering questions that have yet to be answered. Whether your goal is to identify which customers are most likely to default on a loan, determine the elasticity of demand for a certain type of loan, or cut out some of the noise in the data, a solid grounding in approaches to model selection can help significantly.

[1] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning (New York: Springer, 2013), 21-22.
[2] James, Witten, Hastie, and Tibshirani, 23.
[3] Outliers are observations whose Y values are unusual given the explanatory variables. Leverage points are observations whose X values are unusual relative to the rest of the data.
[4] James, Witten, Hastie, and Tibshirani, 218.
[5] James, Witten, Hastie, and Tibshirani, 276.

End-User Computing Controls – Building an EUC Inventory

An accounting manager at a mid-sized bank recently wondered aloud to us how to approach implementing end-user computing (EUC) controls. She had recently become responsible for identifying and overseeing her institution’s unknown number of EUC applications and had obviously given a lot of thought to the types of applications that needed to be identified and what the review process ought to look like. She recognized that a comprehensive inventory would need to be built, but, like so many others in her position, was uncertain of how to go about it.

We reasoned together that her options fell into two broad categories—each of which has benefits and drawbacks.

The first category of inventory-building options we classified as a top-down approach. This begins with identifying all data contained in financial statements or mission-critical management reports and then working backward from there to identify every model, database, spreadsheet, or other application that is used to generate these reports. The second category is a bottom-up approach, which first identifies every single spreadsheet in use at the bank and then determines which of those rise to the level of EUCs and need to be formally and independently reviewed.

Top-Down EUC Inventory Building

The primary advantage of a top-down approach is the comfort of knowing that everything important has been accounted for. An EUC inventory that is built systematically by tracing every figure on every balance sheet, income statement, and footnote back to every spreadsheet that contributed to it is not likely to miss much. Top-down approaches have the added benefit of placing the EUC inventory coordinator firmly in control of the exercise because she knows precisely what she is looking for. “We’re forecasting $23 million in retail deposit runoff next month,” she might observe. “Someone needs to show me the system that generated that figure. And if it’s a spreadsheet, then it needs an EUC review.”

The downside is that this exercise usually turns out to be more complicated than it sounds. One problem with requests that begin with “Somebody needs to show me…” is that “somebody” can often be hard to track down. Also, “somebody” many times is “somebodies.” Individual financial statement line items are often supported by multiple spreadsheets, and those spreadsheets may have data-feed issues of their own. What begins looking like it should be a straightforward exercise quickly evolves into one of those dreaded “spaghetti bowl” problems where attempting to extract a single strand leads to a tangled mess. A single required line item—say, cash required for loan originations in the next 90 days—would likely require input from a half-dozen or more EUCs tracking everything from economic forecasts to pipeline reports for any number of different loan types and origination channels. Before long, the person in charge of end-user computing controls can begin to feel like she’s been placed in charge of auditing not just EUCs, but the entire bank.

Bottom-Up EUC Inventory Building

A more common means of building an EUC inventory is a bottom-up approach that identifies every spreadsheet on the network and then relies on a combination of manual and automated methods to sort them into one of three bins:

Models (which have hopefully already been tagged and classified during a separate model-inventory-building process)
Non-computational/non-relevant spreadsheets (spreadsheets that either contain data only and do not perform calculations or spreadsheets that do not contribute to a quantitative business purpose—e.g., leave schedules, org charts, and fantasy football standings)
EUCs (pretty much everything that does not get filtered into the first two bins)

Identifying all the spreadsheets can be done manually or using an automated “discovery” tool. Even in the very smallest institutions, manual discovery is too big a job for a single person. Typically, individual business unit heads will be tasked with identifying all of the EUCs in use within their various realms and reporting them to a central EUC oversight coordinator. The advantage of this approach is that it enables non-EUC spreadsheets to be filtered out before they get to the central EUC oversight coordinator, which makes that person’s job easier. The disadvantage is that it is unlikely to capture every EUC. Business unit heads are incentivized to apply a sub-optimal set of criteria when determining whether a spreadsheet should be classified as an EUC. They are likely to overlook files that an impartial EUC coordinator might wish to review.

An automated discovery tool avoids this problem by grabbing everything (every spreadsheet in a given shared drive or folder structure) and then scanning and evaluating the files for formulas and levels of complexity that contribute to an EUC’s risk rating. Automated scanning tools have the dual benefit of enabling central EUC coordinators to peer into how individual business units are using spreadsheets and of removing the need to rely on the judgment of business unit heads to determine what is worthy of review. The downside is that, even with all the automated filtering discovery tools are capable of, they are likely to result in the “discovery” of a lot of spreadsheets that ultimately do not need to go through an EUC review. Paradoxically, the more automated the discovery process is, the more manual the winnowing needs to be.
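The core of a discovery scan can be sketched in a few lines. The example below walks a folder tree, collects files with spreadsheet extensions, and counts formula cells in .xlsx/.xlsm workbooks by looking for formula elements in the worksheet XML inside the file (these formats are ZIP archives). It is a simplified illustration, not a description of any commercial discovery tool: the extension list is an assumption, legacy .xls files are listed but not inspected, and a raw formula count is only a crude proxy for the complexity measures a real tool would use.

```python
import re
import zipfile
from pathlib import Path

# Assumed set of extensions to treat as spreadsheets.
SPREADSHEET_EXTENSIONS = {".xlsx", ".xlsm", ".xls", ".csv"}
# Formula cells appear as <f>...</f> or <f ...>...</f> in worksheet XML.
FORMULA_TAG = re.compile(rb"<f[ >]")


def count_formulas(path):
    """Count formula cells in an .xlsx/.xlsm workbook; 0 for other or unreadable files."""
    if path.suffix.lower() not in {".xlsx", ".xlsm"}:
        return 0
    try:
        with zipfile.ZipFile(path) as workbook:
            return sum(
                len(FORMULA_TAG.findall(workbook.read(name)))
                for name in workbook.namelist()
                if name.startswith("xl/worksheets/") and name.endswith(".xml")
            )
    except zipfile.BadZipFile:
        return 0


def discover_spreadsheets(root):
    """Return every spreadsheet under root with a formula count for each."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SPREADSHEET_EXTENSIONS:
            results.append({"path": str(path), "formulas": count_formulas(path)})
    return results
```

Calling discover_spreadsheets on a shared-drive path returns one record per spreadsheet found. Files with a formula count of zero are natural candidates for the non-computational bin, although that determination would still call for manual review.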

A Hybrid Approach to End-User Computing Controls

As with many things, the best solution probably lies somewhere in the middle—drawing from the benefits of both top-down and bottom-up approaches.

While a pure top-down approach is usually too involved to be practical on its own, elements of a top-down approach can enlighten and facilitate a bottom-up process. For example, a bottom-up process may identify several spreadsheets whose complexity and perceived importance to the departments that use them make them appear to be high-risk EUCs in need of review. However, a top-down review may reveal that these spreadsheets ultimately do not contribute to financial or enterprise-wide management reporting. It could be that the importance of some spreadsheets does not extend far enough beyond the business unit that owns them to require an independent review. Furthermore, being able to connect the dots between spreadsheets that are identified using a bottom-up approach and individual financial statement/management report entries can help ensure that all important entries are accounted for.

A hybrid approach—one that is informed both by an understanding of critical reporting items and a series of comprehensive, automated discovery scans—introduces the virtues of both methods and is most likely to yield an EUC inventory that is both comprehensive and aligned with an institution’s risk profile.

Validating Interest Rate Models

Many model validations, particularly validations of market risk models, ALM models, and mortgage servicing rights valuation models, require validators to evaluate an array of sub-models. These almost always include at least one interest rate model, which is designed to predict the movement of interest rates.

Validating interest rate models (i.e., short-rate models) can be challenging because many different ways of modeling how interest rates change over time (“interest rate dynamics”) have been created over the years. Each approach has advantages and shortcomings, and it is critical to understand them in order to judge whether the short-rate model being used is appropriate to the task. This can be accomplished via the basic tenets of model validation: evaluation of conceptual soundness, replication, benchmarking, and outcomes analysis. Applying these concepts to interest rate models, however, poses some unique complications.

A Brief Introduction to the Short-Rate Model

In general, a short-rate model describes the evolution of the short rate as a stochastic differential equation. Short-rate models can be categorized based on their interest rate dynamics.

A one-factor short-rate model has only one diffusion term. The biggest limitation of one-factor models is that the correlation between continuously compounded spot rates for any two maturities is equal to one, which means a shock at a given maturity is transmitted fully across the curve. This is not realistic in the market.

A multi-factor short-rate model, as its name implies, contains more than one diffusion term. Unlike one-factor models, multi-factor models consider the correlation between forward rates, which makes a multi-factor model more realistic and consistent with actual multi-dimensional yield curve movements.

Validating Conceptual Soundness

Validating an interest rate model’s conceptual soundness includes reviewing its data inputs, mean-reversion feature, distributions of short rate, and model selection. Reviewing these items sufficiently requires a validator to possess a basic knowledge of stochastic calculus and stochastic differential equations.

Data Inputs

The fundamental data inputs to the interest rate model could be the zero-coupon curve (also known as the term structure of interest rates) or the historical spot rates. Let’s take the Hull-White (H-W) one-factor model (H-W: dr_t = k(θ – r_t)dt + σ_t dw_t) as an example. H-W is an affine term structure model, and analytical tractability is one of its most favorable properties. Analytical tractability is a valuable feature to model validators because it enables calculations to be replicated. We can calibrate the level parameter (θ) and the rate parameter (k) from the input curve. Commonly, the volatility parameter (σ_t) can be calibrated from historical data or swaption volatilities. In addition, analytical formulas are available for zero-coupon bonds, caps/floors, and European swaptions.

Mean Reversion

Given the nature of mean reversion, both the level parameter and rate parameter should be positive. Therefore, an appropriate calibration method should be selected accordingly. Note that the common approaches for the one-factor model (least squares estimation and maximum likelihood estimation) could generate negative parameter estimates, which are inconsistent with the mean-reversion feature. The model validator should compare the calibration results from different methods to see which method best respects the model assumption.
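One way to see how a least squares calibration can produce inadmissible parameters is to write it out. The sketch below assumes constant k and θ and historical rates observed at a fixed interval dt, and uses the standard discretization r[t+1] = a + b·r[t] + noise with b = exp(-k·dt) and a = θ(1 - b). Regressing each rate on its predecessor gives a and b, from which k and θ follow. If the fitted slope b is 1 or greater, the implied k is not positive and the calibration conflicts with mean reversion. The function name and the returned "admissible" flag are illustrative choices, not part of any particular library.

```python
import numpy as np


def calibrate_mean_reversion(rates, dt):
    """Least squares estimate of (k, theta) from equally spaced historical rates."""
    r = np.asarray(rates, dtype=float)
    previous, current = r[:-1], r[1:]
    b, a = np.polyfit(previous, current, 1)   # slope, intercept of r[t+1] on r[t]

    if b <= 0.0 or b == 1.0:
        raise ValueError("fitted slope does not map to finite k and theta")

    k = float(-np.log(b) / dt)      # mean-reversion speed
    theta = float(a / (1.0 - b))    # long-run level
    return {"k": k, "theta": theta, "admissible": bool(k > 0.0 and theta > 0.0)}


# Noise-free check: a path generated with k = 0.5 and theta = 0.04
# should return (approximately) those same values.
dt = 1.0 / 12.0
path = [0.01]
for _ in range(120):
    path.append(0.04 + (path[-1] - 0.04) * np.exp(-0.5 * dt))

estimate = calibrate_mean_reversion(path, dt)
print(estimate)
```

A validator might run the same historical series through more than one estimation method and compare the results, paying particular attention to any method that returns an inadmissible estimate.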

Short-Rate Distribution and Model Selection

The distribution of the short rate is another feature that we need to consider when we validate the short-rate model assumptions. The original short-rate models (Vasicek and H-W, for example) presume the short rate to be normally distributed, allowing for the possibility of negative rates. Because negative rates were not expected to be seen in the simulated term structures, the Cox-Ingersoll-Ross model (CIR, non-central chi-squared distributed) and Black-Karasinski model (BK, lognormally distributed) were invented to preclude the existence of negative rates. Compared to the normally distributed models, the non-normally distributed models forfeit a certain degree of analytical tractability, which makes validating them less straightforward. In recent years, as negative rates became a reality in the market, the shifted lognormal model was introduced. This model depends on the shift size, which determines a lower limit in the simulation process. Note there is no analytical formula for the shift size. Ideally, the shift size should be equal to the absolute value of the minimum negative rate in the historical data. However, not every country has experienced negative interest rates, and therefore the shift size is generally determined by the user’s judgment, informed by fundamental analysis.

The model validator should develop a method to quantify the risk from any analytical judgment. Because the interest rate model often serves as a sub-model in a larger module, the model selection should also be commensurate with the module’s ultimate objectives.

Replication

Effective model validation frequently relies on a replication exercise to determine whether a model follows the building procedures stated in its documentation. In general, the model documentation provides the estimation method and assorted data inputs. The model validator could consider recalibrating the parameters from the provided interest rate curve and volatility structures. This process helps the model validator better understand the model, its limitations, and potential problems.

Ongoing Monitoring & Benchmarking

Interest rate models are generally used to simulate term structures in order to price caps/floors and swaptions and measure the hedge cost. Let’s again take the H-W model as an example. Two standard simulation methods are available for the H-W model: 1) Monte Carlo simulation and 2) trinomial lattice method. The model validator could use these two methods to perform benchmarking analysis against one another.

Monte Carlo simulation is well suited to path-dependent interest rate derivatives. The Monte Carlo method is mathematically easy to understand and convenient to implement. At each time step, a random variable is simulated and added into the interest rate dynamics. A Monte Carlo simulation is usually considered for products that can only be exercised at maturity. Because the Monte Carlo method simulates rates forward in time, we cannot be sure at any given step whether exercising an option at that point is optimal. Hence, a standard Monte Carlo approach cannot be used for derivatives with early-exercise capability.
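The time-stepping described above can be sketched with a simple Euler discretization of dr_t = k(θ - r_t)dt + σ dw_t. For simplicity, the example assumes constant k, θ, and σ (which makes it the Vasicek special case rather than a curve-fitted H-W implementation), and every parameter value is invented for illustration. As a sanity check, the average simulated terminal rate is compared with the known expected value θ + (r_0 - θ)exp(-kT).

```python
import numpy as np


def simulate_short_rate(r0, k, theta, sigma, horizon, n_steps, n_paths, seed=0):
    """Simulate short-rate paths with an Euler scheme; returns (n_paths, n_steps + 1)."""
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps
    rates = np.empty((n_paths, n_steps + 1))
    rates[:, 0] = r0
    for step in range(n_steps):
        # One Brownian increment per path, with variance dt.
        shock = rng.normal(0.0, np.sqrt(dt), n_paths)
        drift = k * (theta - rates[:, step]) * dt
        rates[:, step + 1] = rates[:, step] + drift + sigma * shock
    return rates


# Illustrative parameters only.
paths = simulate_short_rate(
    r0=0.03, k=0.5, theta=0.05, sigma=0.01, horizon=5.0, n_steps=60, n_paths=20_000
)
expected_mean = 0.05 + (0.03 - 0.05) * np.exp(-0.5 * 5.0)

print(f"simulated mean terminal rate: {paths[:, -1].mean():.5f}")
print(f"analytical expected rate:     {expected_mean:.5f}")
```

The two figures should agree closely but not exactly, because the Euler scheme introduces a small discretization error on top of the sampling error.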

On the other hand, we can price early-exercise products by means of the trinomial lattice method. The trinomial lattice method constructs a trinomial tree under the risk-neutral measure, in which the value at each node can be computed. Because the tree is evaluated by backward induction, at each node we can compare the intrinsic value (the value of exercising immediately) with the backward-induced value (the continuation value) to determine whether to exercise at that node. The comparison step runs backward until it reaches the initial node and returns the final estimated value. Therefore, the trinomial lattice works ideally for non-path-dependent interest rate derivatives. Nevertheless, a lattice can also be implemented for path-dependent derivatives for the purpose of benchmarking.

Normally, we would expect to see that the simulated result from the lattice method is less accurate and more volatile than the result from the Monte Carlo simulation method, because a larger number of simulated paths can be selected in the Monte Carlo method. This will make the simulated result more stable, assuming the same computing cost and the same time step.

Outcomes Analysis

The most straightforward method for outcomes analysis is to perform sensitivity tests on the model’s key drivers. A standardized one-factor short-rate model usually contains three parameters. For the level parameter (θ), we can estimate the equilibrium rate level from the simulated term structure and compare it with θ. For the mean-reversion speed parameter (k), we can examine the half-life, which equals ln(2)/k, and compare it with the realized half-life from the simulated term structure. For the volatility parameter (σ_t), we would expect a larger volatility to yield a larger spread in the simulated term structure. We can also recalibrate the volatility surface from the simulated term structure to examine whether the number of simulated paths is sufficient to capture the volatility assumption.
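The half-life has a direct interpretation: it is the time the expected rate needs to close half of its gap to the level θ. The short sketch below, which assumes constant k and θ and uses made-up parameter values, computes the half-life and confirms that interpretation against the expected rate path θ + (r_0 - θ)exp(-kt).

```python
import math


def half_life(k):
    """Time for the expected gap between the short rate and theta to halve."""
    if k <= 0.0:
        raise ValueError("mean-reversion speed k must be positive")
    return math.log(2.0) / k


def expected_rate(r0, k, theta, t):
    """Expected short rate at time t under constant k and theta."""
    return theta + (r0 - theta) * math.exp(-k * t)


# Illustrative parameters only.
r0, k, theta = 0.03, 0.5, 0.05
t_half = half_life(k)
gap_at_start = r0 - theta
gap_at_half_life = expected_rate(r0, k, theta, t_half) - theta

print(f"half-life: {t_half:.4f} years")
print(f"initial gap: {gap_at_start:.4f}, gap at half-life: {gap_at_half_life:.4f}")
```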

As mentioned above, an affine term structure model is analytically tractable, which means we can use the analytical formula to price zero-coupon bonds and other interest rate derivatives. We can compare the model results with the market prices, which can also verify the functionality of the given short-rate model.
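As an example of such an analytical formula, the zero-coupon bond price in the constant-parameter (Vasicek) case of the dynamics dr_t = k(θ - r_t)dt + σ dw_t, taken under the risk-neutral measure, has the affine form P(0,T) = A(T)·exp(-B(T)·r_0), with B(T) = (1 - exp(-kT))/k and ln A(T) = (θ - σ²/(2k²))(B(T) - T) - σ²B(T)²/(4k). The sketch below implements that formula. It assumes constant parameters, so it does not reproduce an H-W model fitted to an input curve, and the parameter values are purely illustrative.

```python
import math


def vasicek_zero_coupon_price(r0, k, theta, sigma, maturity):
    """Price of a zero-coupon bond paying 1 at maturity (constant-parameter case)."""
    b = (1.0 - math.exp(-k * maturity)) / k
    ln_a = (theta - sigma ** 2 / (2.0 * k ** 2)) * (b - maturity) \
        - sigma ** 2 * b ** 2 / (4.0 * k)
    return math.exp(ln_a - b * r0)


# Illustrative parameters only.
price = vasicek_zero_coupon_price(r0=0.03, k=0.5, theta=0.05, sigma=0.01, maturity=5.0)
implied_yield = -math.log(price) / 5.0
print(f"5-year zero-coupon price: {price:.5f}, implied yield: {implied_yield:.4%}")
```

A useful check when replicating a vendor implementation is the degenerate case: with zero volatility and r_0 equal to θ, the formula collapses to exp(-θT), the price under a flat, deterministic rate.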

Conclusion

The popularity of certain types of interest rate models changes as fast as the economy. In order to keep up, it is important to build a wide range of knowledge and continue learning new perspectives. Validation processes that follow the guidelines set forth in the OCC’s and FRB’s Supervisory Guidance on Model Risk Management (OCC 2011-12 and SR 11-7) seek to answer questions about the model’s conceptual soundness, development, process, implementation, and outcomes. While the details of the actual validation process vary from bank to bank and from model to model, an interest rate model validation should seek to address these matters by asking the following questions:

Are the data inputs consistent with the assumptions of the given short-rate model?
What distribution do the interest rate dynamics imply for the short-rate model?
What kind of estimation method is applied in the model?
Is the model analytically tractable? Are there explicit analytical formulas for zero-coupon bonds or bond options from the model?
Is the model suitable for the Monte Carlo simulation or the lattice method?
Can we recalibrate the model parameters from the simulated term structures?
Does the model address the needs of its users?

These are the fundamental questions that we need to think about when we are trying to validate any interest rate model. Combining these with additional questions specific to the individual rate dynamics in use will yield a robust validation analysis that will satisfy both internal and regulatory demands.

AML Model Validation: Effective Process Verification Requires Thorough Documentation

Increasing regulatory scrutiny due to the catastrophic risk associated with anti-money-laundering (AML) non-compliance is prompting many banks to tighten up their approach to AML model validation. Because AML applications would be better classified as highly specialized, complex systems of algorithms and business rules than as “models,” applying model validation techniques to them presents some unique challenges that make documentation especially important.

In addition to devising effective challenges to determine the “conceptual soundness” of an AML system and whether its approach is defensible, validators must determine the extent to which various rules are firing precisely as designed. Rather than commenting on the general reasonableness of outputs based on back-testing and sensitivity analysis, validators must rely more heavily on a form of process verification that requires precise documentation.

Vendor Documentation of Transaction Monitoring Systems

Above-the-line and below-the-line testing—the backbone of most AML transaction monitoring testing—amounts to a process verification/replication exercise. For any model replication exercise to return meaningful results, the underlying model must be meticulously documented. If not, validators are left to guess at how to fill in the blanks. For some models, guessing can be an effective workaround. But it seldom works well when it comes to a transaction monitoring system and its underlying rules. Absent documentation that describes exactly what rules are supposed to do, and when they are supposed to fire, effective replication becomes nearly impossible.
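The replication idea can be sketched in a few lines. The example below assumes a single hypothetical rule (aggregate cash deposits over a trailing look-back window meeting or exceeding a threshold) and a made-up set of transactions and system alerts. Real rules, field names, and thresholds would come from the vendor and model owner documentation.

```python
from collections import defaultdict
from datetime import date, timedelta

def replicate_rule(transactions, threshold, lookback_days):
    """Independently re-run a hypothetical aggregation rule.

    Returns the set of (customer_id, as_of_date) pairs for which deposits in the
    trailing look-back window (ending on as_of_date, inclusive) reach the threshold.
    """
    by_customer = defaultdict(list)
    for customer_id, txn_date, amount in transactions:
        by_customer[customer_id].append((txn_date, amount))
    window = timedelta(days=lookback_days)
    alerts = set()
    for customer_id, txns in by_customer.items():
        for as_of, _ in txns:
            total = sum(a for d, a in txns if as_of - window < d <= as_of)
            if total >= threshold:
                alerts.add((customer_id, as_of))
    return alerts

def compare_alerts(system_alerts, replicated_alerts):
    """Reconcile the system's alerts against the validator's replication."""
    return {
        "matched": system_alerts & replicated_alerts,
        "missed_by_system": replicated_alerts - system_alerts,
        "unexplained_system": system_alerts - replicated_alerts,
    }

# Illustrative data only.
transactions = [
    ("C1", date(2024, 1, 2), 6000.0),
    ("C1", date(2024, 1, 4), 5000.0),
    ("C2", date(2024, 1, 3), 9500.0),
    ("C3", date(2024, 1, 5), 2000.0),
]
system_alerts = {("C1", date(2024, 1, 4))}   # what the AML system reported

# Above the line: replicate at the documented threshold and reconcile.
atl = replicate_rule(transactions, threshold=10000.0, lookback_days=5)
reconciliation = compare_alerts(system_alerts, atl)

# Below the line: lower the threshold and list activity that only then alerts.
btl = replicate_rule(transactions, threshold=9000.0, lookback_days=5)
btl_only = btl - atl
```

Every choice in this sketch (whether the window is inclusive, whether the comparison is "greater than" or "greater than or equal to", which transaction types count) is exactly the kind of detail that has to be documented for the replication to be meaningful.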

Anyone who has validated an AML transaction monitoring system knows that it comes with a truckload of documentation. Vendor documentation is often quite thorough and does a reasonable job of laying out the solution’s approach to assessing transaction data and generating alerts. It typically explains how relevant transactions are identified and what suspicious activity each rule is seeking to detect, and it usually provides a reasonably detailed description of the algorithms and logic each rule applies.

This information provided by the vendor is valuable and critical to a validator’s ability to understand how the solution is intended to work. But because so much more is going on than what can reasonably be captured in vendor documentation, it alone provides insufficient information to devise above-the-line and below-the-line testing that will yield worthwhile results.

Why An AML Solution’s Vendor Documentation is Not Enough

Every model validator knows that model owners must supplement vendor-supplied documentation with their own. This is especially true with AML solutions, in which individual user settings—thresholds, triggers, look-back periods, white lists, and learning algorithms—are arguably more crucial to the solution’s overall performance than the rules themselves.

Comprehensive model owner documentation helps validators (and regulatory supervisors) understand not only that AML rules designed to flag suspicious activity are firing correctly, but also that each rule is sufficiently understood by those who use the solution. It also provides the basis for a validator’s testing that rules are calibrated reasonably. Testing these calibrations is analogous to validating the inputs and assumptions of a predictive model. If they are not explicitly spelled out, then they cannot be evaluated.

Here are some examples.

Transaction Input Transformations

Details about how transaction data streams are mapped, transformed, and integrated into the AML system’s database vary by institution and cannot reasonably be described in generic vendor documentation. Consequently, owner documentation needs to fully describe this. To pass model validation muster, the documentation should also describe the review process for input data and field mapping, along with all steps taken to correct inaccuracies or inconsistencies as they are discovered.

Mapping and importing AML transaction data is sometimes an inexact science. To mitigate risks associated with missing fields and customer attributes, risk-based parameters must be established and adequately documented. This documentation enables validators who test the import function to go into the analysis with both eyes open. Validators must be able to understand the circumstances under which proxy data is used in order to make sound judgments about the reasonableness and effectiveness of established proxy parameters and how well they are being adhered to. Ideally, documentation pertaining to transaction input transformation should describe the data validations that are performed and define any error messages that the system might generate.
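As a rough illustration of what documented data validations, proxy parameters, and error messages might look like in practice, consider the sketch below. The field names, the proxy value, and the message wording are all hypothetical and stand in for whatever an institution's own documentation defines.

```python
# Hypothetical field requirements and risk-based proxy values.
REQUIRED_FIELDS = ("txn_id", "customer_id", "amount", "currency")
PROXY_DEFAULTS = {"country": "UNKNOWN_HIGH_RISK"}

def validate_transaction(record):
    """Check one imported record; return (cleaned_record, messages)."""
    cleaned = dict(record)
    messages = []

    # Required fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if cleaned.get(field) in (None, ""):
            messages.append(f"ERROR: missing required field '{field}'")

    # Amounts must be numeric once imported.
    amount = cleaned.get("amount")
    if amount not in (None, ""):
        try:
            cleaned["amount"] = float(amount)
        except (TypeError, ValueError):
            messages.append(f"ERROR: amount {amount!r} is not numeric")

    # Missing customer attributes fall back to a documented proxy, and the
    # substitution is logged so a validator can see when proxies were used.
    for field, proxy in PROXY_DEFAULTS.items():
        if cleaned.get(field) in (None, ""):
            cleaned[field] = proxy
            messages.append(f"WARN: '{field}' missing; proxy value '{proxy}' applied")

    return cleaned, messages

good, good_msgs = validate_transaction(
    {"txn_id": "T1", "customer_id": "C1", "amount": "250.00", "currency": "USD", "country": ""}
)
bad, bad_msgs = validate_transaction(
    {"txn_id": "T2", "customer_id": "", "amount": "abc", "currency": "USD", "country": "US"}
)
```

A validator testing the import function could run a sample of source records through checks like these and compare the proxy usage and error counts against what the owner documentation says should happen.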

Risk Scoring Methodologies and Related Monitoring

Specific methodologies used to risk score customers and countries and assign them to various lists (e.g., white, gray, or black lists) also vary enough by institution that vendor documentation cannot be expected to capture them. Processes and standards employed in creating and maintaining these lists must be documented. This documentation should include how customers and countries get on these lists to begin with, how frequently they are monitored once they are on a list, what form that monitoring takes, the circumstances under which they can move between lists, and how these circumstances are ascertained. These details are often known and usually codified (to some degree) in BSA department procedures. This is not sufficient. They should be incorporated in the AML solution’s model documentation, which should also include data sources and a log capturing the history of customers and countries moving to and from the various risk ratings and lists.
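One possible shape for such a history log is sketched below. The record fields and list names are assumptions made for illustration, not a prescribed schema. The point is simply that each move carries a date, a reason, and a data source, so list membership on any past date can be reconstructed.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ListChange:
    """One documented move of a customer or country between lists."""
    entity_id: str
    from_list: str
    to_list: str
    effective: date
    reason: str
    data_source: str

def list_as_of(log, entity_id, as_of, default="unlisted"):
    """Reconstruct which list an entity was on at a given date."""
    current = default
    for change in sorted(log, key=lambda c: c.effective):
        if change.entity_id == entity_id and change.effective <= as_of:
            current = change.to_list
    return current

# Illustrative entries only.
log = [
    ListChange("CUST-001", "unlisted", "gray", date(2023, 3, 1),
               "unusual wire activity", "transaction monitoring alert"),
    ListChange("CUST-001", "gray", "white", date(2023, 9, 15),
               "enhanced due diligence completed", "BSA analyst review"),
]

status_mid_2023 = list_as_of(log, "CUST-001", date(2023, 6, 30))
status_end_2023 = list_as_of(log, "CUST-001", date(2023, 12, 31))
status_before = list_as_of(log, "CUST-001", date(2023, 1, 1))
```

With a log like this, a validator can check that a white-listed customer was actually on the white list when an alert was suppressed, rather than relying on the list as it stands today.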

Output Overrides

Management overrides are more prevalent with AML solutions than with most models. This is by design. AML solutions are intended to flag suspicious transactions for review, not to make a final judgment about them. That job is left to BSA department analysts. Too often, important metrics about the work of these analysts are not used to their full potential. Regular analysis of these overrides should be performed and documented so that validators can evaluate AML system performance and the justification underlying any tuning decisions based on the frequency and types of overrides.
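A simple form of this analysis is the share of each rule's alerts that analysts close without further action. The sketch below assumes a hypothetical disposition log with two outcome labels. Actual disposition categories would come from the institution's case management system.

```python
from collections import Counter

def override_rates(dispositions, override_label="closed_no_action"):
    """Return, per rule, the fraction of alerts that analysts overrode."""
    totals, overridden = Counter(), Counter()
    for rule_id, outcome in dispositions:
        totals[rule_id] += 1
        if outcome == override_label:
            overridden[rule_id] += 1
    return {rule: overridden[rule] / totals[rule] for rule in totals}

# Illustrative disposition log: (rule_id, analyst outcome).
dispositions = [
    ("R1", "closed_no_action"),
    ("R1", "closed_no_action"),
    ("R1", "escalated"),
    ("R1", "closed_no_action"),
    ("R2", "escalated"),
    ("R2", "closed_no_action"),
]

rates = override_rates(dispositions)
```

A rule whose alerts are almost always overridden may be a candidate for tuning, but that decision needs its own documented justification. Tracking these rates over time gives validators something concrete to evaluate.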

Successful AML model validations require rule replication, and incompletely documented rules simply cannot be replicated. Transaction monitoring is a complicated, data-intensive process, and getting everything down on paper can be daunting, but AML “model” owners can take stock of where they stand by asking themselves the following questions:

Are my transaction monitoring rules documented thoroughly enough for a qualified third-party validator to replicate them? (Have I included all systematic overrides, such as white lists and learning algorithms?)
Does my documentation give a comprehensive description of how each scenario is intended to work?
Are thresholds adequately defined?
Are the data and parameters required for flagging suspicious transactions described well enough to be replicated?

If the answer to all these questions is yes, then AML solution owners can move into the model validation process reasonably confident that the state of their documentation will not hinder it.


