When someone asks you what model validation is, what is the first thing you think of? If you are like most, you immediately think of performance metrics: the quantitative indicators that tell you not only whether the model is working as intended, but also how accurate its outputs are over time and relative to alternatives. Performance testing is the core of any model validation and generally consists of the following components:

  • Benchmarking
  • Back-testing
  • Sensitivity Analysis
  • Stress Testing

Sensitivity analysis and stress testing, while critical to any model validation's performance testing, will be covered in a future article. This post focuses on the relative virtues of benchmarking and back-testing: what each is, when and how each should be used, and how to make the best use of the results of each.


Benchmarking

Benchmarking compares the model being validated against some other model or metric. The type of benchmark used will vary, as all model validation performance testing does, with the nature, use, and type of model being validated. Because of the performance information it provides, benchmarking should be employed in some form whenever a suitable benchmark can be found.

Choosing a Benchmark

Choosing what kind of benchmark to use within a model validation can sometimes be a very daunting task. Like all testing within a model validation, the kind of benchmark to use depends on the type of model being tested. Benchmarking takes many forms and may entail comparing the model’s outputs to:

  • The model’s previous version
  • An externally produced model
  • A model built by the validator
  • Other models and methodologies considered by the model developers, but not chosen
  • Industry best practice
  • Thresholds and expectations of the model’s performance

One of the most common benchmarking approaches is to compare a new model's outputs to those of the version it is replacing. Models are routinely replaced throughout the industry due to deteriorating performance, a change in risk appetite, new regulatory guidance, the need to capture new variables, or the availability of new data. In these cases, it is important not only to document but also to demonstrate that the new model performs better and does not share the issues that triggered the old model's replacement.

Another common benchmarking approach compares the model's outputs to those of an external "challenger" model (or one built by the validator) that serves the same objective and uses the same data. Because the challenger is developed and updated on the same data as the champion model, this approach tends to yield more apt comparisons than benchmarking against older versions, which are likely to be out of date.
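A champion-versus-challenger comparison can be sketched in a few lines of Python. The scores below are invented for illustration; in practice both models would be scored on the same validation dataset:

```python
# Sketch of a champion-vs-challenger benchmark on hypothetical scores.
champion_scores   = [0.10, 0.25, 0.40, 0.55, 0.70]
challenger_scores = [0.12, 0.22, 0.45, 0.50, 0.75]

# Average absolute divergence between the two models' outputs.
n = len(champion_scores)
mean_abs_diff = sum(abs(c - g) for c, g in zip(champion_scores, challenger_scores)) / n

def ranks(xs):
    # Position of each case when the scores are sorted ascending.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

# Rank agreement: do the two models order the same cases the same way?
rank_agreement = ranks(champion_scores) == ranks(challenger_scores)

print(f"mean absolute difference: {mean_abs_diff:.3f}")
print(f"identical rank ordering:  {rank_agreement}")
```

A small mean divergence with identical rank ordering suggests the two models broadly agree; large or systematic divergence is what the validator would then investigate.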

A third benchmark set comprises the alternative models and methodologies that the model developers considered for the model being validated but ultimately did not use. As a best practice, model developers should document any alternative methodologies, theories, or data omitted from the model's final version. Model validators, in turn, should leverage their experience and understanding of current industry best practices, along with any analysis previously performed on similar models, and use these alternatives as benchmarks for the model being validated.

Model validators have multiple, distinct ways to incorporate benchmarking into their analysis. The use of the different types of benchmarking discussed here should be based on the type of model, its objective, and the validator’s best judgment. If a model cannot be reasonably benchmarked, then the validator should record why not and discuss the resulting limitations of the validation.


Back-Testing

Back-testing measures model outcomes directly. Here, instead of measuring performance by comparison, the validator measures whether the model is both working as intended and accurate. Back-testing can take many forms depending on the model's objective. As with benchmarking, back-testing should be part of every full-scope model validation to the extent possible.

What Back-Tests to Perform

As a form of outcomes analysis, back-testing provides quantitative metrics which measure the performance of a model’s forecast, the accuracy of its estimates, or its ability to rank-order risk. For instance, if a model produces forecasts for a given variable, back-testing would involve comparing the model’s forecast values against actual outcomes, thus indicating its accuracy.
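Such a forecast back-test might look like the following sketch, where all forecast and actual values are hypothetical:

```python
# Hypothetical back-test of a forecast model: compare forecasts to actuals.
forecasts = [102.0, 98.5, 105.0, 110.2]   # model's forecast values
actuals   = [100.0, 99.0, 103.5, 108.0]   # observed outcomes

errors = [f - a for f, a in zip(forecasts, actuals)]

# Mean absolute error: average size of a miss, in the variable's own units.
mae = sum(abs(e) for e in errors) / len(errors)

# Mean absolute percentage error: misses relative to the actual values.
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actuals)) / len(actuals)

print(f"MAE: {mae:.2f}")
print(f"MAPE: {mape:.2%}")
```

Whether a given MAE or MAPE is acceptable depends on the model's stated performance expectations, which is where back-testing connects to the thresholds-and-expectations benchmarks discussed above.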

A related function of model back-testing evaluates the ability of a given model to adequately measure risk. This risk could take any of several forms, from the probability that a given borrower will default to the likelihood of a large loss during a given trading day. To back-test a model's ability to capture risk exposure, it is important first to collect the right data. To back-test a probability of default model, for example, the collected data must include cases in which borrowers actually defaulted, so that the model's predictions can be tested against real outcomes.

Back-testing models that assign borrowers to various risk levels necessitates some special considerations. Back-testing these and other models that seek to rank-order risk involves examining the model's performance history and its ability to rank-order risk accurately. This can involve analyzing both Type 1 (false positive) and Type 2 (false negative) statistical errors against the model's true positive and true negative rates. Common statistical tests used for this type of back-testing analysis include, but are not limited to, the Kolmogorov-Smirnov (KS) statistic, the Brier score, and the Receiver Operating Characteristic (ROC) curve.
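These three metrics can be computed directly from predicted default probabilities and observed outcomes. The probabilities and labels below are made up for illustration, and the implementations are simple textbook forms of each statistic:

```python
# Illustrative rank-ordering back-test: predicted default probabilities
# against observed defaults (1 = defaulted). All data are hypothetical.
probs  = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.75, 0.90]
labels = [0,    0,    0,    1,    0,    1,    1,    1   ]

# Brier score: mean squared error of the probability forecasts (lower is better).
brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# ROC AUC via pairwise comparison: the probability that a randomly chosen
# defaulter is scored above a randomly chosen non-defaulter (ties count half).
pos = [p for p, y in zip(probs, labels) if y == 1]
neg = [p for p, y in zip(probs, labels) if y == 0]
auc = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg) / (len(pos) * len(neg))

# KS statistic: maximum separation between the cumulative score distributions
# of defaulters and non-defaulters.
ks = max(
    abs(sum(p <= t for p in pos) / len(pos) - sum(p <= t for p in neg) / len(neg))
    for t in probs
)

print(f"Brier: {brier:.3f}  AUC: {auc:.3f}  KS: {ks:.3f}")
```

A high AUC and KS with a low Brier score would indicate the model both rank-orders risk well and produces well-calibrated probabilities; in practice the validator compares these values against the model's documented performance thresholds.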

Benchmarking vs. Back-Testing

Back-testing measures a model's outcomes and accuracy against real-world observations, while benchmarking measures those outcomes against those of other models or metrics. Some overlap exists when benchmarking includes comparing how well different models' outputs back-test against real-world observations and the chosen benchmark. This overlap sometimes leads people to mistakenly conclude that model validations can rely on just one method. In reality, back-testing and benchmarking should ideally be performed together in order to bring their individual benefits to bear in evaluating the model's overall performance. The decision, optimally, should not be whether to create a benchmark or to perform back-testing. Rather, the decision should be what form both benchmarking and back-testing should take.
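The overlap described above amounts to applying the same back-test uniformly across a set of benchmark models. A minimal sketch, with all model names and values hypothetical:

```python
# Benchmark several hypothetical models by back-testing each against the
# same observed outcomes; the resulting ranking is itself the benchmark.
actuals = [100.0, 99.0, 103.5, 108.0]
candidate_forecasts = {
    "champion":    [102.0, 98.5, 105.0, 110.2],
    "challenger":  [101.0, 99.5, 104.0, 109.0],
    "prior_model": [97.0, 96.0, 100.0, 104.0],
}

def mae(forecasts, observed):
    # Mean absolute error of one model's forecasts against the observations.
    return sum(abs(f - a) for f, a in zip(forecasts, observed)) / len(observed)

# Rank the candidate models by back-test accuracy, best first.
ranked = sorted(candidate_forecasts, key=lambda m: mae(candidate_forecasts[m], actuals))
for name in ranked:
    print(f"{name}: MAE {mae(candidate_forecasts[name], actuals):.2f}")
```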

While benchmarking and back-testing are complementary exercises that should not be viewed as mutually exclusive, their outcomes sometimes appear to produce conflicting results. What should a model validator do, for example, if the model appears to back-test well against real-world observations but does not benchmark particularly well against similar model outputs? What about a model that returns results similar to those of other benchmark models but does not back-test well? In the first scenario, the model owner can derive a measure of comfort from the knowledge that the model performs well in hindsight. But the owner also runs the very real risk of being "out on an island" if the model turns out to be wrong. The second scenario affords the comfort of company in the model's projections. But what if the models are all wrong together?

Scenarios where benchmarking and back-testing do not produce complementary results are not common, but they do happen. In these situations, it becomes incumbent on model validators to determine whether back-testing results should trump benchmarking results (or vice-versa) or if they should simply temper one another. The course to take may be dictated by circumstances. For example, a model validator may conclude that macro-economic indicators are changing to the point that a model which back-tests favorably is not an advisable tool because it is not tuned to the expected forward-looking conditions. This could explain why a model that back-tests favorably remains a benchmarking outlier if the benchmark models are taking into account what the subject model is missing. On the other hand, there are scenarios where it is reasonable to conclude that back-testing results trump benchmarking results. After all, most firms would rather have an accurate model than one that lines up with all the others.

As we have seen, benchmarking and back-testing can produce distinct or similar metrics depending on the model being validated. While those differences or similarities can sometimes be significant, both provide critical, complementary information about a model's overall performance. So when approaching a model validation and determining its scope, the choice should be what form benchmarking and back-testing should take, rather than whether to perform one versus the other.