Articles Tagged with: Credit Analytics

RiskSpan Adds CRE, C&I Loan Analytics to Edge Platform

ARLINGTON, Va., March 23, 2023 – RiskSpan, a leading technology company and the most comprehensive source for data management and analytics for mortgage and structured products, has announced the addition of commercial real estate (CRE) and commercial and industrial (C&I) loan data intake, valuation, and risk analytics to its award-winning Edge Platform. This enhancement complements RiskSpan’s existing residential mortgage toolbox and provides clients with a comprehensive toolbox for constructing and managing diverse credit portfolios.

Now more than ever, banks and credit portfolio managers need tools to construct well diversified credit portfolios resilient to rate moves and to know the fair market values of their diverse credit assets.

The new support for CRE and C&I loans on the Edge Platform further cements RiskSpan’s position as a single-source provider for loan pricing and risk management analytics across multiple asset classes. The Edge Platform’s AI-driven Smart Mapping (tape cracking) tool lets clients easily work with CRE and C&I loan data from any format. Its forecasting tools let clients flexibly segment loan datasets and apply performance and pricing assumptions by segment to generate cash flows, pricing and risk analytics.

CRE and C&I loans have long been supported by the Edge Platform’s credit loss accounting module, where users provided such loans in the Edge standard data format. The new Smart Mapping support simplifies data intake, and the new support for valuation and risk (including market risk) analytics for these assets makes Edge a complete toolbox for constructing and managing diverse portfolios that include CRE and C&I loans. These tools include cash flow projections with loan-level precision and stress testing capabilities. They empower traders and asset managers to visualize the risks associated with their portfolios like never before and make more informed decisions about their investments.

Comprehensive details of this and other new capabilities are available by requesting a no-obligation demo at


About RiskSpan, Inc. 

RiskSpan offers cloud-native SaaS analytics for on-demand market risk, credit risk, pricing and trading. With our data science experts and technologists, we are the leader in data as a service and end-to-end solutions for loan-level data management and analytics.

Our mission is to be the most trusted and comprehensive source of data and analytics for loans and structured finance investments. Learn more at

RiskSpan Incorporates Flexible Loan Segmentation into Edge Platform

ARLINGTON, Va., March 3, 2023 — RiskSpan, a leading technology company and the most comprehensive source for data management and analytics for residential mortgage and structured products, has announced the incorporation of Flexible Loan Segmentation functionality into its award-winning Edge Platform.

The new functionality makes Edge the only analytical platform offering users the option of alternating between the speed and convenience of rep-line-level analysis and the unmatched precision of loan-level analytics, depending on the purpose of their analysis.

For years, the cloud-native Edge Platform has stood alone in its ability to offer the computational scale necessary to perform loan-level analyses and fully consider each loan’s individual contribution to a mortgage or MSR portfolio’s cash flows. This level of granularity is of paramount importance when pricing new portfolios, taking property-level considerations into account, and managing tail risks from a credit/servicing cost perspective.

Not every analytical use case justifies the computational cost of a full loan-level analysis, however. For situations where speed requirements dictate the use of rep lines (such as for daily or intra-day hedging needs), the Edge Platform’s new Flexible Loan Segmentation affords users the option to perform valuation and risk analysis at the rep line level.

Analysts, traders and investors take advantage of Edge’s flexible calculation specification to run various rate and HPI scenarios, key rate durations, and other calculation-intensive metrics in an efficient and timely manner. Segment-level results run at both loan and rep line level can be easily compared to assess the impacts of each approach. Individual rep lines are easily rolled up to quickly view results on portfolio subcomponents and on the portfolio as a whole.
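As a rough illustration of what a rep line is, the sketch below collapses individual loans into one balance-weighted record per segment and then rolls the segments up to the portfolio. It is a minimal example and not the Edge Platform's implementation; the field names, the `build_rep_lines` helper, and the sample loans are all hypothetical.

```python
from collections import defaultdict

def build_rep_lines(loans, key):
    """Collapse loans into rep lines: one balance-weighted record per segment."""
    totals = defaultdict(lambda: {"balance": 0.0, "rate_x_bal": 0.0, "count": 0})
    for loan in loans:
        seg = totals[key(loan)]
        seg["balance"] += loan["balance"]
        seg["rate_x_bal"] += loan["rate"] * loan["balance"]
        seg["count"] += 1
    return {
        segment: {
            "balance": t["balance"],
            "wac": t["rate_x_bal"] / t["balance"],  # weighted-average coupon
            "loan_count": t["count"],
        }
        for segment, t in totals.items()
    }

# Hypothetical loans; segmentation by state is only one possible choice.
loans = [
    {"id": 1, "balance": 200_000.0, "rate": 3.0, "state": "VA"},
    {"id": 2, "balance": 100_000.0, "rate": 4.5, "state": "VA"},
    {"id": 3, "balance": 300_000.0, "rate": 6.0, "state": "TX"},
]
rep_lines = build_rep_lines(loans, key=lambda loan: loan["state"])

# Roll the rep lines up to the portfolio level.
portfolio_balance = sum(line["balance"] for line in rep_lines.values())
for segment, line in rep_lines.items():
    print(segment, line)
print("portfolio balance:", portfolio_balance)
```

Changing the `key` function changes the segmentation, which is the sense in which loan-level and rep-line-level results can be compared segment by segment.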

Comprehensive details of this and other new capabilities are available by requesting a no-obligation demo at

This new functionality is the latest in a series of enhancements that further the Edge Platform’s objective of providing frictionless insight to Agency MBS traders and investors, knocking down barriers to efficient, clear and data-driven valuation and risk assessment.



RiskSpan’s Snowflake Tutorial Series: Ep. 1

Learn how to create a new Snowflake database and upload large loan-level datasets

The first episode of RiskSpan’s Snowflake Tutorial Series has dropped!

This six-minute tutorial succinctly demonstrates how to:

  1. Set up a new Snowflake database
  2. Use SnowSQL to load large datasets (28 million mortgage loans in this example)
  3. Use internal staging (without a cloud provider)
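For readers who want a feel for these steps before watching, the sketch below assembles the kind of statements such a load typically involves: create the database, create an internal named stage, upload a local file with `PUT`, and copy it into a table. It is not taken from the video. The `build_load_statements` helper, the object names, and the file path are hypothetical, and the target table is assumed to exist already.

```python
def build_load_statements(database, table, stage, local_file):
    """Return, in order, the statements for a stage-and-copy load.

    All names are placeholders. PUT is a client-side command, so the
    statements are meant to be run from SnowSQL rather than a worksheet.
    """
    return [
        f"CREATE DATABASE IF NOT EXISTS {database};",
        f"USE DATABASE {database};",
        # An internal named stage: the files live inside Snowflake,
        # so no external cloud storage bucket is required.
        f"CREATE STAGE IF NOT EXISTS {stage};",
        f"PUT file://{local_file} @{stage} AUTO_COMPRESS=TRUE;",
        # Assumes the target table exists and the file is CSV with a header row.
        f"COPY INTO {table} FROM @{stage} FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);",
    ]

statements = build_load_statements("LOANS_DB", "LOANS", "loan_stage", "/tmp/loans.csv")
print("\n".join(statements))
```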

This is the first in what is expected to be a 10-part tutorial series demonstrating how RiskSpan’s Snowflake integration makes mortgage and structured finance analytics easier than ever before.

Future topics will include:

  • Executing complex queries using Python functions in Snowflake’s SQL
  • External Tables (accessing data without a database)
  • OLAP vs OLTP and hybrid tables in Snowflake
  • Time Travel functionality, clone and data replication
  • Normalizing data and creating a single materialized view
  • Dynamic tables data concepts in Snowflake
  • Data share
  • Data masking
  • Snowpark: Data analysis (pandas) functionality in Snowflake

A Practical Approach to Climate Risk Assessment for Mortgage Finance

Note: The following is the introduction from RiskSpan’s contribution to a series of essays on Climate Risk and the Housing Market published this month by the Mortgage Bankers Association’s Research Institute for Housing America.

Significant uncertainty exists about how climate change will occur, how all levels of government will intervene or react to chronic risks like sea level rise, and how households, companies, and financial markets will respond to various signals that will create movements in prices, demographics, and economic activity even before climate risk manifests. This paper lays out a pragmatic framework for assessing these risks from the perspective of a mortgage company. We evaluate available public and proprietary data sources and address data limitations, such as different sources providing a different view of risk for a particular property. We propose a sensitivity analysis approach to quantify risk and mitigate the uncertainties in measuring and responding to climate change.

Global temperatures will continue to increase over the next 50 years regardless of the actions people and governments take. The impacts of that warming are expected to accumulate and become more severe and frequent over time, causing stress throughout our economy. Regulators are clearly signaling that climate risk analysis will need to become a regular part of risk management activities. But detailed, industry-specific guidance has not been defined. FHFA and the regulated entities have yet to release a climate risk framework. They clearly recognize the threat to the housing finance system, however, and are actively working towards accounting for these risks.

Most executives and boards have become conceptually familiar with the physical and transition risks of climate change. But significant questions remain around how these concepts translate into specific, quantifiable business, asset, regulatory, legal, and reputation risks in the housing finance industry. Further complicating matters, climate science continues to evolve and there is limited historical data to understand how the effects of climate change will trickle into the housing market.

Sean Becketti¹ describes the myriad ways climate change and natural hazard risk can permeate the housing and housing finance industries as well as some of the ways to mitigate their effects. However, quantifying these risks and inserting them into mortgage credit and prepayment models comes with significant challenges. No “best practices” have emerged for incorporating these into traditional model frameworks.

This paper puts forth a practical framework to incorporate climate risk into existing enterprise risk management practices for the housing finance industry. The framework incorporates suggestions to prepare for coming regulatory requirements on climate risk and, more importantly, to proactively manage and mitigate this risk. Our approach is based on over two years of research and field work RiskSpan has conducted with its clients, and the resulting models RiskSpan has developed to deliver insights into these risks.

The paper is organized into two main sections:

  1. Prescribed Climate Scenarios and Emerging Regulatory Requirements
  2. A Practical Approach to Climate Risk Assessment for Mortgage Finance

Layering climate risk into enterprise risk management is likely to be a multiyear process. This paper focuses on steps to take in the initial one to two years after climate risk has been prioritized for investment of time and resources by corporate leadership. As explained in an MBA white paper from June 2022,² “Existing risk management practices, structures, and relationships are already capturing potential risks from climate change.” The aim of this paper is to investigate specific ways in which existing credit, operational, and market risk frameworks can be leveraged to address this challenge, rather than seeking to reinvent the wheel.

Temporary Buydowns are Back. What Does This Mean for Speeds?

Mortgage buydowns are having a déjà vu moment. Some folks may recall mortgages with teaser rates in the pre-crisis period. Temporary buydowns are similar in concept. Recent declines notwithstanding, mortgage rates are still higher than they have been in years. Housing remains pricey. Would-be home buyers are looking for any help they can get. On the other hand, with an almost non-existent refi market, mortgage originators are trying to find innovative ways to keep the production machine going. Conditions are ripe for lender and/or builder concessions that will help close the deal.

Enter the humble “temporary” mortgage interest rate buydown. A HousingWire article last month addressed the growing trend. It’s hard to turn on the TV without being bombarded with ads for Rocket Mortgage’s “Inflation Buster” program. Rocket Mortgage doesn’t use the term temporary buydown in its TV spots, but that is what it is.

Buydowns, in general, refer to when a borrower pays “points” upfront to reduce the mortgage rate to a level where they can afford the monthly payment. The mortgage rate is “bought down” from its original rate for the entire life of the mortgage by paying a lump sum upfront.

Temporary buydowns, on the other hand, come in various shapes and sizes, but the most common are a “2-1” (a 2-percentage-point interest rate reduction in the first year and a 1-point reduction in year two) and a “1-0” (a 1-percentage-point reduction in the first year only). In these situations, the seller, the builder, the lender, or some combination thereof puts up money to cover the difference in interest payments between the original mortgage rate and the reduced mortgage rate. In the 2-1 example, the mortgage rate is reduced by 2 percentage points for the first year, steps up by 1 point in the second year, and steps up by another point in the third year to reach the actual mortgage rate at origination. So, the interest portion of the monthly mortgage payment is “subsidized” for the first two years, after which the borrower reverts to the full monthly payment.

Given the inflated rental market, these programs can make purchasing more advantageous than renting (for home seekers trying to decide between the two options). They can also make purchasing a home more affordable (temporarily, at least) for would-be buyers who can’t afford the monthly payment at the prevailing mortgage rate. The buydown essentially buys them time to refinance into a lower rate should interest rates fall over the subsidized time frame. Alternatively, they may be expecting increased income (raises, business revenue) that will allow them to afford the unsubsidized monthly payment.
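To make the 2-1 mechanics concrete, here is a minimal sketch of the payment step-ups. It assumes a 30-year fixed-rate loan and treats the subsidy as the difference between the level payment at the note rate and the level payment at the reduced rate, which is a simplification. The loan amount, note rate, and function names are hypothetical.

```python
def monthly_payment(principal, annual_rate_pct, years=30):
    """Level monthly principal-and-interest payment on a fixed-rate loan."""
    r = annual_rate_pct / 100 / 12   # monthly rate
    n = years * 12                   # number of payments
    return principal * r / (1 - (1 + r) ** -n)

def buydown_schedule(principal, note_rate_pct, reductions_pct):
    """Payment and subsidy for each buydown year.

    reductions_pct=(2, 1) describes a 2-1 buydown; (1,) describes a 1-0.
    Simplifying assumption: the borrower pays the level payment computed
    at the reduced rate, and the subsidy covers the gap to the full payment.
    """
    full = monthly_payment(principal, note_rate_pct)
    rows = []
    for year, cut in enumerate(reductions_pct, start=1):
        reduced = monthly_payment(principal, note_rate_pct - cut)
        rows.append({
            "year": year,
            "rate_pct": note_rate_pct - cut,
            "payment": reduced,
            "monthly_subsidy": full - reduced,
        })
    return full, rows

# Hypothetical $400,000 loan with a 6% note rate and a 2-1 buydown.
full, rows = buydown_schedule(400_000, 6.0, (2, 1))
total_subsidy = 12 * sum(row["monthly_subsidy"] for row in rows)
for row in rows:
    print(f"year {row['year']}: {row['rate_pct']:.2f}% -> {row['payment']:,.2f}")
print(f"year 3 onward: 6.00% -> {full:,.2f}")
print(f"total subsidy funded upfront: {total_subsidy:,.2f}")
```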

Temporary buydowns present an interesting situation for prepayment and default modelers. Most borrowers with good credit respond similarly to refinance incentives, barring loan size and refi cost issues. Permanent buydowns, however, tend to exhibit slower speeds when they come in the money by a small amount, since the borrower needs to make a cost/benefit decision about recouping the upfront money they put down and the refi costs associated with the new loan. Their breakeven point is going to be 25bps or 50bps below their existing mortgage rate. So, their response to mortgage rates dropping will be slower than that of borrowers with similar mortgage rates who didn’t pay points upfront. Borrowers with temporary buydowns, by contrast, will be very sensitive to any mortgage rate drops and will refinance at the first opportunity to lock in a lower rate before the “subsidy” expires. Hence, such mortgages are expected to prepay at higher speeds than their counterparts with similar rates. In essence, they behave like ARMs approaching their reset dates.

When rates stay static or increase, temporary buydowns will behave like their counterparts except when they get close to their reset dates, when they will see faster speeds. Two factors would contribute to this phenomenon. The most obvious reason is that temporary buydown borrowers will want to refinance into the lowest rate available at the time of reset (perhaps an ARM). The other possibility is that some of these borrowers may not be able to refi because of DTI issues and may default. Such borrowers may also be deemed “weaker credits” because of the subsidy that they received. This increase in defaults would elevate their speeds (increased CBRs) relative to their counterparts.

So, for the reasons mentioned above, temporary buydown mortgages are expected to be the fastest within a given mortgage rate group. In the table below we separate borrowers with the same mortgage rate into 3 groups: 1) those that got a normal mortgage at the prevailing rate and paid no points, 2) those that paid points upfront to get a permanent lower rate and 3) those who got temporary lower rates subsidized by the seller/builder/lender. Obviously, the buydowns occurred in higher rate environments, but we are considering 3 borrower groups with the same mortgage rate regardless of how they got that rate. We are assuming that all 3 groups of borrowers currently have a 6% mortgage. We present the expected prepay behavior of all 3 groups in different mortgage rate environments:

Rate     Rate Shift   6% (no pts)     Buydown to 6%      Buydown to 6%
                                      (borrower-paid)    (lender-paid)
7.00%    +100         Turnover        Turnover           Turnover++*
6.00%    Flat         Turnover        Turnover           Faster (at reset)
5.75%    -25          Refi            Turnover           Refi
5.00%    -100         Refi (Faster)   Refi (Fast)        Refi (Fastest)

*Turnover++ means faster due to defaults or at reset

Overall, temporary buydowns are likely to exhibit the most rate sensitivity. As their mortgage rates reset higher, they will behave like ARMs and refi into any other lower rate option (5/1 ARM) or possibly default. In the money, they will be the quickest to refi.


Incorporating Covid-Era Mortgage Data Without Skewing Your Models

What we observed during Covid represents a radical departure from what we observed pre-Covid. To what extent do these observations impact long-term trends observed for mortgage performance? Should these data fundamentally impact the way in which we think about the effects borrower, loan and macroeconomic characteristics have on mortgage performance? Or do we need to simply account for them as a short-term blip?

The process of modeling mortgage defaults and prepayments typically begins with identifying long-term trends and reference values. These aid in creating the baseline forecasts that undergird the model in its most simplistic form. Modelers then begin looking for deviations from this baseline created by specific loan, borrower, and property characteristics, as well as by key macroeconomic variables.

Identifying these relationships enables modelers to begin quantifying the extent to which micro factors like income, credit score, and loan-to-value ratios interact with macro indicators like the unemployment rate to cause prepayments and defaults to depart from their baseline. Data observations aggregated over extended periods give the most comprehensive picture possible of these relationships.

In practice, the human behavior underlying these and virtually all economic models tends to change over time. Modelers account for this by making short-term corrections based on observations from the most recent time periods. This approach of tweaking long-term trends based on recent performance works reasonably well under most circumstances. One could reasonably argue, however, that tweaking existing models using performance data collected during the Covid-19 era presents a unique set of challenges.

What was observed during Covid represents a radical departure from what was observed pre-Covid. To what extent do these observations impact long-term trends and reference values? Should these data fundamentally impact the way in which we think about the effects borrower, loan and macroeconomic characteristics have on mortgage performance? Or do we need to simply account for them as a short-term blip?


How Covid-era mortgage data differs

When it comes to modeling mortgage performance, we generally think of three sets of factors: 1) macroeconomic conditions, 2) loan and borrower characteristics, and 3) property characteristics. In determining how to account for Covid-era data in our modeling, we first must attempt to evaluate its impact on these factors.

Three macroeconomic factors have played an especially significant role recently. First, as reflected in the chart below, we experienced a significant home-price decline during the 2008 financial crisis but a steady increase since then.

Second, mortgage rates continued to decline for the most part during the crisis and beyond. There were brief periods when they increased, but they remained low by and large.

The third piece is the unemployment rate. Unemployment spiked to around 10 percent during the financial crisis and then slowly declined.

When home prices declined in the past, we typically saw the government attempt to respond to it by reducing interest rates. This created something of a correlation between home prices and mortgage rates. Looking at this from a purely statistical viewpoint, the only thing the historical data shows is that falling home prices bring about a decline in mortgage rates. (And rising home prices bring about higher interest rates, though to a far lesser degree.) We see something similar with unemployment. Falling unemployment is correlated with rising home prices.

But then Covid arrives and with it some things we had not observed previously. All the “known” correlations among these macroeconomic variables broke down. For example, the unemployment rate spikes to 15 percent within just a couple of months and yet has no negative impact at all on home prices. Home prices, in fact, continue to rise, supported by the very generous unemployment benefits provided during the Covid pandemic.

This greatly complicates the modeling. Here we had these variable relationships that appeared steady over a period of decades, and all of our modeling was being done (knowingly or unknowingly) relying on these correlations, and suddenly all these correlations are breaking down.

What does this mean for forecasting prepayments? The following chart shows prepayments over time by vintage. We see extremely high prepayment rates between early 2020 (the start of the pandemic) and early 2022 (when rates started rising). This makes sense.

Look at what happens to our forecasts, however, when rates begin to increase. The following chart reflects the models predicting a much steeper drop-off in prepayments than what was actually observed for a July 2021-issuance Fannie Mae major with a 2.0 coupon. These mortgage loans with no refinance incentive are prepaying faster than what would be expected based on the historical data.

What is causing this departure?

The most plausible explanation relates to an observed increase in cash-out refinances caused by the recent run-up in home prices, which has left many homeowners suddenly finding themselves with a lot of home equity to tap into. Pre-Covid, cash-outs accounted for between a quarter and a third of refinances. Now, with virtually no one in the money for a rate-and-term refinance, cash-outs are accounting for over 80 percent of them.

We learn from this that we need to incorporate the amount of home equity gained by borrowers into our prepayment modeling.
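One simple way to express “home equity gained” as a model input is to mark the property value to market with a home price index and compare it to the current loan balance. The sketch below is an illustration under that assumption, not a description of any particular prepayment model. The function name, the inputs, and the sample figures are hypothetical.

```python
def equity_metrics(current_balance, original_value, hpi_at_origination, hpi_now):
    """Mark a property to market with an HPI ratio and derive equity measures."""
    current_value = original_value * hpi_now / hpi_at_origination
    return {
        "current_value": current_value,
        "current_ltv": current_balance / current_value,  # mark-to-market LTV
        "equity": current_value - current_balance,       # equity available to tap
    }

# Hypothetical loan: $400,000 home at origination, $300,000 balance today,
# and a local index that has risen from 100 to 125 since origination.
metrics = equity_metrics(
    current_balance=300_000,
    original_value=400_000,
    hpi_at_origination=100.0,
    hpi_now=125.0,
)
print(metrics)
```

A variable such as `current_ltv` or `equity` could then sit alongside refinance incentive as a driver of cash-out activity.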

Modeling Credit Performance

Of course, Covid’s impacts were felt even more acutely in delinquency rates than in prepays. As the following chart shows, a borrower that was 1-month delinquent during Covid had a 75 percent probability of being 2-months delinquent the following month.

This is clearly way outside the norm of what was observed historically and compels us to ask some hard questions when attempting to fit a model to this data.

The long-term average of “two to worse” transitions (the percentage of 60-day delinquencies that become 90-day delinquencies or worse the following month) is around 40 percent. But we’re now observing something closer to 50 percent. Do we expect this to continue in the future, or do we expect it to revert to the longer-term average? We observe a similar issue in other transitions, as illustrated below. The rates appear to be stabilizing at higher levels now relative to where they were pre-Covid. This is especially true of more serious delinquencies.
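For reference, a transition (or roll) rate of this kind is simply a conditional share computed from month-over-month status pairs. The sketch below shows the calculation. The `roll_rate` helper and the sample counts are hypothetical, chosen only to echo the 50 percent and 75 percent figures mentioned above, and are not drawn from actual performance data.

```python
def roll_rate(transitions, from_status, worse_from):
    """Share of loans starting the month at `from_status` that end it at
    `worse_from` months delinquent or worse."""
    ends = [end for start, end in transitions if start == from_status]
    if not ends:
        raise ValueError("no loans observed in the starting status")
    return sum(1 for end in ends if end >= worse_from) / len(ends)

# Hypothetical (months delinquent this month, months delinquent next month) pairs.
transitions = (
    [(2, 3)] * 5    # 60-day loans rolling to 90 days
    + [(2, 2)] * 3  # 60-day loans staying at 60 days
    + [(2, 0)] * 2  # 60-day loans curing
    + [(1, 2)] * 3  # 30-day loans rolling to 60 days
    + [(1, 0)] * 1  # 30-day loans curing
)
two_to_worse = roll_rate(transitions, from_status=2, worse_from=3)
one_to_two = roll_rate(transitions, from_status=1, worse_from=2)
print(two_to_worse, one_to_two)
```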

How do we respond to this? What is the best way to go about combining this pre-Covid and post-Covid data?

Principles for handling Covid-era mortgage data

One approach would be to think about Covid data as outliers that should be ignored. At the other extreme, we could simply accept the observed data and incorporate it without any special considerations. A split-the-difference third approach would have us incorporate the new data with some sort of weighting factor for use in future stress scenarios without completely casting aside the long-term reference values that had stood the test of time prior to the pandemic.

This third approach requires us to apply the following guiding principles:

  1. Assess assumed correlations between driving macro variables. For example, don’t allow the model to assume that increasing unemployment will lead to higher home prices just because it happened once during a pandemic.
  2. Choose short-term calibrations carefully. Do not allow models to be unduly influenced by blindly giving too much weight to what has happened in the past two years.
  3. Determine whether the new data in fact reflects a regime shift. How long will the new regime last?
  4. Avoid creating a model that will break down during future unusual periods.
  5. Prepare for other extremes. Incorporate what was learned into future stress testing.
  6. Build models that allow sensitivity analyses and are easy to change/tune. Models need to be sufficiently flexible that they can be tuned in response to macroeconomic events in a matter of weeks, rather than taking months or years to design and build an entirely new model.
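The weighting idea behind this third approach can be sketched as a blend of a long-term reference value and a Covid-era observation, with the weight exposed as a tunable dial for sensitivity analysis. The linear blend below is only one possible scheme and is our assumption, not a prescribed method. The 40 percent and 50 percent inputs are the approximate “two to worse” rates quoted earlier in this article.

```python
def blended_rate(long_term, covid_era, covid_weight):
    """Linear blend of a long-term reference value and a Covid-era observation."""
    if not 0.0 <= covid_weight <= 1.0:
        raise ValueError("covid_weight must be between 0 and 1")
    return (1.0 - covid_weight) * long_term + covid_weight * covid_era

LONG_TERM_TWO_TO_WORSE = 0.40   # approximate long-term average
COVID_ERA_TWO_TO_WORSE = 0.50   # approximate recent observation

# Sensitivity analysis: how much does the calibrated rate move with the weight?
sensitivity = {
    weight: blended_rate(LONG_TERM_TWO_TO_WORSE, COVID_ERA_TWO_TO_WORSE, weight)
    for weight in (0.0, 0.25, 0.5, 1.0)
}
for weight, rate in sensitivity.items():
    print(f"covid weight {weight:.2f} -> two-to-worse {rate:.3f}")
```

A weight of 0 treats the Covid data as outliers to be ignored and a weight of 1 accepts it without adjustment, so the two extremes described above are the endpoints of the same dial.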

Covid-era mortgage data presents modelers with a unique challenge: how to appropriately consider it without overweighting it. These general guidelines are a good place to start. For ideas specific to your portfolio, contact a RiskSpan representative.


How Rithm Capital leverages RiskSpan’s expertise and Edge Platform to enhance data management and achieve economies of scale




One of the nation’s largest mortgage loan and MSR investors was hampered by a complex data ingestion process as well as slow and cumbersome on-prem software for pricing and market risk.

A complicated data wrangling process was taking up significant time and led to delays in data processing. Further, month-end risk and financial reporting processes were manual and time-pressured. The data and risk teams were consumed with maintaining the day-to-day with little time available to address longer-term data strategies and enhance risk and modeling processes.



  1. Modernize Rithm’s mortgage loan and MSR data intake from servicers — improve overall quality of data through automated processes and development of a data QC framework that would bring more confidence in the data and associated use cases, such as for calculating historical performance.

  2. Streamline portfolio valuation and risk analytics while enhancing granularity and flexibility through loan-level valuation/risk.

  3. Ensure data availability for accounting, finance and other downstream processes.

  4. Bring scalability and internal consistency to all of the processes above.



By adopting RiskSpan’s cloud-native data management, managed risk, and SaaS solutions, Rithm Capital saved time and money by streamlining its processes.

Adopting Edge has enabled Rithm to access enhanced and timely data for better performance tracking and risk management by:

  • Managing data on 5.5 million loans, including source information and monthly updates from loan servicers (with ability in the future to move to daily updates)
  • Ingesting, validating and normalizing all data for consistency across servicers and assets
  • Implementing automated data QC processes
  • Performing granular, loan-level analysis


With more than 5 million mortgage loans spread across nine servicers, Rithm needed a way to consume data from different sources whose file formats varied from one another and also often lacked internal consistency. Data mapping and QC rules constantly had to be modified to keep up with evolving file formats. 

Once the data was onboarded, Rithm required an extraordinary amount of compute power to run stochastic paths of Monte Carlo rate simulations on all 4 million of those loans individually and then discount the resulting cash flows based on option-adjusted yield across multiple scenarios.

To help minimize the computing workload, Rithm had been running all these daily analytics at a rep-line level—stratifying and condensing everything down to between 70,000 and 75,000 rep lines. This alleviated the computing burden but at the cost of decreased accuracy and limited reporting flexibility because results were not at the loan-level.

Enter RiskSpan’s Edge Platform.

Combining the strength of RiskSpan’s subject matter experts, quantitative analysts, and technologists together with the power of the Edge platform, RiskSpan has helped Rithm achieve its objectives across the following areas: 

Data management and performance reporting

  • Data intake and quality control for 9 servicers across loan and MSR portfolios
  • Servicer data enrichment
  • Automated data loads leading to reduced processing time for rolling tapes
  • Ongoing data management support and resolution
  • Historical performance review and analysis (portfolio and universe)

Valuation and risk

  • Daily reporting of MSR, mortgage loan and security valuation and risk analytics based on customized Tableau reports
  • MSR and whole loan valuation/risk calculated at the loan level, leveraging the scalability of the cloud-native infrastructure
  • Additional scenario analysis and other requirements needed for official accounting and valuation purposes

Interactive tools for portfolio management

  • Fast and accurate tape cracking for purchase/sale decision support
  • Ad-hoc scenario analyses based on customized dials and user-settings

The implementation of these enhanced data and analytics processes and increased ability to scale these processes has allowed Rithm to spend less time on day-to-day data wrangling and focus more on higher-level data analysis and portfolio management. The quality of data has also improved, which has led to more confidence in the data that is used across many parts of the organization.


Models + Data management = End-to-end Managed Process

The economies of scale we have achieved by being able to consolidate all of our portfolio risk, interactive analytics, and data warehousing onto a single platform are substantial. RiskSpan’s experience with servicer data and MSR analytics have been particularly valuable to us.

          — Head of Analytics

Optimizing Analytics Computational Processing 

We met with RiskSpan’s Head of Engineering and Development, Praveen Vairavan, to understand how his team set about optimizing analytics computational processing for a portfolio of 4 million mortgage loans using a cloud-based compute farm.

This interview dives deeper into a case study we discussed in a recent interview with RiskSpan’s co-founder, Suhrud Dagli.

Here is what we learned from Praveen. 


Could you begin by summarizing for us the technical challenge this optimization was seeking to overcome? 

PV: The main challenge related to an investor’s MSR portfolio, specifically the volume of loans we were trying to run. The client has close to 4 million loans spread across nine different servicers. This presented two related but separate sets of challenges. 

The first set of challenges stemmed from needing to consume data from different servicers whose file formats not only differed from one another but also often lacked internal consistency. By that, I mean even the file formats from a single given servicer tended to change from time to time. This required us to continuously update our data mapping and (because the servicer reporting data is not always clean) modify our QC rules to keep up with evolving file formats.  

The second challenge relates to the sheer volume of compute power necessary to run stochastic paths of Monte Carlo rate simulations on 4 million individual loans and then discount the resulting cash flows based on option-adjusted yield across multiple scenarios.

And so you have 4 million loans times multiple paths times one basic cash flow, one basic option-adjusted case, one up case, and one down case, and you can see how quickly the workload adds up. And all this needed to happen on a daily basis. 

To help minimize the computing workload, our client had been running all these daily analytics at a rep-line level—stratifying and condensing everything down to between 70,000 and 75,000 rep lines. This alleviated the computing burden but at the cost of decreased accuracy because they couldn’t look at the loans individually. 
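The arithmetic behind that workload is easy to reproduce. The sketch below uses the figures quoted in this interview (4 million loans, 50 paths, 4 scenarios, and up to 75,000 rep lines). It assumes, for comparison purposes only, that the rep-line runs used the same path and scenario counts.

```python
LOANS = 4_000_000
PATHS = 50            # Monte Carlo rate paths, per the interview
SCENARIOS = 4         # base, option-adjusted, up, and down cases
REP_LINES = 75_000    # upper end of the 70,000 to 75,000 range

# Each unit is one cash flow projection for one loan (or rep line) on one
# path in one scenario.
loan_level_runs = LOANS * PATHS * SCENARIOS
rep_line_runs = REP_LINES * PATHS * SCENARIOS  # assumes same paths/scenarios

print(f"loan-level projections per day: {loan_level_runs:,}")
print(f"rep-line projections per day:   {rep_line_runs:,}")
print(f"workload multiple: {loan_level_runs / rep_line_runs:.1f}x")
```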

What technology enabled you to optimize the computational process of running 50 paths and 4 scenarios for 4 million individual loans?

PV: With the cloud, you have the advantage of spawning a bunch of servers on the fly (just long enough to run all the necessary analytics) and then shutting them down once the analytics are done.

This sounds simple enough. But to use that many servers, we needed to figure out how to distribute the 4 million loans across them so they could run in parallel, and then how to get the results back so we could aggregate them. We did this using what is known as a MapReduce approach. 

Say we want to run a particular cohort of this dataset with 50,000 loans in it. If we were using a single server, it would run them one after the other – generate all the cash flows for loan 1, then for loan 2, and so on. As you would expect, that is very time-consuming. So, we decided to break down the loans into smaller chunks. We experimented with various chunk sizes. We started with 1,000 – we ran 50 chunks of 1,000 loans each in parallel across the AWS cloud and then aggregated all those results.  

That was an improvement, but the 50 parallel jobs were still taking longer than we wanted. And so, we experimented further before ultimately determining that the “sweet spot” was something closer to 5,000 parallel jobs of 100 loans each. 

Only in the cloud is it practical to run 5,000 servers in parallel. But this of course raises the question: Why not just go all the way and run 50,000 parallel jobs of one loan each? Well, as it happens, running an excessively large number of jobs carries overhead burdens of its own. And we found that the extra time needed to manage that many jobs more than offset the compute time savings. And so, using a fair bit of trial and error, we determined that 100-loan jobs maximized the runtime savings without creating an overly burdensome number of jobs running in parallel.  
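As a rough illustration of the chunk-and-aggregate idea, here is a minimal sketch of the first experiment described above (a 50,000-loan cohort split into chunks of 1,000). It is not RiskSpan's implementation: a local thread pool stands in for the cloud servers, and value_chunk is a hypothetical placeholder for the cash flow engine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, Sequence


def chunk(loans: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` loans."""
    for start in range(0, len(loans), size):
        yield loans[start:start + size]


def value_chunk(loans: Sequence[dict]) -> float:
    """Map step: hypothetical placeholder for the cash flow engine.

    Each loan's "value" is simply its balance, so the result is easy to check.
    """
    return sum(loan["balance"] for loan in loans)


def run_cohort(loans: Sequence[dict], chunk_size: int, max_workers: int = 8) -> float:
    """Map chunks to workers in parallel, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(value_chunk, chunk(loans, chunk_size)))
    return sum(partials)  # reduce step


cohort = [{"loan_id": i, "balance": 100_000.0} for i in range(50_000)]
jobs = list(chunk(cohort, 1_000))
total = run_cohort(cohort, chunk_size=1_000)
print(len(jobs), total)
```

Because chunk_size is just a parameter, the same harness can be rerun with different sizes to look for the point where job-management overhead starts to outweigh the parallel speedup.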


You mentioned the challenge of having to manage a large number of parallel processes. What tools do you employ to work around these and other bottlenecks? 

PV: The most significant bottleneck associated with this process is finding the “sweet spot” number of parallel processes I mentioned above. As I said, we could theoretically break it down into 4 million single-loan processes all running in parallel. But managing this amount of distributed computation, even in the cloud, invariably creates a degree of overhead which ultimately degrades performance. 

And so how do we find that sweet spot – how do we optimize the number of servers on the distributed computation engine? 

As I alluded to earlier, the process involved an element of trial and error. But we also developed some home-grown tools (and leveraged some tools available in AWS) to help us. These tools enable us to visualize computation server performance – how much of a load they can take, how much memory they use, etc. These helped eliminate some of the optimization guesswork.   

Is this optimization primarily hardware based?

PV: AWS provides essentially two “flavors” of machines. One “flavor” enables you to take in a lot of memory. This enables you to keep a whole lot of loans in memory so it will be faster to run. The other flavor of hardware is more processor based (compute intensive). These machines provide a lot of CPU power so that you can run a lot of processes in parallel on a single machine and still get the required performance. 

We have done a lot of R&D on this hardware. We experimented with many different instance types to determine which works best for us and optimizes our output: lots of memory but smaller CPUs vs. CPU-intensive machines with less (but still a reasonable amount of) memory. 

We ultimately landed on a machine with 96 cores and about 240 GB of memory. This was the balance that enabled us to run portfolios at speeds consistent with our SLAs. For us, this translated to a server farm of 50 machines running 70 processes each, which works out to 3,500 workers helping us to process the entire 4-million-loan portfolio (across 50 Monte Carlo simulation paths and 4 different scenarios) within the established SLA.  
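The arithmetic behind these figures can be laid out in a few lines. This is only a back-of-the-envelope sketch: it assumes every one of the 4 scenarios is run across all 50 paths (the interview does not spell this out) and rounds the portfolio to an even 4 million loans.

```python
LOANS = 4_000_000            # "close to 4 million" in the interview, rounded here
PATHS = 50
SCENARIOS = 4
MACHINES = 50
PROCESSES_PER_MACHINE = 70

workers = MACHINES * PROCESSES_PER_MACHINE      # worker processes in the farm
loan_runs = LOANS * PATHS * SCENARIOS           # loan-path-scenario evaluations
loans_per_worker = LOANS / workers
runs_per_worker = loan_runs / workers

print(f"{workers:,} workers")
print(f"{loan_runs:,} loan-path-scenario runs")
print(f"{loans_per_worker:,.0f} loans and {runs_per_worker:,.0f} runs per worker")
```

Under those assumptions, each worker handles a little over a thousand loans per daily run.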

What software-based optimization made this possible? 

PV: Even optimized in the cloud, hardware can get pricey – on the order of $4.50 per hour in this example. And so, we supplemented our hardware optimization with some software-based optimization as well. 

We were able to optimize our software to a point where we could use a machine with just 30 cores (rather than 96) and 64 GB of RAM (rather than 240). Using 80 of these machines running 40 processes each gives us 2,400 workers (rather than 3,500). Software optimization enabled us to run the same number of loans in roughly the same amount of time (slightly faster, actually) but using fewer hardware resources. And our cost to use these machines was just one-third what we were paying for the more resource-intensive hardware. 

All this, and our compute time actually declined by 10 percent.  

The software optimization that made this possible has two parts: 

The first part (as we discussed earlier) is using the MapReduce methodology to break down jobs into optimally sized chunks. 

The second part involved optimizing how we read loan-level information into the analytical engine. Reading in loan-level data (especially for 4 million loans) is a huge bottleneck. We got around this by implementing a “pre-processing” procedure. For each individual servicer, we created a set of optimized loan files that can be read and rendered “analytics ready” very quickly. This enables the loan-level data to be quickly consumed and immediately used for analytics without having to read all the loan tapes and convert them into a format that the analytics engine can understand. Because we have “pre-processed” all this loan information, it is immediately available in a format that the engine can easily digest and run analytics on.  
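One way to picture the pre-processing idea is the sketch below: parse and type-convert a loan tape once, store the result in a binary form, and have each analytics run load that form directly. The interview does not say which file format was actually used; pickle is chosen here purely for illustration, and the tape contents and field names are hypothetical.

```python
import csv
import io
import pickle
from typing import List

# Hypothetical raw servicer tape (CSV text) standing in for a large file on disk.
RAW_TAPE = "loan_id,balance,rate\nA-1,250000,3.75\nA-2,180000,4.25\n"


def parse_tape(text: str) -> List[dict]:
    """Slow path: parse the raw tape and cast every field on each read."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"loan_id": r["loan_id"], "balance": float(r["balance"]), "rate": float(r["rate"])}
        for r in rows
    ]


def preprocess(text: str) -> bytes:
    """Run once per tape: parse it and serialize the analytics-ready records."""
    return pickle.dumps(parse_tape(text), protocol=pickle.HIGHEST_PROTOCOL)


def load_preprocessed(blob: bytes) -> List[dict]:
    """Fast path used by each analytics run: no parsing or type conversion."""
    return pickle.loads(blob)


blob = preprocess(RAW_TAPE)        # done ahead of time, once per tape
loans = load_preprocessed(blob)    # done at run time
print(loans)
```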

This software-based optimization is what ultimately enabled us to optimize our hardware usage (and save time and cost in the process).  

Contact us to learn more about how we can help you optimize your mortgage analytics computational processing.

Rethink Analytics Computational Processing – Solving Yesterday’s Problems with Today’s Technology and Access 

We sat down with RiskSpan’s co-founder and chief technology officer, Suhrud Dagli, to learn more about how one mortgage investor successfully overhauled its analytics computational processing. The investor migrated from a daily pricing and risk process that relied on tens of thousands of rep lines to one capable of evaluating each of the portfolio’s more than three-and-a-half million loans individually, and actually saved money in the process.  

Here is what we learned. 

Could you start by talking a little about this portfolio — what asset class and what kind of analytics the investor was running? 

SD: Our client was managing a large investment portfolio of mortgage servicing rights (MSR) assets, residential loans and securities.  

The investor runs a battery of sophisticated risk management analytics that rely on stochastic modeling. Option-adjusted spread, duration, convexity, and key rate durations are calculated based on more than 200 interest rate simulations. 


Why was the investor running their analytics computational processing using a rep line approach? 

SD: They used rep lines for one main reason: They needed a way to manage computational loads on the server and improve calculation speeds. Secondarily, organizing the loans in this way simplified their reporting and accounting requirements to a degree (loans financed by the same facility were grouped into the same rep line).  

This approach had some downsides. Pooling loans by finance facility sometimes caused loans with different balances, LTVs, credit scores, etc., to be grouped into the same rep line. As a result, every loan in a rep line received the same prepayment and default assumptions, which often differed from the assumptions that would have been applied had the loans been evaluated individually.  

The most obvious solution to this would seem to be one that disassembles the finance facility groups into their individual loans, runs all those analytics at the loan level, and then re-aggregates the results into the original rep lines. Is this sort of analytics computational processing possible without taking all day and blowing up the server? 

SD: That is effectively what we are doing. The process is not as speedy as we’d like it to be (and we are working on that). But we have worked out a solution that does not overly tax computational resources.  

The analytics computational processing we are implementing ignores the rep line concept entirely and just runs the loans. The scalability of our cloud-native infrastructure enables us to take the three-and-a-half million loans and bucket them equally for computation purposes. We run a hundred loans on each processor and get back loan-level cash flows and then generate the output separately, which brings the processing time down considerably. 
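The disassemble, run, and re-aggregate flow raised in the question above can be sketched as follows. The field names and the value_loan function are hypothetical placeholders; in practice the loan-level step would be the full cash flow and pricing model.

```python
from collections import defaultdict
from typing import Dict, List


def value_loan(loan: dict) -> float:
    """Hypothetical placeholder for the loan-level model: balance times a price factor."""
    return loan["balance"] * loan["price"]


def aggregate_by_rep_line(loans: List[dict]) -> Dict[str, float]:
    """Value each loan individually, then sum the results by rep line."""
    totals: Dict[str, float] = defaultdict(float)
    for loan in loans:
        totals[loan["rep_line"]] += value_loan(loan)
    return dict(totals)


loans = [
    {"loan_id": 1, "rep_line": "facility_1", "balance": 200_000.0, "price": 1.0},
    {"loan_id": 2, "rep_line": "facility_1", "balance": 100_000.0, "price": 0.5},
    {"loan_id": 3, "rep_line": "facility_2", "balance": 300_000.0, "price": 1.0},
]
rep_line_values = aggregate_by_rep_line(loans)
print(rep_line_values)
```

The grouping key only matters at the reporting stage, so the same loan-level results can be rolled up by facility, servicer, or any other attribute without rerunning the analytics.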


So we have a proof of concept that this approach to analytics computational processing works in practice for running pricing and risk on MSR portfolios. Is it applicable to any other asset classes?

SD: The underlying principles that make analytics computational processing possible at the loan level for MSR portfolios apply equally well to whole loan investors and MBS investors. In fact, the investor in this example has a large whole-loan portfolio alongside its MSR portfolio. And it is successfully applying these same tactics on that portfolio.   

An investor in any mortgage asset benefits from the ability to look at and evaluate loan characteristics individually. The results may need to be rolled up and grouped for reporting purposes. But being able to run the cash flows at the loan level ultimately makes the aggregated results vastly more meaningful and reliable. 

A loan-level framework also affords whole-loan and securities investors the ability to be sure they are capturing the most important loan characteristics and are staying on top of how the composition of the portfolio evolves with each day’s payoffs. 

ESG factors are an important consideration for a growing number of investors. Only a loan-level approach makes it possible for these investors to conduct the kind of property- and borrower-level analyses to know whether they are working toward meeting their ESG goals. It also makes it easier to spot areas of geographic concentration risk, which simplifies climate risk management to some degree.  

Say I am a mortgage investor who is interested in moving to loan-level pricing and risk analytics. How do I begin? 

 SD: Three things: 

  1. It begins with having the data. Most investors have access to loan-level data. But it’s not always clean. This is especially true of origination data. If you’re acquiring a pool – be it a seasoned pool or a pool right after origination – you don’t have the best origination data to drive your model. You also need a data store that can generate loan-level output to drive your analytics and models.
  2. The second factor is having models that work at the loan level – models that have been calibrated using loan-level performance and that are capable of generating loan-level output. One of the constraints of several existing modeling frameworks developed by vendors is they were created to run at a rep line level and don’t necessarily work very well for loan-level projections.  
  3. The third thing you need is a compute farm. It is virtually impossible to run loan-level analytics if you’re not on the cloud because you need to distribute the computational load. And your computational distribution requirements will change from portfolio to portfolio based on the type of analytics that you are running, based on the types of scenarios that you are running, and based on the models you are using. 

The cloud is needed not just for CPU power but also for storage. This is because once you go to the loan level, every loan’s data must be made available to every processor that’s performing the calculation. This is where having the kind of shared databases that are native to a cloud infrastructure becomes vital. You simply can’t replicate it using an on-premise setup of computers in your office or in your own data center. 

So, 1) get your data squared away, 2) make sure you’re using models that are optimized for loan-level, and 3) max out your analytics computational processing power by migrating to cloud-native infrastructure. Thank you, Suhrud, for taking the time to speak with us.

“Reject Inference” Methods in Credit Modeling: What are the Challenges?

Reject inference is a popular concept that has been used in credit modeling for decades. Yet, we observe in our work validating credit models that the concept is still dynamically evolving. The appeal of reject inference, whose aim is to develop a credit scoring model utilizing all available data, including that of rejected applicants, is easy enough to grasp. But the technique also introduces a number of fairly vexing challenges.

The technique seeks to rectify a fundamental shortcoming in traditional credit modeling: Models predicting the probability that a loan applicant will repay the loan can be trained to historical loan application data with a binary variable representing whether a loan was repaid or charged off. This information, however, is only available for accepted applications. And many of these applications are not particularly recent. This limitation results in a training dataset that may not be representative of the broader loan application universe.

Credit modelers have devised several techniques for getting around this data representativeness problem and increasing the number of observations by inferring the repayment status of rejected loan applications. These techniques, while well intentioned, are often treated empirically and lack a deeper theoretical basis. They often result in “hidden” modeling assumptions, the reasonableness of which is not fully investigated. Additionally, no theoretical properties of the coefficient estimates or predictions are guaranteed.

This article summarizes the main challenges of reject inference that we have encountered in our model validation practice.


Selecting the Right Reject Inference Method

Many approaches exist for reject inference, none of which is clearly and universally superior to all the others. Empirical studies have been conducted to compare methods and pick a winner, but the conclusions of these studies are often contradictory. Some authors argue that reject inference cannot improve scorecard models[1] and flatly recommend against its use. Others posit that certain techniques can outperform others[2] based on empirical experiments. The results of these experiments, however, tend to be data dependent. Some of the most popular approaches include the following:

  • Ignoring rejected applications: The simplest approach is to develop a credit scoring model based only on accepted applications. The underlying assumption is that rejected applications can be ignored and that the “missingness” of this data from the training dataset can be classified as missing at random. Supporters of this method point to the simplicity of the implementation, clear assumptions, and good empirical results. Others argue that the rejected applications cannot be dismissed simply as random missing data and thus should not be ignored.
  • Hard cut-off method: In this method, a model is first trained using only accepted application data. This trained model is then used to predict the probabilities of charge-off for the rejected applications. A cut-off value is then chosen. Hypothetical loans from rejected applications with probabilities higher than this cut-off value are considered charged off. Hypothetical loans from the remaining applications are assumed to be repaid. The specified model is then re-trained using a dataset including both accepted and rejected applications.
  • Fuzzy augmentation: Similar to the hard cut-off method, fuzzy augmentation begins by training the model on accepted applications only. The resulting model with estimated coefficients is then used to predict charge-off probabilities for rejected applications. Data from rejected applications is then duplicated and a repaid or charged-off status is assigned to each. The specified model is then retrained on the augmented dataset—including accepted applications and the duplicated rejects. Each rejected application is weighted by either a) the predicted probability of charge-off if its assigned status is “charged-off,” or b) the predicted probability of it being repaid if its assigned status is “repaid.”
  • Parceling: The parceling method resembles the hard cut-off method. However, rather than classifying all rejects above a certain threshold as charged-off, this method classifies the repayment status in proportion to the expected “bad” rate (charge-off frequency) at that score. The predicted charge-off probabilities are partitioned into k intervals. Then, for each interval, an assumption is made about the bad rate, and loan applications in each interval are assigned a repayment status randomly according to the bad rate. Bad rates are assumed to be higher in the reject dataset than among the accepted loans. This method considers the missingness to be not at random (MNAR), which requires the modeler to supply additional information about the distribution of charge-offs among rejects.
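To make the differences among the last three methods concrete, the sketch below shows only the label-assignment step for each, starting from charge-off probabilities that a model trained on accepted applications has already produced for the rejects. The function names, the (index, label, weight) record layout, and the example inputs are illustrative assumptions; model fitting and retraining are left out.

```python
import random
from typing import List, Sequence, Tuple

# Each augmented record is (reject_index, label, weight); label 1 = charged off, 0 = repaid.
Record = Tuple[int, int, float]


def hard_cutoff(p_bad: Sequence[float], cutoff: float) -> List[Record]:
    """Label a reject charged off when its predicted probability exceeds the cut-off."""
    return [(i, 1 if p > cutoff else 0, 1.0) for i, p in enumerate(p_bad)]


def fuzzy_augmentation(p_bad: Sequence[float]) -> List[Record]:
    """Duplicate each reject: a charged-off copy weighted p and a repaid copy weighted 1 - p."""
    records: List[Record] = []
    for i, p in enumerate(p_bad):
        records.append((i, 1, p))
        records.append((i, 0, 1.0 - p))
    return records


def parceling(
    p_bad: Sequence[float],
    edges: Sequence[float],
    bad_rates: Sequence[float],
    seed: int = 0,
) -> List[Record]:
    """Assign labels at random within each score interval according to an assumed bad rate.

    `edges` holds the k - 1 interior boundaries of the k intervals, and
    `bad_rates` holds the modeler's assumed bad rate for each interval.
    """
    rng = random.Random(seed)
    records: List[Record] = []
    for i, p in enumerate(p_bad):
        interval = sum(p >= e for e in edges)  # index of the interval containing p
        label = 1 if rng.random() < bad_rates[interval] else 0
        records.append((i, label, 1.0))
    return records


p_bad = [0.05, 0.20, 0.45, 0.80]  # hypothetical predicted charge-off probabilities
cut = hard_cutoff(p_bad, cutoff=0.5)
fuzzy = fuzzy_augmentation(p_bad)
parcels = parceling(p_bad, edges=[0.3, 0.6], bad_rates=[0.15, 0.50, 0.90])
print(cut, fuzzy, parcels, sep="\n")
```

In each case the returned records would be appended to the accepted applications (each with weight 1) before the model is retrained. The parceling bad rates are exactly the kind of modeler-supplied assumption the text describes.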

Proportion of Accepted Applications to Rejects

An institution with a relatively high percentage of rejected applications will necessarily end up with an augmented training dataset whose quality is heavily dependent on the quality of the selected reject inference method and its implementation. One might argue it is best to limit the proportion of rejected applications to acceptances. The level at which such a cap is established should reflect the “confidence” in the method used. Estimating such a confidence level, however, is a highly subjective endeavor.
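One simple way to impose such a cap is to subsample the rejects at random before augmenting the training set, as in the sketch below. Random subsampling and the max_ratio parameter are assumptions made for illustration; the text does not prescribe a particular mechanism.

```python
import random
from typing import List, Sequence


def cap_rejects(rejects: Sequence, n_accepts: int, max_ratio: float, seed: int = 0) -> List:
    """Randomly subsample rejects so that rejects / accepts does not exceed max_ratio."""
    limit = int(max_ratio * n_accepts)
    if len(rejects) <= limit:
        return list(rejects)
    return random.Random(seed).sample(list(rejects), limit)


rejects = list(range(1_000))  # hypothetical reject IDs
kept = cap_rejects(rejects, n_accepts=400, max_ratio=0.5)
print(len(kept))
```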

The Proportion of Bad Rates for Accepts and Rejects

It is reasonable to assume that the “bad rate,” i.e., proportion of charged-off loans to repaid loans, will be higher among rejected applications. Some modelers set a threshold based on their a priori belief that the bad rate among rejects is at least p-times the bad rate among acceptances. If the selected reject inference method produces a dataset with a bad rate that is perceived to be artificially low, actions are taken to increase the bad rate above some threshold. Identifying where to establish this threshold is notoriously difficult to justify.
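The a priori check described above amounts to a one-line comparison. In this sketch the bad rate is taken to be the share of all loans that charge off, which is an assumption about how the rate is measured, and the label counts are made up for illustration.

```python
from typing import Sequence


def bad_rate(labels: Sequence[int]) -> float:
    """Share of loans labeled charged off (1 = charged off, 0 = repaid)."""
    return sum(labels) / len(labels)


def meets_prior(accept_labels: Sequence[int], reject_labels: Sequence[int], p: float) -> bool:
    """True when the inferred reject bad rate is at least p times the accept bad rate."""
    return bad_rate(reject_labels) >= p * bad_rate(accept_labels)


accepts = [0] * 95 + [1] * 5     # hypothetical observed outcomes: 5% bad rate
rejects = [0] * 88 + [1] * 12    # hypothetical inferred outcomes: 12% bad rate
print(meets_prior(accepts, rejects, p=2.0), meets_prior(accepts, rejects, p=3.0))
```

The code makes the difficulty plain: the outcome of the check depends entirely on the choice of p, which is the hard-to-justify part.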

Variable Selection

As outlined above, most approaches begin by estimating a preliminary model based on accepted applications only. This model is then used to infer how rejected loans would have performed. The preliminary model is then retrained on a dataset consisting both of actual data from accepted applications and of the inferred data from rejects. This means that the underlying variables themselves are selected based only on the actual loan performance data from accepted applications. The statistical significance of the selected variables might change, however, when moving to the complete dataset. Variable selection is sometimes redone using the complete data. This, however, can lead to overfitting.

Measuring Model Performance

From a model validator’s perspective, an ideal solution would involve creating a control group in which applications would not be scored and filtered and every application would be accepted. Then the discriminating power of a credit model could be assessed by comparing the charge-off rate of the control group with the charge-off rate of the loans accepted by the model. This approach of extending credit indiscriminately is impractical, however, as it would require the lender to engage in some degree of irresponsible lending.

Another approach is to create a test set. The dilemma here is whether to include only accepted applications. A test set that includes only accepted applications will not necessarily reflect the population for which the model will be used. Including rejected applications, however, obviously necessitates the use of reject inference. For all the reasons laid out above, this approach risks overstating the model’s performance because a similar model (trained only on the accepted cases) was used for reject inference.

A third approach that avoids both of these problems involves using information criteria such as AIC and BIC. This, however, is useful only when comparing different models (for model or variable selection). The values of information criteria cannot be interpreted as an absolute measure of performance.
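For reference, both criteria are simple functions of a model's maximized log-likelihood, its number of estimated parameters, and (for BIC) the number of observations. The two candidate models below are hypothetical numbers chosen only to show the comparison.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fitted models: (maximized log-likelihood, number of parameters).
small = (-520.0, 5)
large = (-515.0, 12)
n_obs = 1_000

print(aic(*small), aic(*large))
print(bic(*small, n_obs), bic(*large, n_obs))
```

Lower values are preferred, and both criteria must be computed on the same dataset for the comparison to be meaningful. Only the difference between models carries information, which is why the values cannot serve as an absolute performance measure.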

A final option is to consider utilizing several models in production (the main model and challenger models). Under this scenario, each application would be evaluated by a model selected at random. The models can then be compared retroactively by calculating their bad rates on accepted applications after the financed loans mature. Provided that the accept rates are similar, the model with the lowest bad rate is the best.
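A minimal sketch of this setup follows: applications are routed to a model at random, and once the accepted loans mature, the bad rate is computed per model. The model names and outcome counts are hypothetical, and the check that accept rates are similar is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def assign_model(models: List[str], rng: random.Random) -> str:
    """Pick the model that will score an incoming application, uniformly at random."""
    return rng.choice(models)


def bad_rates_by_model(outcomes: List[Tuple[str, int]]) -> Dict[str, float]:
    """Compute the bad rate per model from (model, charged_off) pairs on matured accepted loans."""
    counts = defaultdict(lambda: [0, 0])  # model -> [charge-offs, loans]
    for model, charged_off in outcomes:
        counts[model][0] += charged_off
        counts[model][1] += 1
    return {m: bad / total for m, (bad, total) in counts.items()}


rng = random.Random(42)
assignments = [assign_model(["main", "challenger"], rng) for _ in range(10)]

# Hypothetical matured outcomes on accepted loans: 1 = charged off, 0 = repaid.
outcomes = (
    [("main", 0)] * 96 + [("main", 1)] * 4
    + [("challenger", 0)] * 94 + [("challenger", 1)] * 6
)
rates = bad_rates_by_model(outcomes)
best = min(rates, key=rates.get)
print(rates, best)
```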


Reject inference remains an evolving field in credit modeling. Its ability to improve model performance is still the subject of intense debate. Current results suggest that while reject inference can improve model performance, its application can also lead to overfitting, thus worsening the ability to generalize. The lack of a strong theoretical basis for reject inference methods means that applications of reject inference need to rely on empirical results. Thus, if reject inference is used, key model stakeholders need to possess a deep understanding of the modeled population, have strong domain knowledge, emphasize conducting experiments to justify the applied modeling techniques, and, above all, adopt and follow a solid ongoing monitoring plan.

Doing this will result in a modeling methodology that is most likely to produce reliable outputs for the institutions while also satisfying MRM and validator requirements.
