We met with RiskSpan’s Head of Engineering and Development, Praveen Vairavan, to understand how his team set about optimizing analytics computational processing for a portfolio of 4 million mortgage loans using a cloud-based compute farm.

This interview dives deeper into a case study we discussed in a recent interview with RiskSpan’s co-founder, Suhrud Dagli.

Here is what we learned from Praveen. 


Speak to an Expert

Could you begin by summarizing for us the technical challenge this optimization was seeking to overcome? 

PV: The main challenge related to an investor’s MSR portfolio, specifically the volume of loans we were trying to run. The client has close to 4 million loans spread across nine different servicers. This presented two related but separate sets of challenges. 

The first set of challenges stemmed from needing to consume data from different servicers whose file formats not only differed from one another but also often lacked internal consistency. By that, I mean even the file formats from a single given servicer tended to change from time to time. This required us to continuously update our data mapping and (because the servicer reporting data is not always clean) modify our QC rules to keep up with evolving file formats.  

The second challenge relates to the sheer volume of compute power necessary to run stochastic paths of Monte Carlo rate simulations on 4 million individual loans and then discount the resulting cash flows based on option adjusted yield across multiple scenarios. 

And so you have 4 million loans times multiple paths times one basic cash flow, one basic option-adjusted case, one up case, and one down case, and you can see how quickly the workload adds up. And all this needed to happen on a daily basis. 

To help minimize the computing workload, our client had been running all these daily analytics at a rep-line level—stratifying and condensing everything down to between 70,000 and 75,000 rep lines. This alleviated the computing burden but at the cost of decreased accuracy because they couldn’t look at the loans individually. 

What technology enabled you to optimize the computational process of running 50 paths and 4 scenarios for 4 million individual loans?

PV: With the cloud, you have the advantage of spawning a bunch of servers on the fly (just long enough to run all the necessary analytics) and then shutting it down once the analytics are done. 

This sounds simple enough. But in order to use that level of compute servers, we needed to figure out how to distribute the 4 million loans across all these different servers so they can run in parallel (and then we get the results back so we could aggregate them). We did this using what is known as a MapReduce approach. 

Say we want to run a particular cohort of this dataset with 50,000 loans in it. If we were using a single server, it would run them one after the other – generate all the cash flows for loan 1, then for loan 2, and so on. As you would expect, that is very time-consuming. So, we decided to break down the loans into smaller chunks. We experimented with various chunk sizes. We started with 1,000 – we ran 50 chunks of 1,000 loans each in parallel across the AWS cloud and then aggregated all those results.  

That was an improvement, but the 50 parallel jobs were still taking longer than we wanted. And so, we experimented further before ultimately determining that the “sweet spot” was something closer to 5,000 parallel jobs of 100 loans each. 

Only in the cloud is it practical to run 5,000 servers in parallel. But this of course raises the question: Why not just go all the way and run 50,000 parallel jobs of one loan each? Well, as it happens, running an excessively large number of jobs carries overhead burdens of its own. And we found that the extra time needed to manage that many jobs more than offset the compute time savings. And so, using a fair bit of trial and error, we determined that 100-loan jobs maximized the runtime savings without creating an overly burdensome number of jobs running in parallel.  

Get A Demo

You mentioned the challenge of having to manage a large number of parallel processes. What tools do you employ to work around these and other bottlenecks? 

PV: The most significant bottleneck associated with this process is finding the “sweet spot” number of parallel processes I mentioned above. As I said, we could theoretically break it down into 4 million single-loan processes all running in parallel. But managing this amount of distributed computation, even in the cloud, invariably creates a degree of overhead which ultimately degrades performance. 

And so how do we find that sweet spot – how do we optimize the number of servers on the distributed computation engine? 

As I alluded to earlier, the process involved an element of trial and error. But we also developed some home-grown tools (and leveraged some tools available in AWS) to help us. These tools enable us to visualize computation server performance – how much of a load they can take, how much memory they use, etc. These helped eliminate some of the optimization guesswork.   

Is this optimization primarily hardware based?

PV: AWS provides essentially two “flavors” of machines. One “flavor” enables you to take in a lot of memory. This enables you to keep a whole lot of loans in memory so it will be faster to run. The other flavor of hardware is more processor based (compute intensive). These machines provide a lot of CPU power so that you can run a lot of processes in parallel on a single machine and still get the required performance. 

We have done a lot of R&D on this hardware. We experimented with many different instance types to determine which works best for us and optimizes our output: Lots of memory but smaller CPUs vs. CPU-intensive machines with less (but still a reasonably amount of) memory. 

We ultimately landed on a machine with 96 cores and about 240 GB of memory. This was the balance that enabled us to run portfolios at speeds consistent with our SLAs. For us, this translated to a server farm of 50 machines running 70 processes each, which works out to 3,500 workers helping us to process the entire 4-million-loan portfolio (across 50 Monte Carlo simulation paths and 4 different scenarios) within the established SLA.  

What software-based optimization made this possible? 

PV: Even optimized in the cloud, hardware can get pricey – on the order of $4.50 per hour in this example. And so, we supplemented our hardware optimization with some software-based optimization as well. 

We were able to optimize our software to a point where we could use a machine with just 30 cores (rather than 96) and 64 GB of RAM (rather than 240). Using 80 of these machines running 40 processes each gives us 2,400 workers (rather than 3,500). Software optimization enabled us to run the same number of loans in roughly the same amount of time (slightly faster, actually) but using fewer hardware resources. And our cost to use these machines was just one-third what we were paying for the more resource-intensive hardware. 

All this, and our compute time actually declined by 10 percent.  

The software optimization that made this possible has two parts: 

The first part (as we discussed earlier) is using the MapReduce methodology to break down jobs into optimally sized chunks. 

The second part involved optimizing how we read loan-level information into the analytical engine.  Reading in loan-level data (especially for 4 million loans) is a huge bottleneck. We got around this by implementing a “pre-processing” procedure. For each individual servicer, we created a set of optimized loan files that can be read and rendered “analytics ready” very quickly. This enables the loan-level data to be quickly consumed and immediately used for analytics without having to read all the loan tapes and convert them into a format that analytics engine can understand. Because we have “pre-processed” all this loan information, it is immediately available in a format that the engine can easily digest and run analytics on.  

This software-based optimization is what ultimately enabled us to optimize our hardware usage (and save time and cost in the process).  

Contact us to learn more about how we can help you optimize your mortgage analytics computational processing.