Mortgage Data and the Cloud – Now is the Time
As the trend toward cloud computing continues its march across an ever-expanding set of industries, it is worth pausing briefly to contemplate how it can benefit those of us who work with mortgage data for a living.
The inherent flexibility, efficiency, and scalability of the cloud-native systems driving this trend are clearly of value to users of financial services data. Mortgages in particular, each accompanied by a dizzying array of static and dynamic data about borrower incomes, employment, assets, property valuations, payment histories, and detailed loan terms, stand to reap the benefits of this shift.
And yet, many of my colleagues still catch themselves referring to mortgage data files as “tapes.”
Migrating to cloud evokes some of the shiniest words in the world of computing – cost reduction, security, reliability, agility – and that undoubtedly creates a stir. Cloud’s ability to provide on-demand access to servers, storage locations, databases, software and applications via the internet, along with the promise to ‘only pay for what you use’ further contributes to its popularity.
These benefits are especially well suited to mortgage data. They include:
- On-demand self-service and the ability to provision resources without human intervention – of particular use for mortgage portfolios that are constantly changing in both size and composition.
- Broad network access, allowing diverse platforms to reach shared resources over the network – valuable when origination, secondary marketing, structuring, servicing, and modeling tools all need simultaneous access to the same evolving datasets for different purposes.
- Multi-tenancy and resource pooling, allowing resource sharing while maintaining privacy and security.
- Rapid elasticity and scalability, allowing resources to be acquired and released quickly and capacity to scale up or down in measured response to demand.
Cloud-native systems reduce ownership and operational expenses, increase speed and agility, facilitate innovation, improve client experience, and even enhance security controls.
There is nothing quite like mortgage portfolios when it comes to massive quantities of financial data, often PII-laden, with high security requirements. The responsibility for protecting borrower privacy is the most frequently cited reason for financial institutions' reluctance to adopt the cloud. But perhaps counterintuitively, migrating on-premises applications to the cloud actually results in a more controlled environment, because it provides backup and access protocols that are not as easily implemented with on-premises solutions.
The cloud affords a sophisticated and more efficient way of securing mortgage data. In addition to eliminating costs associated with running and maintaining data centers, the cloud enables easy and fast access to data and applications anywhere and at any time. As remote work takes hold as a longer-term norm, cloud-native platforms help ensure employees can work effectively regardless of their location. Furthermore, the scalability of cloud-native data centers allows holders of mortgage assets to expand storage capacity as the portfolio grows and reduce it when the portfolio contracts. The cloud also protects mortgage data from security breaches and disaster events, because loan files are (by definition) backed up in a secure, remote location and easily restored without having to invest in expensive data retrieval methods.
This is not to say that migrating to the cloud is without its challenges. Entrusting sensitive data to a new third-party partner and relying on its technology to remain online will always carry some measure of risk. Cloud computing, like any other innovation, comes with its own advantages and disadvantages, but well-designed redundancies mitigate most of these uncertainties. Ultimately, the upside of being able to work with mortgage data on cloud-native solutions far outweighs the drawbacks. The cloud makes it possible for processes to become more efficient in real time, without expensive hardware enhancements. This in turn creates a more productive environment for data analysts and modelers seeking to give portfolio managers, servicers, securitizers, and others who routinely deal with mortgage assets the edge they are looking for.
Kriti Asrani is an associate data analyst at RiskSpan.
Want to read more on this topic? Check out COVID-19 and the Cloud.
Anomaly Detection and Quality Control
In our most recent workshop on Anomaly Detection and Quality Control (Part I), we discussed how clean market data is an integral part of producing accurate market risk results. Because incorrect and inconsistent market data is so prevalent in the industry, it is not surprising that the U.S. spends over $3 trillion on processes to identify and correct it.
Taking a step back, it is worth noting what drives accurate market risk analytics. Clearly, having accurate portfolio holdings, with correct terms and conditions for over-the-counter trades, is central to calculating consistent risk measures that are scaled to the market value of the portfolio. The use of well-tested, integrated, industry-standard pricing models is another key factor in producing reliable analytics. But compared with these two categories, the absence of clean, consistent market data is the largest contributor to poor market risk analytics. The key factor driving the detection and correction (or transformation) of market data is risk and portfolio managers' expectation that risk results will be accurate at the start of the business day, with no need for time-consuming re-runs during the day to correct issues found.
Broadly defined, market data is any data used as an input to re-valuation models. This includes equity prices, interest rates, credit spreads, FX rates, volatility surfaces, and so on.
Market data needs to be:
- Complete – no true gaps when looking back historically.
- Accurate
- Consistent – data must be viewed alongside related data points to judge its accuracy (e.g., interest rates across tenor buckets, volatilities across a volatility surface)
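As a rough illustration of what automated checks along these lines might look like, the sketch below runs simple completeness, staleness, and cross-tenor consistency tests over a hypothetical daily rate history in pandas. The file name, column names, and thresholds are illustrative assumptions, not part of any production workflow.

```python
import pandas as pd

# Hypothetical daily rate history: one column per tenor bucket, indexed by date.
# File name, column names, and thresholds below are illustrative only.
rates = pd.read_csv("rates_history.csv", index_col="date", parse_dates=True)

# Completeness: business days with no observation at all.
expected_days = pd.bdate_range(rates.index.min(), rates.index.max())
missing_days = expected_days.difference(rates.index)

# Staleness: tenors whose value has not moved for N consecutive days.
STALE_DAYS = 5
stale = (rates.diff() == 0).astype(int).rolling(STALE_DAYS).sum() >= STALE_DAYS

# Consistency: a crude cross-tenor sanity check, e.g., the 2y rate sitting far
# above the 10y rate (50bp tolerance here, purely illustrative).
inconsistent = rates["2y"] > rates["10y"] + 0.50

print(f"{len(missing_days)} missing business days")
print(f"{int(stale.any(axis=1).sum())} dates with at least one stale tenor")
print(f"{int(inconsistent.sum())} dates failing the 2y/10y consistency check")
```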
Anomaly types can be broken down into four major categories:
- Spikes
- Stale data
- Missing data
- Inconsistencies
Here are three examples of “bad” market data:
Credit Spreads
The following chart depicts day-over-day changes in credit spreads for the 10-year consumer cyclical time series, returned from an external vendor. The changes indicate a significant spike on 12/3 that caused big swings, up and down, across multiple rating buckets. Without an adjustment to this data, key risk measures would show significant jumps, up and down, depending on the dollar value of positions on two consecutive days.
Swaption Volatilities
Market data also includes volatilities, which drive deltas and, potentially, hedging. The following chart shows implied swaption volatilities for different maturities of swaptions and their underlying swaps. Note the spikes in the 7×10 and 10×10 swaptions. The chart also highlights inconsistencies between different tenors and maturities.
Equity Implied Volatilities
The 146 and 148 strikes in the table below reflect inconsistent vol data, as often occurs around expiration.
The detection of market data inconsistencies needs to be an automated process, with multiple approaches targeted at specific types of market data. The detection models need to evolve over time as more information is gathered, with the goal of reducing false negatives to a manageable level. Once the models detect anomalies, the next step is to automate the transformation of the market data (e.g., backfill, interpolate, or use the prior day's value). Every transformation must also be recorded transparently, so that it is clear which values were changed or populated when unavailable. Sharing this record with clients can in turn lead to alternative transformations or detection routines.
Detector types typically fall into the following categories:
- Extreme Studentized Deviate (ESD): finds outliers in a single data series (helpful for extreme cases.)
- Level Shift: detects change in level by comparing means of two sliding time windows (useful for local outliers.)
- Local Outliers: detects spikes relative to neighboring values.
- Seasonal Detector: detects seasonal patterns and anomalies (used for contract expirations and other events.)
- Volatility Shift: detects shift of volatility by tracking changes in standard deviation.
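To make the first two categories concrete, here is a minimal pandas sketch of a spike detector (a rolling z-score stand-in for an ESD-style test on day-over-day changes) and a level-shift detector (comparing the means of two sliding windows). The function names, window lengths, and thresholds are illustrative assumptions; production detectors would be tuned per data type.

```python
import pandas as pd

def spike_detector(series: pd.Series, window: int = 20, threshold: float = 4.0) -> pd.Series:
    """Flag points whose day-over-day change is extreme relative to a trailing
    window -- a simplified, rolling z-score stand-in for an ESD-style test."""
    changes = series.diff()
    z = (changes - changes.rolling(window).mean()) / changes.rolling(window).std()
    return z.abs() > threshold

def level_shift_detector(series: pd.Series, window: int = 10, threshold: float = 3.0) -> pd.Series:
    """Flag points where the mean of the trailing window differs sharply from
    the mean of the leading window (two sliding windows compared at each point)."""
    trailing = series.rolling(window).mean()
    leading = series[::-1].rolling(window).mean()[::-1]
    scale = series.rolling(2 * window).std()
    return ((leading - trailing).abs() / scale) > threshold

# Usage on a hypothetical credit-spread series:
# spreads = pd.read_csv("cc_10y_spreads.csv", index_col="date", parse_dates=True)["spread"]
# flags = spike_detector(spreads) | level_shift_detector(spreads)
```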
On Wednesday, May 19th, we will present a follow-up workshop focusing on:
- Coding examples
- Application of outlier detection and pipelines
- PCA
- Specific loan use cases
- Loan performance
- Entity correction
- Novelty Detection
- Anomalies are not always “bad”
- Market monitoring models
You can register for this complimentary workshop here.
Leveraging ML to Enhance the Model Calibration Process
Last month, we outlined an approach to continuous model monitoring and discussed how practitioners can leverage the results of that monitoring for advanced analytics and enhanced end-user reporting. In this post, we apply this idea to enhanced model calibration.
Continuous model monitoring is a key part of a modern model governance regime. But testing performance as part of the continuous monitoring process has value that extends beyond immediate governance needs. Using machine learning and other advanced analytics, testing results can also be further explored to gain a deeper understanding of model error lurking within sub-spaces of the population.
Below we describe how we leverage automated model back-testing results (using our machine learning platform, Edge Studio) to streamline the calibration process for our own residential mortgage prepayment model.
The Problem:
MBS prepayment models, RiskSpan’s included, often provide a number of tuning knobs to tweak model results. These knobs impact the various components of the S-curve function, including refi sensitivity, turnover lever, elbow shift, and burnout factor.
The knob tuning and calibration process is typically messy and iterative. It usually involves somewhat subjectively selecting certain sub-populations to calibrate, running back-testing to see where and how the model is off, and then tweaking knobs and rerunning the back-test to see the impacts. The modeler may need to iterate through a series of different knob selections and groupings to figure out which combination best fits the data. This is manually intensive work and can take a lot of time.
As part of our continuous model monitoring process, we had already automated the process of generating back-test results and merging them with actual performance history. But we wanted to explore ways of taking this one step further to help automate the tuning process — rerunning the automated back-testing using all the various permutations of potential knobs, but without all the manual labor.
The solution applies machine learning techniques to run a series of back-tests on MBS pools and automatically solve for the set of tuners that best aligns model outputs with actual results.
We break the problem into two parts, followed by a validation step:
- Find Cohorts: Cluster pools into groups that exhibit similar key pool characteristics and model error (so they would need the same tuners).
TRAINING DATA: Back-testing results for our universe of pools with no model tuning knobs applied
- Solve for Tuners: Minimize back-testing error by optimizing knob settings.
TRAINING DATA: Back-testing results for our universe of pools under a variety of permutations of potential tuning knobs (Refi x Turnover)
- Tuning knobs validation: Take optimized tuning knobs for each cluster and rerun pools to confirm that the selected permutation in fact returns the lowest model errors.
Part 1: Find Cohorts
We define model error as the ratio of the average modeled SMM to the average actual SMM. We compute this using back-testing results and then use a hierarchical clustering algorithm to cluster the data based on model error across various key pool characteristics.
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by either merging or splitting observations successively. The hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the root cluster that contains all samples, while the leaves represent clusters with only one sample. [1]
Agglomerative clustering is an implementation of hierarchical clustering that takes a bottom-up (merging) approach: each observation starts in its own cluster, and clusters are successively merged together. Several linkage criteria are available; we used the Ward criterion.
Ward linkage strategy minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach.[2]
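A minimal sketch of this step using scikit-learn's AgglomerativeClustering with Ward linkage appears below. The input file, feature columns, and number of clusters are illustrative assumptions; the essential idea is to cluster pools on model error (average modeled SMM divided by average actual SMM) together with key pool characteristics.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Hypothetical back-test summary: one row per pool with the untuned model error
# (avg modeled SMM / avg actual SMM) and key pool characteristics.
pools = pd.read_csv("backtest_untuned.csv")
features = ["model_error", "wac", "wala", "fico", "ltv"]  # illustrative columns

X = StandardScaler().fit_transform(pools[features])

# Ward linkage merges the pair of clusters that yields the smallest increase
# in within-cluster variance at each step.
clusterer = AgglomerativeClustering(n_clusters=8, linkage="ward")
pools["cluster"] = clusterer.fit_predict(X)

print(pools.groupby("cluster")["model_error"].describe())
```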
Part 2: Solving for Tuners
Here the training data is expanded to include multiple back-test results for each pool under different permutations of tuning knobs.
Process to Optimize the Tuners for Each Cluster
Training Data: Rerun the back-test with permutations of REFI and TURNOVER tunings, covering all reasonably possible combinations of tuners.
- These tuning-permutation results are fed to a multi-output regressor, whose fitting step learns the interaction between each tuning parameter and the model.
- Model Error and Pool Features are used as Independent Variables
- Gradient Tree Boosting/Gradient Boosted Decision Trees (GBDT)* methods are used to find the optimized tuning parameters for each cluster of pools derived from the clustering step
- Two dependent variables — Refi Tuner and Turnover Tuner – are used
- Separate models are estimated for each cluster
- We solve for the optimal tuning parameters by running the resulting model with a model error ratio of 1 (no error) and the weighted average cluster features.
* Gradient Tree Boosting/Gradient Boosted Decision Trees (GBDT) is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient boosted trees, which usually outperforms random forest. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. [3]
* We used scikit-learn’s GBDT implementation to optimize and solve for the best Refi and Turnover tuners. [4]
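The sketch below shows one way this fitting and solving step might look with scikit-learn, wrapping GradientBoostingRegressor in a MultiOutputRegressor so the two tuners are predicted jointly. The file name, feature columns, and hyperparameters are illustrative assumptions, and a simple mean stands in for the weighted-average cluster features.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Hypothetical training set for one cluster: back-test results under many
# (refi_tuner, turnover_tuner) permutations.
runs = pd.read_csv("backtest_tuner_permutations.csv")

X_cols = ["model_error", "wac", "wala", "fico", "ltv"]  # illustrative features
y_cols = ["refi_tuner", "turnover_tuner"]               # the two knobs to solve for

model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=300, max_depth=3))
model.fit(runs[X_cols], runs[y_cols])

# Solve for the knobs: ask the fitted model which tuners correspond to no model
# error (ratio of 1.0) at the cluster's average characteristics (a simple mean
# here, standing in for the weighted average described above).
query = runs[X_cols].mean().to_frame().T
query["model_error"] = 1.0
refi_tuner, turnover_tuner = model.predict(query)[0]
print(f"Suggested Refi tuner: {refi_tuner:.2f}, Turnover tuner: {turnover_tuner:.2f}")
```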
Results
The resultant suggested knobs show promise in improving model fit over our back-test period. Below are the results for two of the clusters using the knobs suggested by the process. To further expand the results, we plan to cross-validate on out-of-time sample data as it comes in.
Conclusion
These advanced analytics show promise in their ability to help streamline the model calibration and tuning process by removing many of the time-consuming and subjective components from the process altogether. Once a process like this is established for one model, applying it to new populations and time periods becomes more straightforward. This analysis can be further extended in a number of ways. One in particular we’re excited about is the use of ensemble models—or a ‘model of models’ approach. We will continue to tinker with this approach as we calibrate our own models and keep you apprised on what we learn.
Too Many Documentation Types? A Data-Driven Approach to Consolidating Them
The sheer volume of different names assigned to various documentation types in the non-agency space has really gotten out of hand, especially in the last few years. As of February 2021, an active loan in the CoreLogic RMBS universe could have any of over 250 unique documentation type names, with little or no standardization from issuer to issuer. Even within a single issuer, things get complicated when every possible permutation of the same basic documentation level gets assigned its own type. One issuer in the database has 63 unique documentation names!
In order for investors to be able to understand and quantify their exposure, we need a way of consolidating and mapping all these different documentation types to a simpler, standard nomenclature. Various industry reports attempt to group all the different documentation levels into meaningful categories. But these classifications often fail to capture important distinctions in delinquency performance among different documentation levels.
There is a better way. Taking some of the consolidated group names from the various industry and rating agency reports as a starting point, we took another pass focusing on two main elements:
- The delinquency performance of the group. We focused on the 60-DPD rate while also considering other drivers of loan performance (e.g., DTI, FICO, and LTV) and their correlation to the various doc type groups.
- The size of the sub-segment. We ensured our resulting groupings were large enough to be meaningful.
What follows is how we thought about it and ultimately landed where we did. These mappings are not set in stone and will likely need to undergo revisions as 1) new documentation types are generated, and 2) additional performance data and feedback from clients on what they consider most important become available. Releasing these mappings into RiskSpan’s Edge Platform will then make it easier for users to track performance.
Data Used
We take a snapshot of all loans outstanding in non-agency RMBS issued after 2013, as of the February 2021 activity period. The data comes from CoreLogic and we exclude loans in seasoned or reperforming deals. We also exclude loans whose documentation type is not reported, some 14 percent of the population.
Approach
We are seeking to create sub-groups that generally conform to the high-level groups on which the industry seems to be converging while also identifying subdivisions with meaningfully different delinquency performance. We will rely on these designations as we re-estimate our credit model.
Steps in the process:
- Start with high-level groupings based on how the documentation type is currently named.
- Full Documentation: Any name referencing ‘Agency,’ ‘Agency AUS,’ or similar.
- Bank Statements: Any name including the term “Bank Statement[s].”
- Investor/DSCR: Any name indicating that the underwriting relied on net cash flows to the secured property.
- Alternative Documentation: A wide-ranging group consolidating many different types, including: asset qualifier, SISA/SIVA/NINA, CPA letters, etc.
- Other: Any name that does not easily classify into one of the groups above, such as Foreign National Income, and any indecipherable names.
- We subdivided the Alternative Documentation group by some of the meaningfully sized natural groupings of the names:
- Asset Depletion or Asset Qualifier
- CPA and P&L statements
- Salaried/Wage Earner: Includes anything with W2 tax return
- Tax Returns or 1099s: Includes anything with ‘1099’ or ‘Tax Return,’ but not ‘W2.’
- Alt Doc: Anything that remained, including items like ‘SIVA,’ ‘SISA,’ ‘NINA,’ ‘Streamlined,’ ‘WVOE,’ and ‘Alt Doc.’
- From there we sought to identify any sub-groups that perform differently (as measured by 60-DPD%).
- Bank Statement: We evaluated a subdivision by the number of statements provided (less than 12 months, 12 months, and greater than 12 months). However, these distinctions did not significantly impact delinquency performance. (Also, very few loans fell into the under 12 months group.) Distinguishing ‘Business Bank Statement’ loans from the general ‘Bank Statements’ category, however, did yield meaningful performance differences.
- Alternative Documentation: This group required the most iteration. We initially focused our attention on documentation types that included terms like ‘streamlined’ or ‘fast.’ This, however, did not reveal any meaningful performance differences relative to other low doc loans. We also looked at this group by issuer, hypothesizing that some programs might perform better than others. The jury is still out on this analysis and we continue to track it. The following subdivisions yielded meaningful differences:
- Limited Documentation: This group includes any names including the terms ‘reduced,’ ‘limited,’ ‘streamlined,’ and ‘alt doc.’ This group performed substantially better than the next group.
- No Doc/Stated: Not surprisingly, these were the worst performers in the ‘Alt Doc’ universe. The types included here are a throwback to the run-up to the housing crisis. ‘NINA,’ ‘SISA,’ ‘No Doc,’ and ‘Stated’ all make a reappearance in this group.
- Loans with some variation of ‘WVOE’ (written verification of employment) showed very strong performance, so much so that we created an entirely separate group for them.
- Full Documentation: Within the variations of ‘Full Documentation’ was a whole sub-group with qualifying terms attached. Examples include ‘Full Doc 12 Months’ or ‘Full w/ Asset Assist.’ These full-doc-with-qualification loans were associated with higher delinquency rates. The sub-groupings reflect this reality:
- Full Documentation: Most of the straightforward types indicating full documentation, including anything with ‘Agency/AUS.’
- Full with Qualifications (‘Full w/ Qual’): Everything including the term ‘Full’ followed by some sort of qualifier.
- Investor/DSCR: The sub-groups here either were not big enough or did not demonstrate sufficient performance difference.
- Other: Even though it’s a small group, we broke out all the ‘Foreign National’ documentation types into a separate group to conform with other industry reporting.
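As a rough illustration of how this kind of keyword-driven consolidation can be automated, the sketch below maps raw documentation-type strings to consolidated groups with ordered regular-expression rules. The patterns and group names are deliberately simplified assumptions and capture only a subset of the distinctions described above (for example, the Full-with-Qualifications split is omitted).

```python
import re

# Ordered (pattern, group) rules: the first match wins. Patterns are
# intentionally simplified relative to the full mapping described above.
DOC_TYPE_RULES = [
    (r"business\s+bank\s+statement",                 "Business Bank Statement"),
    (r"bank\s+statement",                            "Bank Statement"),
    (r"dscr|debt\s+service|investor\s+cash\s*flow",  "Investor/DSCR"),
    (r"wvoe|verification\s+of\s+employment",         "WVOE"),
    (r"asset\s+(depletion|qualifier)",               "Asset Depletion/Qualifier"),
    (r"cpa|p\s*&\s*l",                               "CPA / P&L"),
    (r"w-?2",                                        "Salaried/Wage Earner"),
    (r"1099|tax\s+return",                           "Tax Returns or 1099s"),
    (r"nina|sisa|no\s+doc|stated",                   "No Doc/Stated"),
    (r"reduced|limited|streamlined|alt\s+doc",       "Limited Documentation"),
    (r"foreign\s+national",                          "Other - Foreign National"),
    (r"full|agency|aus",                             "Full Documentation"),
]

def map_doc_type(raw_name: str) -> str:
    """Map a raw issuer-reported documentation type to a consolidated group."""
    name = raw_name.lower()
    for pattern, group in DOC_TYPE_RULES:
        if re.search(pattern, name):
            return group
    return "Other"

print(map_doc_type("Agency AUS"))                        # Full Documentation
print(map_doc_type("24 Month Business Bank Statement"))  # Business Bank Statement
print(map_doc_type("Stated Income Stated Asset"))        # No Doc/Stated
```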
Among the challenges of this sort of analysis is that the combinations to explore are virtually limitless. Perhaps not surprisingly, most of the potential groupings we considered did not make it into our final mapping. Some of the cuts we are still looking at include loan purpose with respect to some of the alternative documentation types.
We continue to evaluate these and other options. We can all agree that 250 documentation types is way too many. But in order to be meaningful, the process of consolidation cannot be haphazard. Fortunately, the tools for turning sub-grouping into a truly data-driven process are available. We just need to use them.
RiskSpan a Winner of HousingWire’s Tech100 Award
For the third consecutive year, RiskSpan is a winner of HousingWire’s prestigious annual HW Tech100 Mortgage award, recognizing the most innovative technology companies in the housing economy.
The recognition is the latest in a parade of 2021 wins for the data and analytics firm whose unique blend of tech and talent enables traders and portfolio managers to transact quickly and intelligently to find opportunities. RiskSpan’s comprehensive solution also provides risk managers access to modeling capabilities and seamless access to the timely data they need to do their jobs effectively.
“I’ve been involved in choosing Tech100 winners since we started the program in 2014, and every year it manages to get more competitive,” HousingWire Editor in Chief Sarah Wheeler said. “These companies are truly leading the way to a more innovative housing market!”
Other major awards collected by RiskSpan and its flagship Edge Platform in 2021 include winning Chartis Research’s “Risk as a Service” category and being named “Buy-side Market Risk Management Product of the Year” by Risk.net.
RiskSpan’s cloud-native Edge platform is valued by users seeking to run structured products analytics fast and granularly. It provides a one-stop shop for models and analytics that previously had to be purchased from multiple vendors. The platform is supported by a first-rate team, most of whom come from industry and have walked in the shoes of our clients.
“After the uncertainty and unpredictability of last year, we expected a greater adoption of technology. However, these 100 real estate and mortgage companies took digital disruption to a whole new level and propelled a complete digital revolution, leaving a digital legacy that will impact borrowers, clients and companies for years to come,” said Brena Nath, HousingWire’s HW+ Managing Editor. ”Knowing what these companies were able to navigate and overcome, we’re excited to announce this year’s list of the most innovative technology companies serving the mortgage and real estate industries.”
Get in touch with us to explore why RiskSpan is a best-in-class partner for data and analytics in mortgage and structured finance.
HousingWire is the most influential source of news and information for the U.S. mortgage and housing markets. Built on a foundation of independent and original journalism, HousingWire reaches over 60,000 newsletter subscribers daily and over 1.0 million unique visitors each month. Our audience of mortgage, real estate and fintech professionals rely on us to Move Markets Forward. Visit www.housingwire.com or www.solutions.housingwire.com to learn more.
Is Free Public Data Worth the Cost?
No such thing as a free lunch.
The world is full of free (and semi-free) datasets ripe for the picking. If it’s not going to cost you anything, why not supercharge your data and achieve clarity where once there was only darkness?
But is it really not going to cost you anything? What is the total cost of ownership for a public dataset, and what does it take to distill truly valuable insights from publicly available data? Setting aside the reliability of the public source (a topic for another blog post), free data is anything but free. Let us discuss both the power and the cost of working with public data.
To illustrate the point, we borrow from a classic RiskSpan example: anticipating losses to a portfolio of mortgage loans due to a hurricane—a salient example as we are in the early days of the 2020 hurricane season (and the National Oceanic and Atmospheric Administration (NOAA) predicts a busy one). In this example, you own a portfolio of loans and would like to understand the possible impacts to that portfolio (in terms of delinquencies, defaults, and losses) of a recent hurricane. You know this will likely require an external data source because you do not work for NOAA, your firm is new to owning loans in coastal areas, and you currently have no internal data for loans impacted by hurricanes.
Know the Data.
The first step in using external data is understanding your own data. This may seem like a simple task. But data, its source, its lineage, and its nuanced meaning can be difficult to communicate inside an organization. Unless you work with a dataset regularly (i.e., often), you should approach your own data as if it were provided by an external source. The goal is a full understanding of the data, the data’s meaning, and the data’s limitations, all of which should have a direct impact on the types of analysis you attempt.
Understanding the structure of your data and the limitations it puts on your analysis involves questions like:
- What objects does your data track?
- Do you have time series records for these objects?
- Do you only have the most recent record? The most recent 12 records?
- Do you have one record that tries to capture life-to-date information?
Understanding the meaning of each attribute captured in your data involves questions like:
- What attributes are we tracking?
- Which attributes are updated (monthly or quarterly) and which remain static?
- What are the nuances in our categorical variables? How exactly did we assign the zero-balance code?
- Is original balance the loan’s balance at mortgage origination, or the balance when we purchased the loan/pool?
- Do our loss numbers include forgone interest?
These same types of questions also apply to understanding external data sources, but the answers are not always as readily available. Depending on the quality and availability of the documentation for a public dataset, this exercise may be as simple as just reading the data dictionary, or as labor intensive as generating analytics for individual attributes, such as mean, standard deviation, mode, or even histograms, to attempt to derive an attribute’s meaning directly from the delivered data. This is the not-free part of “free” data, and skipping this step can have negative consequences for the quality of analysis you can perform later.
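For the labor-intensive end of that spectrum, the sketch below shows what a basic attribute-profiling pass might look like in pandas: missingness, dtype, distributional summaries for numeric fields, and frequency tables for categorical flags. The file path and column handling are illustrative assumptions rather than a reference to any specific public dataset's layout.

```python
import pandas as pd

# Hypothetical extract of a public loan-level file; the path and handling of
# column types are illustrative only.
df = pd.read_csv("public_loan_data.csv", low_memory=False)

for col in df.columns:
    series = df[col]
    print(f"\n=== {col} ===")
    print(f"dtype: {series.dtype}, missing: {series.isna().mean():.1%}")
    if pd.api.types.is_numeric_dtype(series):
        # Distributional summary helps spot units, caps, and sentinel values.
        print(series.describe()[["mean", "std", "min", "max"]])
    else:
        # Frequency table helps decode categorical flags (e.g., zero-balance codes).
        print(series.value_counts(dropna=False).head(10))
```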
Returning to our example, we require at least two external data sets:
- where and when hurricanes have struck, and
- loan performance data for mortgages active in those areas at those times.
The obvious choice for loan performance data is the historical performance datasets from the GSEs (Fannie Mae and Freddie Mac). Providing monthly performance information and loss information for defaulted loans for a huge sample of mortgage loans over a 20-year period, these two datasets are perfect for our analysis. For hurricanes, some manual effort is required to extract date, severity, and location from NOAA maps like these (you could get really fancy and gather zip codes covered in the landfall area—which, by leaving out homes hundreds of miles away from expected landfall, would likely give you a much better view of what happens to loans actually impacted by a hurricane—but we will stick to state-level in this simple example).
Make new data your own.
So you’ve downloaded the historical datasets, you’ve read the data dictionaries cover-to-cover, you’ve studied historical NOAA maps, and you’ve interrogated your own data teams for the meaning of internal loan data. Now what? This is yet another cost of “free” data: after all your effort to understand and ingest the new data, all you have is another dataset. A clean, well-understood, well-documented (you’ve thoroughly documented it, haven’t you?) dataset, but a dataset nonetheless. Getting the insights you seek requires a separate effort to merge the old with the new. Let us look at a simplified flow for our hurricane example:
- Subset the GSE data for active loans in hurricane-related states in the month prior to landfall. Extract information for these loans for 12 months after landfall.
- Bucket the historical loans by the characteristics you use to bucket your own loans (LTV, FICO, delinquency status before landfall, etc.).
- Derive delinquency and loss information for the buckets for the 12 months after the hurricane.
- Apply the observed delinquency and loss information to your loan portfolio (bucketed using the same scheme you used for the historical loans).
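A rough pandas sketch of these four steps appears below. The storm footprint, file names, column names, buckets, and delinquency-status convention are all illustrative assumptions; working with the actual GSE files requires considerably more care (and memory) than this outline suggests.

```python
import pandas as pd

LANDFALL = pd.Period("2017-09", freq="M")         # e.g., a September landfall
STORM_STATES = ["FL", "GA", "SC"]                 # state-level footprint (illustrative)
BUCKETS = {"fico": [0, 660, 700, 740, 850], "ltv": [0, 60, 80, 95, 200]}

# 1. Subset: loans active in affected states the month before landfall,
#    then keep their next 12 months of history. The parquet extract, column
#    names, and delinquency convention are hypothetical.
perf = pd.read_parquet("gse_performance.parquet")
active_ids = perf.loc[(perf["period"] == LANDFALL - 1)
                      & perf["state"].isin(STORM_STATES)
                      & perf["zero_balance_code"].isna(), "loan_id"]
window = perf[perf["loan_id"].isin(active_ids)
              & perf["period"].between(LANDFALL, LANDFALL + 11)].copy()

# 2. Bucket the historical loans the same way you bucket your own portfolio.
for col, bins in BUCKETS.items():
    window[f"{col}_bucket"] = pd.cut(window[col], bins)

# 3. Derive post-landfall 60+ DPD rates per bucket.
window["dq60"] = window["delinquency_status"] >= 2
hurricane_dq = (window.groupby(["fico_bucket", "ltv_bucket"], observed=True)["dq60"]
                .mean().rename("expected_dq60").reset_index())

# 4. Apply those rates to your own portfolio, bucketed with the same scheme.
portfolio = pd.read_csv("my_portfolio.csv")
for col, bins in BUCKETS.items():
    portfolio[f"{col}_bucket"] = pd.cut(portfolio[col], bins)
expected = portfolio.merge(hurricane_dq, on=["fico_bucket", "ltv_bucket"], how="left")
print(expected.groupby("fico_bucket", observed=True)["expected_dq60"].mean())
```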
And there you have it—not a model, but a grounded expectation of loan performance following a hurricane. You have stepped out of the darkness and into the data-driven light. And all using free (or “free”) data!
Hyperbole aside, nothing about our example analysis is easy, but it plainly illustrates the power and cost of publicly available data. The power is obvious in our example: without the external data, we have no basis for generating an expectation of losses after a hurricane. While we should be wary of the impacts of factors not captured by our datasets (like the amount and effectiveness of government intervention after each storm – which does vary widely), the historical precedent we find by averaging many storms can form the basis for a robust and defensible expectation. Even if your firm has had experience with loans in hurricane-impacted areas, expanding the sample size through this exercise bolsters confidence in the outcomes. Generally speaking, the use of public data can provide grounded expectations where there had been only anecdotes.
But this power does come at a price—a price that should be appreciated and factored into the decision whether to use external data in the first place. What is worse than not knowing what to expect after a hurricane? Having an expectation based on bad or misunderstood data. Failing to account for the effort required to ingest and use free data can lead to bad analysis and the temptation to cut corners. The effort required in our example is significant: the GSE data is huge, complicated, and will melt your laptop’s RAM if you are not careful. Turning NOAA PDF maps into usable data is not a trivial task, especially if you want to go deeper than the state level. Understanding your own data can be a challenge. Applying an appropriate bucketing to the loans can make or break the analysis. Not all public datasets present these same challenges, but all public datasets present costs. There simply is no such thing as a free lunch. The returns on free data frequently justify these costs. But they should be understood before unwittingly incurring them.
Webinar: Data Analytics and Modeling in the Cloud – June 24th
On Wednesday, June 24th, at 1:00 PM EDT, join Suhrud Dagli, RiskSpan’s co-founder and chief innovator, and Gary Maier, managing principal of Fintova for a free RiskSpan webinar.
Suhrud and Gary will contrast the pros and cons of analytic solutions native to leading cloud platforms and share tips for ensuring data security and managing costs.
Click here to register for the webinar.
Webinar: Using Machine Learning in Whole Loan Data Prep
Using Machine Learning in Whole Loan Data Prep
Tackle one of your biggest obstacles: Curating and normalizing multiple, disparate data sets.
Learn from RiskSpan experts:
- How to leverage machine learning to help streamline whole loan data prep
- Innovative ways to manage the differences in large data sets
- How to automate ‘the boring stuff’
About The Hosts
LC Yarnelle
Director – RiskSpan
LC Yarnelle is a Director with experience in financial modeling, business operations, requirements gathering and process design. At RiskSpan, LC has worked on model validation and business process improvement/documentation projects. He also led the development of one of RiskSpan’s software offerings and has led multiple development projects for clients, utilizing both Waterfall and Agile frameworks. Prior to RiskSpan, LC was an analyst at NVR Mortgage in the secondary marketing group in Reston, VA, where he was responsible for daily pricing, as well as ongoing process improvement activities. Before a career move into finance, LC was the director of operations and a minority owner of a small business in Fort Wayne, IN. He holds a BA from Wittenberg University, as well as an MBA from Ohio State University.
Matt Steele
Senior Analyst – RiskSpan
Residential Mortgage REIT: End to End Loan Data Management and Analytics
An inflexible, locally installed risk management system with dated technology required a large IT staff to support it and was incurring high internal maintenance costs.
Absent a single solution, the use of multiple vendors for pricing and risk analytics, prepay/credit models and data storage created inefficiencies in workflow and an administrative burden to maintain.
Inconsistent data and QC across the various sources were also creating a number of data integrity issues.
The Solution
An end-to-end data and risk management solution. The REIT implemented RiskSpan’s Edge Platform, which delivers value through cost and operational efficiencies.
- Scalable, cloud-native technology
- Increased flexibility to run analytics at loan level; additional interactive / ad-hoc analytics
- Reliable, accurate data with more frequent updates
Deliverables
Consolidating from five vendors down to a single platform enabled the REIT to streamline workflows and automate processes, resulting in a 32% annual cost savings and 46% fewer resources required for maintenance.