Machine Learning for Mortgage Data Normalization

Data is at the core of everything RiskSpan does—from model builds and validations to data mining and visualization projects. Because all of this work relies, to one degree or another, on data as its "raw materials," some amount of data normalization and cleaning is typically the first task on any project we undertake.

Consequently, efforts to expand and refine RiskSpan's Edge platform rely on a variety of activities designed to clean, expand, and validate the data the platform uses as its raw material. Once the data is clean, attributes need to be normalized across datasets so that users can move expeditiously among datasets without having to learn a whole new set of definitions. An ever-expanding list of datasets and asset classes makes it increasingly complex to prepare the data and document the underlying processes in a way that is transparent and meaningful to the end user.

A forthcoming series of blog posts will outline a few of the techniques we use to normalize and document the data we ingest onto the Edge platform. These techniques extend beyond rules-based E-T-Ls and spreadsheet-style data dictionaries to include implementations of machine learning and graph databases. Each method is deployed to solve a specific issue, so it is useful to discuss them individually, in the context of the problem each is trying to solve.

This blog series will reference a generic, loan-level dataset of single-family mortgages. The relative simplicity of the dataset will enable us to examine the various ways in which it interacts with our existing mortgage loan datasets—including the complex factors driving its similarities and differences.

Problem 1: Standard E-T-L Mapping

Mapping the source location to the target location lies at the core of any E-T-L operation—sometimes it is the only translation being done. In our mortgage loan example, this equates to finding the attribute among the raw data that should be understood and saved as Original UPB (unpaid principal balance). Normally the job of business analysts, this mapping exercise can be both tedious and time-consuming. RiskSpan supplements these analyst efforts with machine learning models that work to identify Original UPB in each new dataset. By presenting analysts with a prediction about which raw attribute represents Original UPB, the models significantly reduce the time it takes to onboard a new dataset while also reducing the number of mapping errors. In a separate blog post, this system's author will discuss the cluster of models we use to perform this work and walk through an example of the tool at work on our sample mortgage loan dataset.
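
To make this concrete, here is a minimal sketch of how such a prediction could work: a classifier scores each raw column on a handful of features derived from its name and values, and the highest-scoring column is surfaced to the analyst as the likely Original UPB. The features, toy training data, and choice of a random forest are assumptions made purely for illustration; they are not the actual cluster of models described above.

```python
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def column_features(name, values):
    """Build simple features from a column's name and its raw values."""
    values = [str(v) for v in values]
    numeric = [float(v) for v in values if re.fullmatch(r"-?\d+(\.\d+)?", v)]
    name_tokens = set(re.split(r"[^a-z]+", name.lower()))
    return [
        len(numeric) / max(len(values), 1),                  # share of numeric values
        np.log10(np.mean(numeric)) if numeric else 0.0,      # typical order of magnitude
        int(bool({"upb", "balance", "bal", "amount"} & name_tokens)),  # name hints
        int("orig" in name.lower()),                          # "original" hint
    ]

# Toy training set: feature vectors labeled 1 if an analyst mapped the
# column to Original UPB, 0 otherwise. Column names are hypothetical.
X_train = [
    column_features("orig_upb", ["250000", "417000", "98000"]),
    column_features("loan_id", ["A1001", "A1002", "A1003"]),
    column_features("note_rate", ["3.75", "4.125", "2.99"]),
    column_features("original_balance", ["150000.00", "310000.00", "87500.00"]),
]
y_train = [1, 0, 0, 1]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Score every column in a new, unmapped dataset and surface the best candidate.
new_columns = {
    "acct_num": ["0001", "0002", "0003"],
    "beg_bal": ["225000.00", "510000.00", "143000.00"],
    "int_rate": ["3.5", "4.0", "4.25"],
}
scores = {
    name: model.predict_proba([column_features(name, vals)])[0][1]
    for name, vals in new_columns.items()
}
print(max(scores, key=scores.get), scores)  # e.g. 'beg_bal' scores highest
```

In practice, the analyst still confirms the suggested mapping; the model's job is simply to put the most plausible candidate at the top of the list.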

Problem 2: Unknown Formatting of Numeric Values

Knowing which attribute represents a loan's Original UPB is only half the battle. Understanding how the data is formatted is equally important. A $100,000.00 Original UPB is represented as 0000100000.00 by Fannie Mae, while Ginnie Mae represents it as 00010000000. This formatting difference presents two significant issues: 1) the obvious order-of-magnitude problem—Fannie Mae reports the balance in dollars, while Ginnie Mae reports it in cents—and 2) the risk of overlooking the problem entirely, because both values can easily be converted to a plausible dollar amount. The second issue is the more concerning of the two, because a structurally sound E-T-L, even one that declares data types (as is best practice), will miss it and pass inaccurate data to downstream processes (in the form of $10,000,000 mortgage loans).

Reasonableness tests for each attribute (written by business analysts) are a common way of alerting E-T-L users to potential issues. RiskSpan has supplemented these reasonableness tests with an algorithm that detects and automatically corrects order-of-magnitude issues like this one. The algorithm consists of two parts: it compares the data it receives against an expectation for the field, and it maintains that expectation itself, updating it with each successive dataset it processes. In this way the system actually learns what each field should look like in terms of order of magnitude and improves its handling of data over time. The algorithm's author will explain how it learns and expand on this Original UPB example in a forthcoming blog post.
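
The sketch below illustrates the general idea under some simplifying assumptions: each field's expected order of magnitude is tracked as a running value, an incoming column that is off by whole powers of ten is rescaled, and the expectation is then updated. The class name, the median-based check, and the blending rule are all illustrative assumptions; this is not the algorithm the forthcoming post will describe.

```python
import math
from statistics import median

class MagnitudeChecker:
    """Illustrative per-field order-of-magnitude check that updates its
    own expectation as each new dataset is processed."""

    def __init__(self):
        # expected order of magnitude (power of ten) per field
        self.expected = {}

    def check_and_correct(self, field, values):
        numeric = [float(v) for v in values if str(v).strip()]
        observed = median(math.floor(math.log10(abs(v))) for v in numeric if v != 0)

        if field in self.expected:
            shift = observed - self.expected[field]
            # Part 1: compare against the learned expectation and rescale
            # when the whole column is off by whole powers of ten.
            if shift != 0:
                numeric = [v / 10 ** shift for v in numeric]
                observed -= shift

        # Part 2: update the expectation with the (corrected) dataset,
        # blending old and new so the system keeps learning over time.
        prior = self.expected.get(field, observed)
        self.expected[field] = round(0.9 * prior + 0.1 * observed)
        return numeric

checker = MagnitudeChecker()
# First dataset establishes the expectation: balances around 10^5 dollars.
checker.check_and_correct("original_upb", ["0000100000.00", "0000250000.00"])
# Second dataset arrives in cents (10^7); the checker rescales it to dollars.
print(checker.check_and_correct("original_upb", ["00010000000", "00025000000"]))
# -> [100000.0, 250000.0]
```

Run against the dollar-denominated balances first and the cent-denominated balances second, the sketch rescales the second column back to dollars rather than passing $10,000,000 loans downstream.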

Problem 3: Documentation

Documenting the complexity described above can be a project in itself. Onboarding new data to Edge creates an increasingly complex network of data sources and destinations, all of which must be meticulously accounted for. To track this web, we have deployed a graph database. Both queryable and interactive, the graph database is the perfect tool for tracking an interconnected system, and it doubles as a user-friendly data dictionary. In a forthcoming post, the graph database's author will explain the structure of the graph and present examples of how to extract information from it.
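
As a simple illustration of the idea (using the networkx library in place of an actual graph database product, and with made-up node and field names), a lineage graph can link each raw source field to the normalized Edge attribute it feeds and record the transformation applied along the way:

```python
import networkx as nx

# Minimal sketch of a graph-shaped data dictionary: raw source fields point
# to the normalized attribute they feed, with the transformation noted on
# the edge. Node and field names are hypothetical, not the actual schema.
lineage = nx.DiGraph()
lineage.add_node("fnma.orig_upb", source="Fannie Mae", unit="dollars")
lineage.add_node("gnma.orig_upb", source="Ginnie Mae", unit="cents")
lineage.add_node("edge.original_upb", description="Original unpaid principal balance")

lineage.add_edge("fnma.orig_upb", "edge.original_upb", transform="cast to decimal")
lineage.add_edge("gnma.orig_upb", "edge.original_upb", transform="divide by 100")

# Query the graph like a data dictionary: where does a normalized field come
# from, and what was done to each source on the way in?
target = "edge.original_upb"
for src in lineage.predecessors(target):
    rule = lineage.edges[src, target]["transform"]
    print(f"{src} -> {target}: {rule}")
```

Because the mapping is stored as a graph rather than a flat spreadsheet, questions like "which sources feed this attribute?" or "what happens to this field on the way in?" become one-line traversals.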