RiskSpan, Author at RiskSpan

Big Companies; Big Data Issues

Data issues plague organizations of all sorts and sizes. But generally, the bigger the dataset, and the more transformations the data goes through, the greater the likelihood of problems. Organizations take in data from many different sources, including social media, third-party vendors and other structured and unstructured origins, resulting in massive and complex data storage and management challenges. This post presents ideas to keep in mind when seeking to address these.

First, a couple of definitions:

Data quality generally refers to the fitness of a dataset for its purpose in a given context. Data quality encompasses many related aspects, including:

Accuracy,
Completeness,
Update status,
Relevance,
Consistency across data sources,
Reliability,
Appropriateness of presentation, and
Accessibility

Data lineage tracks data movement, including its origin and where it moves over time. Data lineage can be represented visually to depict how data flows from its source to its destination via various changes and hops.
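As a minimal sketch of this idea, lineage can be held as a directed graph in which each dataset points to the datasets it was derived from. The dataset names and the `trace_sources` helper below are hypothetical and serve only as illustration.

```python
from collections import deque

# Hypothetical lineage map: each dataset lists the datasets it is derived from.
UPSTREAM = {
    "regulatory_report": ["loan_mart"],
    "loan_mart": ["servicing_extract", "origination_extract"],
    "servicing_extract": [],
    "origination_extract": [],
}


def trace_sources(dataset, upstream):
    """Return every dataset that feeds `dataset`, nearest hop first."""
    seen = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


print(trace_sources("regulatory_report", UPSTREAM))
```

Walking the graph in the other direction (from a source to everything downstream of it) would answer the complementary question of which reports a bad source record can reach.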

The challenges facing many organizations relate to both data quality and data lineage issues, and a considerable amount of time and effort is spent both in tracing the source of data (i.e., its lineage) and correcting errors (i.e., ensuring its quality). Business intelligence and data visualization tools can do a magnificent job of teasing stories out of data, but these stories are only valuable when they are true. It is becoming increasingly vital to adopt best practices to ensure that the massive amounts of data feeding downstream processes and presentation engines are both reliable and properly understood.

Financial institutions must frequently deal with disparate systems either because of mergers and acquisitions or in order to support different product types—consumer lending, commercial banking and credit cards, for example. Disparate systems tend to result in data silos, and substantial time and effort must go into providing compliance reports and meeting the various regulatory requirements associated with analyzing data provenance (from source to destination). Understanding the workflow of data and access controls around security are also vital applications of data lineage and help ensure data quality.

In addition to the obvious need for financial reporting accuracy, maintaining data lineage and quality is vital to identifying redundant business rules and data and to ensuring that reliable, analyzable data is constantly available and accessible. It also helps to improve the data governance ecosystem, enabling data owners to focus on gleaning business insights from their data rather than on rectifying data issues.

Common Data Lineage Issues

A surprising number of data issues emerge simply from uncertainty surrounding a dataset’s provenance. Many of the most common data issues stem from one or more of the following categories:

Human error: “Fat fingering” is just the tip of the iceberg. Misinterpretation and other issues arising from human intervention are at the heart of virtually all data issues.
Incomplete data: Whether it’s drawing conclusions based on incomplete data or relying on generalizations and judgment to fill in the gaps, many data issues are caused by missing data.
Data format: Systems expect to receive data in a certain format. Issues arise when the actual input data departs from these expectations.
Data consolidation: Migrating data from legacy systems or attempting to integrate newly acquired data (from a merger, for instance) frequently leads to post-consolidation issues.
Data processing: Calculation engines, data aggregators, or any other program designed to transform raw data into something more “usable” always run the risk of creating output data with quality issues.

Addressing Issues

Issues relating to data lineage and data quality are best addressed by employing some combination of the following approaches. The specific blend of approaches depends on the types of issues and data in question, but these principles are broadly applicable.

Employing a top-down discovery approach enables data analysts to understand the key business systems and business data models that drive an application. This approach is most effective when logical data models are linked to the physical data and systems.

Creating a rich metadata repository for all the data elements flowing from the source to destination can be an effective way of heading off potential data lineage issues. Because data lineage is dependent on the metadata information, creating a robust repository from the outset often helps preserve data lineage throughout the life cycle.

Imposing useful data quality rules is an important element in establishing a framework in which data is always validated against a set of well-conceived business rules. Ensuring not only that data passes comprehensive rule sets but also that remediation factors are in place for appropriately dealing with data that fails quality control checks is crucial for ensuring end-to-end data quality.
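A rule framework of this kind might look something like the sketch below. The two rules, the field names, and the `validate` helper are assumptions made for illustration; a real rule set would come from the business. The point is only that failing records are routed to a remediation queue together with the rules they broke, rather than being silently dropped.

```python
# Hypothetical rules: each maps a rule name to a check on one loan record.
RULES = {
    "balance_is_positive": lambda rec: rec.get("balance") is not None and rec["balance"] > 0,
    "state_is_two_letters": lambda rec: isinstance(rec.get("state"), str) and len(rec["state"]) == 2,
}


def validate(records, rules):
    """Split records into those passing every rule and those needing remediation."""
    passed, failed = [], []
    for rec in records:
        broken = [name for name, check in rules.items() if not check(rec)]
        if broken:
            failed.append({"record": rec, "failed_rules": broken})
        else:
            passed.append(rec)
    return passed, failed


# Made-up records: the second one breaks both rules.
records = [
    {"loan_id": 1, "balance": 250000.0, "state": "TX"},
    {"loan_id": 2, "balance": None, "state": "Texas"},
]
passed, failed = validate(records, RULES)
print(len(passed), "passed;", [(f["record"]["loan_id"], f["failed_rules"]) for f in failed])
```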

Data lineage and data quality both require continuous monitoring by a defined stewardship council to ensure that data owners are taking appropriate steps to understand and manage the idiosyncrasies of the datasets they oversee.

Our Data Lineage and Data Quality Background

RiskSpan’s diverse client base includes several large banks (which we define as banks with assets totaling in excess of $50 billion). Large banks are characterized by a complicated web of departments and sub-organizations, each offering multiple products, sometimes to the same base of customers. Different sub-organizations frequently rely on disparate systems (sometimes due to mergers/acquisitions; sometimes just because they develop their businesses independently of one another). Either way, data silos inevitably result.

RiskSpan has worked closely with chief data officers of large banks to help establish data stewardship teams charged with taking ownership of the various “areas” of data within the bank. This involves the identification of data “curators” within each line of business to coordinate with the CDO’s office and be the advocate (and ultimately the responsible party) for the data they “own.” In best practice scenarios, a “data curator” group is formed to facilitate collaboration and effective communication for data work across the line of business.

We have found that a combination of top-down and bottom-up data discovery approaches is most effective when working across stakeholders to understand existing systems and enterprise data assets. RiskSpan has helped create logical data flow diagrams (based on the top-down approach) and assisted with linking physical data models to the logical data models. We have found Informatica and Collibra tools to be particularly useful in creating data lineage, tracking data owners, and tracing data flow from source to destination.

Complementing our work with financial clients to devise LOB-based data quality rules, we have built data quality dashboards using these same tools to enable data owners and curators to rectify and monitor data quality issues. These projects typically include elements of the following components.

Initial assessment review of the current data landscape.
Establishment of a logical data flow model using both top-down and bottom-up data discovery approaches.
Coordination with the CDO / CIO office to set up a data governance stewardship team and to identify data owners and curators from all parts of the organization.
Delineation of data policies, data rules and controls associated with different consumers of the data.
Development of a target state model for data lineage and data quality by outlining the process changes from a business perspective.
Development of future-state data architecture and associated technology tools for implementing data lineage and data quality.
Invitation to client stakeholders to reach a consensus related to future-state model and technology architecture.
Creation of a project team to execute data lineage and data quality projects by incorporating the appropriate resources and client stakeholders.
Development of a change management and migration strategy to enable users and stakeholders to use data lineage and data quality tools.

Ensuring data quality and lineage is ultimately the responsibility of business lines that own and use the data. Because “data management” is not the principal aim of most businesses, it often behooves them to leverage the principles outlined in this post (sometimes along with outside assistance) to implement tactics that will help ensure that the stories their data tell are reliable.

Houston Strong: Communities Recover from Hurricanes. Do Mortgages?

The 2017 hurricane season devastated individual lives, communities, and entire regions. As one would expect, dramatic increases in mortgage delinquencies accompanied these events. But the subsequent recoveries are a testament both to the resilience of the people living in these areas and to relief mechanisms put into place by the mortgage holders.

Now, nearly a year later, we wanted to see what the credit-risk transfer data (as reported by Fannie Mae CAS and Freddie Mac STACR) could tell us about how these borrowers’ mortgage payments are coming along.

The timing of the hurricanes’ impact on mortgage payments can be approximated by identifying when Current-to-30 days past due (DPD) roll rates began to spike. Barring other major macroeconomic events, we can reasonably assume that most of this increase is directly due to hurricane-related complications for the borrowers.
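As a rough sketch of the roll-rate calculation, assume each month's data reduces to a mapping from loan ID to delinquency status. The status codes, loan IDs, and function name below are made up for illustration and are not the layout of the CAS or STACR files. Loans that were Current last month but are missing this month (for example, because they prepaid) stay in the denominator under this simplification.

```python
def current_to_30_roll_rate(prior_status, current_status):
    """Share of loans that were Current last month and are 30 DPD this month.

    Both arguments map loan_id -> status ("C" for Current, "30" for 30 DPD, ...).
    """
    was_current = [lid for lid, status in prior_status.items() if status == "C"]
    if not was_current:
        return 0.0
    rolled = sum(1 for lid in was_current if current_status.get(lid) == "30")
    return rolled / len(was_current)


# Made-up statuses for five loans in two consecutive months.
august = {"A": "C", "B": "C", "C": "C", "D": "C", "E": "30"}
september = {"A": "C", "B": "30", "C": "30", "D": "C", "E": "60"}

# Loans B and C rolled, out of the four loans that were Current in August.
print(current_to_30_roll_rate(august, september))
```

Computing this rate month by month for each geography and looking for the month in which it jumps is the approximation described above.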

The effect of the hurricanes is clear—Puerto Rico, the U.S. Virgin Islands, and Houston all experienced delinquency spikes in September. Puerto Rico and the Virgin Islands then experienced a second wave of delinquencies in October due to Hurricanes Irma and Maria.

But what has been happening to these loans since entering delinquency? Have they been getting further delinquent and eventually defaulting, or are they curing? We focus our attention on loans in Houston (specifically the Houston-The Woodlands-Sugar Land Metropolitan Statistical Area) and Puerto Rico because of the large number of observable mortgages in those areas.

First, we look at Houston. Because the 30-DPD peak was in September, we track that bucket of loans. To help us understand the path 30-DPD loans might reasonably be expected to take, we compared the Houston delinquencies to 30-DPD loans in the 48 states other than Texas and Florida.

Of this group of loans in Houston that were 30 DPD in September, we see that while many go on to be 60+ DPD in October, over time this cohort is decreasing in size.

Recovery is slower than for the non-hurricane-affected U.S. loans, but persistent. The biggest difference is that a significant number of 30-day delinquencies in the rest of the country continue to hover at 30 DPD (rather than curing or progressing to 60 DPD), while the Houston cohort is more evenly split between the growing number of loans that cure and the shrinking number of loans progressing to 60+ DPD.

Puerto Rico (which experienced its 30 DPD peak in October) shows a similar trend:

To examine loans even more affected by the hurricanes, we can perform the same analysis on loans that reached 60 DPD status.

Here, Houston’s peak is in October while Puerto Rico’s is in November.

Houston vs. the non-hurricane-affected U.S.:

Puerto Rico vs. the non-hurricane-affected U.S.:

In both Houston and Puerto Rico, we see a relatively small 30-DPD cohort across all months and a growing Current cohort. This indicates many people paying their way to Current from 60+ DPD status. Compare this to the rest of the US where more people pay off just enough to become 30 DPD, but not enough to become Current.

The lack of defaults in post-hurricane Houston and Puerto Rico can be explained by several relief mechanisms Fannie Mae and Freddie Mac have in place. Chiefly, disaster forbearance gives borrowers some breathing room with regard to payment. The difference is even more striking among loans that were 90 days delinquent, where eventual default is not uncommon in the non-hurricane-affected U.S. grouping:

And so, both 30-DPD and 60-DPD loans in Houston and Puerto Rico proceed to more serious levels of delinquency at a much lower rate than similarly delinquent loans in the rest of the U.S. To see if this is typical for areas affected by hurricanes of a similar scale, we looked at Fannie Mae loan-level performance data for the New Orleans MSA after Hurricane Katrina in August 2005.

As the following chart illustrates, current-to-30 DPD roll rates peaked in New Orleans in the month following the hurricane:

What happened to these loans?

Here we see a relatively speedy recovery, with large decreases in the number of 60+ DPD loans and a sharp increase in prepayments. Compare this to non-hurricane-affected states over the same period, where the number of 60+ DPD loans held relatively constant and the number of prepayments grew at a noticeably slower rate than in New Orleans.

The remarkable number of prepayments in New Orleans was largely due to flood insurance payouts, which effectively prepay delinquent loans. Government assistance lifted many others back to current. As of March, we do not see this behavior in Houston and Puerto Rico, where recovery is moving much more slowly. Flood insurance incidence rates are known to have been low in both areas, a likely suspect for this discrepancy.

While loans are clearly moving out of delinquency in these areas, it is at a much slower rate than the historical precedent of Hurricane Katrina. In the coming months we can expect securitized mortgages in Houston and Puerto Rico to continue to improve, but getting back to normal will likely take longer than what was observed in New Orleans following Katrina. Of course, the impending 2018 hurricane season may complicate this matter.

—————————————————————————————————————-

Note: The analysis in this blog post was developed using RiskSpan’s Edge Platform. The RiskSpan Edge Platform is a module-based data management, modeling, and predictive analytics software platform for loans and fixed-income securities.

MDM to the Rescue for Financial Institutions

Data becomes an asset only when it is efficiently harnessed and managed. Because firms tend to evolve into silos, their data often gets organized that way as well, resulting in multiple references and unnecessary duplication of data that dilute its value. Master Data Management (MDM) architecture helps to avoid these and other pitfalls by applying best practices to maximize data efficiency, controls, and insights.

MDM has particular appeal to banks and other financial institutions where non-integrated systems often make it difficult to maintain a comprehensive, 360-degree view of a customer who simultaneously has, for example, multiple deposit accounts, a mortgage, and a credit card. MDM provides a single, common data reference across systems that traditionally have not communicated well with each other. Customer-level reports can point to one central database instead of searching for data across multiple sources.
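To make the idea of a single customer reference concrete, here is a minimal sketch that folds rows from three source systems into one master record per customer. The system extracts, field names, and the "first value seen wins" survivorship rule are all assumptions for illustration. Real MDM tools apply far richer matching and survivorship logic.

```python
# Hypothetical extracts from three systems that do not share a schema.
deposits = [{"cust_id": "C1", "name": "Ana Diaz", "deposit_balance": 12000.0}]
mortgages = [{"cust_id": "C1", "name": "Ana Diaz", "mortgage_upb": 310000.0}]
cards = [{"cust_id": "C1", "name": "A. Diaz", "card_limit": 8000.0}]


def build_master(*sources):
    """Fold source rows into one record per customer; the first value seen for a field wins."""
    master = {}
    for rows in sources:
        for row in rows:
            record = master.setdefault(row["cust_id"], {})
            for field, value in row.items():
                record.setdefault(field, value)
    return master


master = build_master(deposits, mortgages, cards)
print(master["C1"])
```

A customer-level report can then read `master["C1"]` instead of querying each system separately, which is the 360-degree view described above.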

Financial institutions also derive considerable benefit from MDM when seeking to comply with regulatory reporting requirements and when generating reports for auditors and other examiners. Mobile banking and the growing number of new payment mechanisms make it increasingly important for financial institutions to have a central source of data intelligence. An MDM strategy enables financial institutions to harness their data and generate more meaningful insights from it by:

Eliminating data redundancy and providing one central repository for common data;
Cutting across data “silos” (and different versions of the same data) by providing a single source of truth;
Streamlining compliance reporting (through the use of a common data source);
Increasing operational and business efficiency;
Providing robust tools to secure and encrypt sensitive data;
Providing a comprehensive 360-degree view of customer data;
Fostering data quality and reducing the risks associated with stale or inaccurate data; and
Reducing operating costs associated with data management.

Not surprisingly, there’s a lot to think about when contemplating and implementing a new MDM solution. In this post, we lay out some of the most important things for financial institutions to keep in mind.

MDM Choice and Implementation Priorities

MDM is only as good as the data it can see. To this end, the first step is to ensure that all of the institution’s data owners are on board. Obtaining management buy-in to the process and involving all relevant stakeholders is critical to developing a viable solution. This includes ensuring that everyone is “speaking the same language”—that everyone understands the benefits related to MDM in the same way—and establishing shared goals across the different business units.

Once all the relevant parties are on board, it’s important to identify the scope of the business process within the organization that needs data refinement through MDM. Assess the current state of data quality (including any known data issues) within the process area. Then, identify all master data assets related to the process improvement. This generally involves identifying all necessary data integration for systems of record and the respective subscribing systems that would benefit from MDM’s consistent data. The selected MDM solution should be sufficiently flexible and versatile that it can govern and link any sharable enterprise data and connect to any business domain, including reference data, metadata and any hierarchies.

An MDM “stewardship team” can add value to the process by taking ownership of the various areas within the MDM implementation plan. MDM is not just about technology; it also involves business and analytical thinking around grouping data for efficient usage. Members of this team need to have the requisite business and technical acumen in order for MDM implementation to be successful. Ideally this team would be responsible for identifying data commonalities across groups and laying out a plan for consolidating them. Understanding the extent of these commonalities helps to optimize architecture-related decisions.

Architecture-related decisions are also a function of how the data is currently stored. Data stored in heterogeneous legacy systems calls for a different sort of MDM solution than does a modern data lake architecture housing big data. The solutions should be sufficiently flexible and scalable to support future growth. Many tools in the marketplace offer MDM solutions. Landing on the right tool requires a fair amount of due diligence and analysis. The following evaluation criteria are often helpful:

Enterprise Integration: Seamless integration into the existing enterprise set of tools and workflows is an important consideration for an MDM solution. Solutions that require large-scale customization efforts tend to carry additional hidden costs.
Support for Multiple Devices: Because modern enterprise data must be consumable by a variety of devices (e.g., desktop, tablet and mobile), the selected MDM architecture must support each of these platforms and have multi-device access capability.
Cloud and Scalability: With most of today’s technology moving to the cloud, an MDM solution must be able to support a hybrid environment (cloud as well as on-premise). The architecture should be sufficiently scalable to accommodate seasonal and future growth.
Security and Compliance: With cyber-attacks becoming more prevalent and compliance and regulatory requirements continuing to proliferate, the MDM architecture must demonstrate capabilities in these areas.

Start Small; Build Gradually; Measure Success

MDM implementation can be segmented into small, logical projects based on business units or departments within an organization. Ideally, these projects should be prioritized so that quick wins (with obvious ROI) are achieved in problem areas first, with the effort then scaling outward to other parts of the organization. This sort of stepwise approach may take longer overall but is ultimately more likely to be successful because it demonstrates success early and gives stakeholders confidence about MDM’s benefits.

The success of smaller implementations is easier to measure and see. A small-scale implementation also provides immediate feedback on the technology tool used for MDM—whether it’s fulfilling the needs as envisioned. The larger the implementation, the longer it takes to know whether the process is succeeding or failing and whether alternative tools should be pursued and adopted. The success of the implementation can be measured using the following criteria:

Savings on data storage—a result of eliminating data redundancy.
Increased ease of data access/search by downstream data consumers.
Enhanced data quality—a result of common data centralization.
More compact data lineage across the enterprise—a result of standardizing data in one place.

Practical Case Studies

RiskSpan has helped several large banks consolidate multiple data stores across different lines of business. Our MDM professionals work across heterogeneous data sets and teams to create a common reference data architecture that eliminates data duplication, thereby improving data efficiency. These professionals have accomplished this using a variety of technologies, including Informatica, Collibra and IBM InfoSphere.

Any successful project begins with a survey of the current data landscape and an assessment of existing solutions. Working collaboratively to use this information to form the basis of an approach for implementing a best-practice MDM strategy is the most likely path to success.

Making Data Dictionaries Beautiful Using Graph Databases

Most analysts estimate that for a given project well over half of the time is spent on collecting, transforming, and cleaning data in preparation for analysis. This task is generally regarded as one of the least appetizing portions of the data analysis process and yet it is the most crucial, as trustworthy analyses are born out of clean, reliable data. Gathering and preparing data for analysis can be either enhanced or hindered by the data management practices in place at a firm. When data are readily available, clearly defined, and well documented, they lead to faster and higher-quality insights. As the size and variability of data grow, however, so too does the challenge of storing and managing them.

Like many firms, RiskSpan manages a multitude of large, complex datasets with varying degrees of similarity and connectedness. To streamline the analysis process and improve the quantity and quality of our insights, we have made our datasets, their attributes, and relationships transparent and quickly accessible using graph database technology.

Graph databases differ significantly from traditional relational databases because data are not stored in tables. Instead, data are stored in either a node or a relationship (also called an edge), which is a connection between two nodes. The image below contains a grey node labeled as a dataset and a blue node labeled as a column. The line connecting these two nodes is a relationship which, in this instance, signifies that the dataset contains the column.

Graph 1

There are many advantages to this data structure, including decreased redundancy. Rather than storing the same “Column1” in multiple tables for each dataset that contains it (as you would in a relational database), you can simply create additional relationships from each dataset to the same column node, as demonstrated below:

Graph 2

With this flexible structure it is possible to create complex schemas that remain visually intuitive. In the image below the same grey (dataset) -contains-> blue (column) format is displayed for a large collection of datasets and columns. Even at such a high level, the relationships between datasets and columns reveal patterns about the data. Here are three quick observations:

In the top right corner there is a dataset with many unique columns.
There are two datasets that share many columns between them and have limited connectivity to the other datasets.
Many ubiquitous columns have been pulled to the center of the star pattern via the relationships to the multiple datasets on the outer rim.

Graph 3

In addition to containing labels, nodes can store data as key-value pairs. The image below displays the column “orig_upb” from dataset “FNMA_LLP”, which is one of Fannie Mae’s public datasets that is available on RiskSpan’s Edge Platform. Hovering over the column node displays some information about it, including the name of the field in the RiskSpan Edge Platform, its column type, format, and data type.

Graph 4

Relationships can also store data in the same key-value format. This is an incredibly useful property which, for the database in this example, can be used to store information specific to a dataset and its relationship to a column. One of the ways in which RiskSpan has utilized this capability is to hold information pertinent to data normalization in the relationships. To make our datasets easier to analyze and combine, we have normalized the formats and values of columns found in multiple datasets. For example, the field “loan_channel” has been mapped from many unique inputs across datasets to a set of standardized values. In the images below, the relationships between two datasets and loan_channel are highlighted. The relationship key-value pairs contain a list of “mapped_values” identifying the initial values from the raw data that have been transformed. The dataset on the left contains the list: [“BROKER”, “CORRESPONDENT”, “RETAIL”]

Graph 5

While the dataset on the right contains: [“R”, “B”, “C”, “T”, “9”]

Graph 6

We can easily merge these lists with a node containing a map of all the recognized enumerations for the field. This central repository of truth allows us to deploy easy and robust changes to the ETL processes for all datasets. It also allows analysts to easily query information related to data availability, formats, and values.

Graph 7

In addition to queries specific to a column, this structure allows an analyst to answer questions about data availability across datasets with ease. Normally, comparing PDF data dictionaries, Excel worksheets, or database tables can be a painstaking process. Using the graph database, however, a simple query can return the intersection of three datasets as shown below. The resulting graph is easy to analyze and use to define the steps required to obtain and manipulate the data.

Graph 8

In addition to these benefits for analysts and end users, utilizing graph database technology for data management comes with benefits from a data governance perspective. Within the realm of data stewardship, ownership and accountability of datasets can be assigned and managed within a graph database like the one in this blog. The ability to store any attribute in a node and create any desired relationship makes it simple to add nodes representing data owners and curators connected to their respective datasets.

Graph 9

The ease and transparency with which any data-related information can be stored makes graph databases very attractive. Graph databases can also support a very large number of nodes and relationships while remaining fast. While every technology has a learning curve, the intuitive nature of graphs combined with their flexibility makes them an intriguing and viable option for data management.
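The structure described in this post can be imitated with a plain in-memory mapping, which may help readers who have not worked with a graph database. This is only a stand-in: an actual graph database would express the same questions in its own query language. The dataset names (`dataset_a`, `dataset_b`, `dataset_c`) and the `fico` column are hypothetical; `loan_channel`, `orig_upb`, and the two `mapped_values` lists come from the example above.

```python
# Each (dataset, column) key is one "contains" relationship; its value holds
# the key-value properties stored on that relationship.
CONTAINS = {
    ("dataset_a", "loan_channel"): {"mapped_values": ["BROKER", "CORRESPONDENT", "RETAIL"]},
    ("dataset_b", "loan_channel"): {"mapped_values": ["R", "B", "C", "T", "9"]},
    ("dataset_a", "orig_upb"): {},
    ("dataset_b", "orig_upb"): {},
    ("dataset_c", "orig_upb"): {},
    ("dataset_c", "fico"): {},
}


def columns_of(dataset):
    """Column nodes reachable from a dataset node."""
    return {col for (ds, col) in CONTAINS if ds == dataset}


def shared_columns(*datasets):
    """Columns connected to every one of the given datasets."""
    return set.intersection(*(columns_of(ds) for ds in datasets))


def raw_values(column):
    """All raw values recorded for a column across its relationships."""
    values = set()
    for (ds, col), props in CONTAINS.items():
        if col == column:
            values.update(props.get("mapped_values", []))
    return values


print(shared_columns("dataset_a", "dataset_b", "dataset_c"))
print(sorted(raw_values("loan_channel")))
```

Note that each column appears once as a node no matter how many datasets contain it, which is the reduced redundancy discussed earlier.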

Applying Machine Learning to Conventional Model Validations

In addition to transforming the way in which financial institutions approach predictive modeling, machine learning techniques are beginning to find their way into how model validators assess conventional, non-machine-learning predictive models. While the array of standard statistical techniques available for validating predictive models remains impressive, the advent of machine learning technology has opened new avenues of possibility for expanding the rigor and depth of insight that can be gained in the course of model validation. In this blog post, we explore how machine learning, in some circumstances, can supplement a model validator’s efforts related to:

Outlier detection on model estimation data
Clustering of data to better understand model accuracy
Feature selection methods to determine the appropriateness of independent variables
The use of machine learning algorithms for benchmarking
Machine learning techniques for sensitivity analysis and stress testing

Outlier Detection

Conventional model validations include, when practical, an assessment of the dataset from which the model is derived. (This is not always practical—or even possible—when it comes to proprietary, third-party vendor models.) Regardless of a model’s design and purpose, virtually every validation concerns itself with at least a cursory review of where these data are coming from, whether their source is reliable, how they are aggregated, and how they figure into the analysis.

Conventional model validation techniques sometimes overlook (or fail to look deeply enough at) the question of whether the data population used to estimate the model is problematic. Outliers—and the effect they may be having on model estimation—can be difficult to detect using conventional means. Developing descriptive statistics and identifying data points that are one, two, or three standard deviations from the mean (i.e., extreme value analysis) is a straightforward enough exercise, but this does not necessarily tell a modeler (or a model validator) which data points should be excluded.
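For reference, the extreme value analysis described above takes only a few lines. The sketch below uses made-up loan-to-value figures and a hypothetical `flag_extremes` helper. It flags the points, but, as noted, it cannot say whether a flagged point is a data error or a legitimate observation.

```python
from statistics import mean, pstdev


def flag_extremes(values, threshold=3.0):
    """Return (index, value, z_score) for points more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [
        (i, v, (v - mu) / sigma)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Made-up loan-to-value ratios (in percent) with one suspicious entry.
ltv = [78, 80, 75, 82, 79, 81, 77, 80, 76, 83, 79, 78, 450]
flagged = flag_extremes(ltv, threshold=3.0)
print([(i, v) for i, v, _ in flagged])
```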

Machine learning modelers use a variety of proximity and projection methods for filtering outliers from their training data. One proximity method employs the K-means algorithm, which groups data into clusters centered around defined “centroids,” and then identifies data points that do not appear to belong to any particular cluster. Common projection methods include multi-dimensional scaling, which allows analysts to view multi-dimensional relationships among multiple data points in just two or three dimensions. Sophisticated model validators can apply these techniques to identify dataset problems that modelers may have overlooked.
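A minimal sketch of the proximity approach follows, written in plain Python so that each step is visible. The data points, the explicit starting centroids, and the "three times the median distance" cutoff are all assumptions chosen for illustration. A validator would more likely use a library implementation and a cutoff suited to the data.

```python
from math import dist
from statistics import median


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; returns final centroids and each point's cluster index."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


def distance_outliers(points, centroids, labels, multiple=3.0):
    """Flag points whose distance to their own centroid exceeds `multiple` times the median distance."""
    distances = [dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    cutoff = multiple * median(distances)
    return [p for p, d in zip(points, distances) if d > cutoff]


# Made-up two-feature observations: two tight groups plus one stray point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8),
    (8.0, 8.0), (8.2, 7.9), (7.8, 8.1), (8.1, 8.2), (7.9, 7.8),
    (1.0, 6.0),
]
centroids, labels = kmeans(points, centroids=[points[0], points[5]])
print(distance_outliers(points, centroids, labels))
```

The stray point is still assigned to a cluster and even pulls that cluster's centroid toward itself, which is why the flag is based on distance from the centroid rather than on cluster membership alone.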

Data Clustering

The tendency of data to cluster presents another opportunity for model validators. Machine learning techniques can be applied to determine the relative compactness of individual clusters and how distinct individual clusters are from one another. Clusters that do not appear well defined and blur into one another are evidence of a potentially problematic dataset—one that may result in non-existent patterns being identified in random data. Such clustering could be the basis of any number of model validation findings.
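One common way to put a number on compactness and separation is the silhouette coefficient. It is offered here as one possible measure; the post does not name a specific metric. The sketch below assumes at least two clusters and uses made-up points.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient: near 1 for compact, well-separated clusters; near 0 or below for blurred ones."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two well-separated groups versus two groups interleaved along a line.
tight = silhouette([(0, 0), (0, 1), (10, 10), (10, 11)], [0, 0, 1, 1])
blurred = silhouette([(0, 0), (1, 0), (2, 0), (3, 0)], [0, 1, 0, 1])
print(round(tight, 3), round(blurred, 3))
```

A low or negative score on the model's estimation data would be a prompt for further investigation, not a finding in itself.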

Feature (Variable) Selection

What conventional predictive modelers typically refer to as variables are commonly referred to by machine learning modelers as features. Features and variables serve essentially the same function, but the way in which they are selected can differ. Conventional modelers tend to select variables using a combination of expert judgment and statistical techniques. Machine learning modelers tend to take a more systematic approach that includes stepwise procedures, criterion-based procedures, lasso and ridge regression and dimensionality reduction. These methods are designed to ensure that machine learning models achieve their objectives in the simplest way possible, using the fewest possible number of features, and avoiding redundancy. Because model validators frequently encounter black-box applications, directly applying these techniques is not always possible. In some limited circumstances, however, model validators can add to the robustness of their validations by applying machine learning feature selection methods to determine whether conventionally selected model variables resemble those selected by these more advanced means (and if not, why not).
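To illustrate how the lasso performs selection, here is a small coordinate-descent sketch. It is deliberately simplified: there is no intercept, the toy design matrix is orthogonal, and the feature names and penalty value are made up. The behavior to notice is that a weakly related feature is shrunk to exactly zero, meaning the lasso would not have selected it.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`, returning exactly 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(X, y, penalty, sweeps=200):
    """Coordinate-descent lasso for (1/2n)*||y - Xb||^2 + penalty*||b||_1, with no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of feature j with the residual that excludes its own contribution.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale
    return beta


# Toy data: y = 2*x0 - 1*x1 + 0.05*x2, so the third feature barely matters.
names = ["feature_a", "feature_b", "feature_c"]  # hypothetical labels
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [1.05, 2.95, -3.05, -0.95]
beta = lasso(X, y, penalty=0.1)
print({name: round(b, 3) for name, b in zip(names, beta)})
```

A validator could compare the set of features with non-zero coefficients against the variables the conventional model actually uses and ask about any differences.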

Benchmarking Applications

Identifying and applying an appropriate benchmarking model can be challenging for model validators. Commercially available alternatives are often difficult to obtain cost-effectively, and building challenger models from scratch can be time-consuming and problematic, particularly when all they do is replicate what the model in question is doing.

While not always feasible, building a machine learning model using the same data that was used to build a conventionally designed predictive model presents a “gold standard” benchmarking opportunity for assessing the conventionally developed model’s outputs. Where significant differences are noted, model validators can investigate the extent to which differences are driven by data/outlier omission, feature/variable selection, or other factors.

Sensitivity Analysis and Stress Testing

The sheer quantity of high-dimensional data very large banks need to process in order to develop their stress testing models makes conventional statistical analysis both computationally expensive and problematic. (This is sometimes referred to as the “curse of dimensionality.”) Machine learning feature selection techniques, described above, are frequently useful in determining whether variables selected for stress testing models are justifiable.

Similarly, machine learning techniques can be employed to isolate, in a systematic way, those variables to which any predictive model is most and least sensitive. Model validators can use this information to quickly ascertain whether these sensitivities are appropriate. A validator, for example, may want to take a closer look at a credit model that is revealed to be more sensitive to, say, zip code, than it is to credit score, debt-to-income ratio, loan-to-value ratio, or any other individual variable or combination of variables. Machine learning techniques make it possible for a model validator to assess a model’s relative sensitivity to virtually any combination of features and make appropriate judgments.
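A very simple version of this idea is a one-at-a-time shock: raise each input by a fixed percentage and record how far the model output moves. The sketch below assumes a hypothetical logistic scorecard whose coefficients are invented purely for illustration. Real machine learning sensitivity tools are more elaborate, but the ranking logic is similar.

```python
import math

def toy_default_probability(x):
    """Hypothetical logistic scorecard; every coefficient is invented."""
    z = 4.0 - 0.01 * x["credit_score"] + 0.03 * x["dti"] + 0.02 * x["ltv"]
    return 1.0 / (1.0 + math.exp(-z))

def sensitivities(model, baseline, bump=0.10):
    """Change in model output when each input is raised by `bump` (10%)."""
    base = model(baseline)
    result = {}
    for name, value in baseline.items():
        shocked = dict(baseline, **{name: value * (1.0 + bump)})
        result[name] = model(shocked) - base
    return result

baseline = {"credit_score": 700.0, "dti": 35.0, "ltv": 80.0}
ranked = sorted(sensitivities(toy_default_probability, baseline).items(),
                key=lambda kv: abs(kv[1]), reverse=True)
for name, delta in ranked:
    print(f"{name:>12}: {delta:+.4f}")
```

A validator would then ask whether the resulting ranking is defensible from a business perspective.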

————————–

Model validators have many tools at their disposal for assessing the conceptual soundness, theory, and reliability of conventionally developed predictive models. Machine learning is not a substitute for these, but its techniques offer a variety of ways of supplementing traditional model validation approaches and can provide validators with additional tools for ensuring that models are adequately supported by the data that underlies them.

Applying Model Validation Principles to Machine Learning Models

Machine learning models pose a unique set of challenges to model validators. While exponential increases in the availability of data, computational power, and algorithmic sophistication in recent years have enabled banks and other firms to increasingly derive actionable insights from machine learning methods, the significant complexity of these systems introduces new dimensions of risk.

When appropriately implemented, machine learning models greatly improve the accuracy of predictions that are vital to the risk management decisions financial institutions make. The price of this accuracy, however, is complexity and, at times, a lack of transparency. Consequently, machine learning models must be particularly well maintained and their assumptions thoroughly understood and vetted in order to prevent wildly inaccurate predictions. While maintenance remains primarily the responsibility of the model owner and the first line of defense, second-line model validators increasingly must be able to understand machine learning principles well enough to devise effective challenge that includes:

Analysis of model estimation data to determine the suitability of the machine learning algorithm
Assessment of space and time complexity constraints that inform model training time and scalability
Review of model training/testing procedure
Determination of whether model hyperparameters are appropriate
Calculation of metrics for determining model accuracy and robustness

More than one way exists of organizing these considerations along the three pillars of model validation. Here is how we have come to think about it.

Conceptual Soundness

Many of the concepts of reviewing model theory that govern conventional model validations apply equally well to machine learning models. The question of “business fit” and whether the variables the model lands on are reasonable is just as valid when the variables are selected by a machine as it is when they are selected by a human analyst. Assessing the variable selection process “qualitatively” (does it make sense?) as well as quantitatively (measuring goodness of fit by calculating residual errors, among other tests) takes on particular importance when it comes to machine learning models.

Machine learning does not relieve validators of their responsibility to assess the statistical soundness of a model’s data. Machine learning models are not immune to data issues. Validators protect against these by running routine distribution, collinearity, and related tests on model datasets. They must also ensure that the population has been appropriately and reasonably divided into training and holdout/test datasets.

Supplementing these statistical tests should be a thorough assessment of the modeler’s data preparation procedures. In addition to evaluating the ETL process—a common component of all model validations—effective validations of machine learning models take particular notice of variable “scaling” methods. Scaling is important to machine learning algorithms because they generally do not take units into account. Consequently, a machine learning model that relies on borrower income (generally ranging between tens of thousands and hundreds of thousands of dollars), borrower credit score (which generally falls within a range of a few hundred points) and loan-to-value ratio (expressed as a percentage), needs to apply scaling factors to normalize these ranges in order for the model to correctly process each variable’s relative importance. Validators should ensure that scaling and normalizations are reasonable.
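The effect of units is easy to demonstrate with a distance calculation. In the sketch below, the three borrowers are invented and min-max scaling is just one of several reasonable choices. In raw units, income dominates the distance and borrower A appears most similar to B. After scaling, A is closest to C, whose credit score and LTV are nearly identical to A's.

```python
import math

# Hypothetical borrowers; all values are made up for illustration.
borrowers = {
    "A": {"income": 50_000, "credit_score": 780, "ltv": 60},
    "B": {"income": 52_000, "credit_score": 580, "ltv": 97},
    "C": {"income": 90_000, "credit_score": 775, "ltv": 62},
}
features = ["income", "credit_score", "ltv"]

def min_max_scale(rows, features):
    """Rescale every feature to the 0-1 range observed in rows."""
    lo = {f: min(r[f] for r in rows.values()) for f in features}
    hi = {f: max(r[f] for r in rows.values()) for f in features}
    return {k: {f: (r[f] - lo[f]) / (hi[f] - lo[f]) for f in features}
            for k, r in rows.items()}

def distance(rows, a, b):
    """Euclidean distance between two borrowers across all features."""
    return math.dist([rows[a][f] for f in features],
                     [rows[b][f] for f in features])

scaled = min_max_scale(borrowers, features)
print("raw:   ", distance(borrowers, "A", "B"), distance(borrowers, "A", "C"))
print("scaled:", distance(scaled, "A", "B"), distance(scaled, "A", "C"))
```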

Model assumptions, when it comes to machine learning validation, are most frequently addressed by looking at the selection, optimization, and tuning of the model’s hyperparameters. Validators must determine whether the selection/identification process undertaken by the modeler (be it grid search, random search, Bayesian Optimization, or another method—see this blog post for a concise summary of these) is conceptually sound.

Process Verification

Machine learning models are no more immune to overfitting and underfitting (the bias-variance dilemma) than are conventionally developed predictive models. An overfitted model may perform well on the in-sample data, but predict poorly on the out-of-sample data. Complex nonparametric and nonlinear methods used in machine learning algorithms combined with high computing power are likely to contribute to an overfitted machine learning model. An underfitted model, on the other hand, performs poorly in general, mainly due to an overly simplified model algorithm that does a poor job at interpreting the information contained within data.

Cross-validation is a popular technique for detecting and preventing fitting, or “generalization capability,” issues in machine learning. In K-Fold cross-validation, the training data is partitioned into K subsets. The model is trained on all of the subsets except one, and the held-out subset is used to validate performance; the process is repeated so that each subset serves once as the validation set. The model’s generalization capability is low if the accuracy ratios are consistently low (underfitted) or high on the training set but markedly lower on the validation set (overfitted). Conventional models, such as regression analysis, can be used to benchmark performance.
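The fold mechanics can be sketched as follows. The helper names and the diagnostic thresholds (a 0.70 accuracy floor and a 0.10 train/validation gap) are illustrative assumptions, not standards. In practice the data would be shuffled before folding.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs; each fold validates exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size

def diagnose(train_acc, valid_acc, floor=0.70, gap=0.10):
    """Rule-of-thumb reading of fold accuracies; thresholds are illustrative."""
    if train_acc < floor and valid_acc < floor:
        return "underfitted"
    if train_acc - valid_acc > gap:
        return "overfitted"
    return "acceptable"

folds = list(k_fold_indices(10, 3))
for train, valid in folds:
    print("train:", train, "validate:", valid)
```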

Outcomes Analysis

Outcomes analysis enables validators to verify the appropriateness of the model’s performance measure methods. Performance measures (or “scoring methods”) are typically specialized to the algorithm type, such as classification and clustering. Validators can try different scoring methods to test and understand the model’s performance. Sensitivity analyses can be performed on the algorithms, hyperparameters, and seed parameters. Since there is no right or wrong answer, validators should focus on the dispersion of the sensitivity results.

Many statistical tactics commonly used to validate conventional models apply equally well to machine learning models. One notable omission is the ability to precisely replicate the model’s outputs. Unlike with an OLS or ARIMA model, for which a validator can reasonably expect to be able to match the model’s coefficients exactly if given the same data, machine learning models can be tested only indirectly—by testing the conceptual soundness of the selected features and assumptions (hyperparameters) and by evaluating the process and outputs. Applying model validation tactics specially tailored to machine learning models allows financial institutions to deploy these powerful tools with greater confidence by demonstrating that they are of sound conceptual design and perform as expected.

RiskSpan Director David Andrukonis Featured on The Purposeful Banker Podcast

RiskSpan’s CECL Solution Director David Andrukonis was a featured guest on PrecisionLender’s podcast, The Purposeful Banker, in a recent episode titled “Is your Bank Ready for CECL.”

David summarized the major takeaways from a recent CECL conference, including regulator signals of forthcoming capital relief and emerging practices around reasonable and supportable forecast period length (16:19); outlined how RiskSpan is helping banks prepare for the new accounting standard (3:47); and offered ways that banks can stay current on continuing CECL developments (23:42).

You can listen to the entire episode of the podcast on PrecisionLender’s SoundCloud account.

The Surging Reverse Mortgage Market

Momentum continues to build around reverse mortgages and related products. Persistent growth in both home prices and the senior population has stoked renewed interest and discussion about the most appropriate uses of accumulated home equity in financial planning strategies. A common and superficial way to think of reverse mortgages is as a “last-resort” means of covering expenses when more conventional planning tools prove insufficient. But experts increasingly are not thinking of reverse mortgages in this way. Last week, the American College of Financial Services and the Bipartisan Policy Center hosted the 2018 Housing Wealth in Retirement Symposium. Speakers represented policy research think tanks, institutional asset managers, large banks, and AARP. Notwithstanding the diversity of viewpoints, virtually every speaker reiterated a position that financial planners have posited for years: financial products that leverage home equity should, in many cases, be integrated into comprehensive retirement planning strategies, rather than being reserved as a product of last resort.

Senior Home Equity Continues Trending Upward

The National Reverse Mortgage Lenders Association (NRMLA) and RiskSpan have published the Reverse Mortgage Market Index (RMMI) since the beginning of 2000. The RMMI provides a trending measure of home equity of U.S. homeowners age 62 and older. The RMMI defines senior home equity as the difference between the aggregate value of homes owned and occupied by seniors and the aggregate mortgage balance secured by those homes. This measure enables the RMMI to help gauge the potential market size of those who may be qualified for a reverse mortgage product. The chart below illustrates the steady increase in this index since the end of the 2008 recession. It reached its latest all-time high in the most recent quarter (Q4 2017). Increasing house prices drive this trend, mitigated to some extent by a corresponding modest increase in mortgage debt held by seniors. The most recent RMMI report is published on NRMLA’s website. As summarized below by the Urban Institute, home equity can be extracted through many mechanisms, primarily Federal Housing Administration (FHA)–insured Home Equity Conversion Mortgages (HECMs), closed-end home equity loans, home equity lines of credit (HELOCs), and cash-out refinancing.

Share of Homeowners Who Extracted Home Equity by Strategy

The Urban Institute research goes on to point out that although few seniors have extracted home equity to date, the market is potentially very large (as reflected by the RMMI index) and more extraction is likely in the years ahead as the senior population both grows and ages. The data in the following chart confirm what one might reasonably expect—that younger seniors are more likely to have existing mortgages than older seniors.

Reverse Mortgage as Retirement Planning Tool

Looking at senior home equity in the context of overall net worth lends support to financial planners’ view of products like reverse mortgages as more than something on which to fall back as a last resort. The first three rows of data in the table below contain the median net worth by age cohort in 2013 and 2016, respectively, from the Federal Reserve Board’s Survey of Consumer Finances. The bottom row, highlighted in yellow, is the estimated average senior home equity (total senior home equity as computed by the RMMI divided by senior population) for the same years. We acknowledge the imprecision inherent in this comparison due to the statistical method used (median vs. average) and certain data limitations on RMMI (addressed below). Additionally, the net worth figures may include non-homeowners. Nonetheless, home equity is an unignorably important component of senior net worth.

Following the release of the Federal Reserve’s 2016 Survey of Consumer Finances https://www.federalreserve.gov/econres/scfindex.htm, the Urban Institute published a summary research paper “What the 2016 Survey of Consumer Finances Tells Us about Senior Homeowners” https://www.urban.org/sites/default/files/publication/94526/what-the-2016-survey-of-consumer-finances-tells-us-about-senior-homeowners.pdf in November 2017. The paper notes that “Worries about retirement security are rooted in several factors, such as Social Security changes that shrink the share of preretirement earnings replaced by the program (Munnell and Sundén 2005), rising medical and long-term care costs (Johnson and Mommaerts 2009, 2010), student loan burdens, and the shift from employer-sponsored defined-benefit pension plans that guarantee lifetime income to 401(k)-type defined-contribution plans whose account balances depend on employee contributions and uncertain investment returns (Munnell 2014; Munnell and Sundén 2005). In addition, increased life expectancies require retirement savings to last longer.”

The financial position of seniors is evolving. Forty-one percent of homeowners age 65 and older now have a mortgage on their primary residence, compared with just 21 percent in 1989, and the median outstanding debt has risen from $16,793 to $72,000, according to the Urban Institute. As more households enter retirement with more debt, a growing number will likely tap into their home as a source of income. Hurdles and challenges remain, however, and education will play an important role in fostering responsible use of reverse mortgage products.

Note on the Limitations of RMMI

To calculate the RMMI, an econometric tool is developed to estimate senior housing value, senior mortgage level, and senior equity using data gathered from various public resources such as American Community Survey (ACS), Federal Reserve Flow of Funds (Z.1), and FHFA housing price indexes (HPI). The RMMI is simply the senior equity level at time of measure relative to that of the base quarter in 2000.[1] The main limitation of RMMI is non-consecutive data, such as census population. We use a smoothing approach to estimate data in between the observable periods and continue to look for ways to improve our methodology and find more robust data to improve the precision of the results. Until then, the RMMI and its relative metrics (values, mortgages, home equities) are best analyzed at a trending macro level, rather than at more granular levels, such as MSA.

[1] There was a change in RMMI methodology in Q3 2015 mainly to calibrate senior homeowner population and senior housing values observed in 2013 American Community Survey (ACS).

Machine Learning Detects Model Validation Blind Spots

Machine learning represents the next frontier in model validation—particularly in the credit and prepayment modeling arena. Financial institutions employ numerous models to make predictions relating to MBS performance. Validating these models by assessing their predictions is of paramount importance, but even models that appear to perform well based upon summary statistics can have subsets of input (input subspaces) for which they tend to perform poorly. Isolating these “blind spots” can be challenging using conventional model validation techniques, but recently developed machine learning algorithms are making the job easier and the results more reliable.

High-Error Subspace Visualization

RiskSpan’s modeling team has developed a statistical algorithm which identifies high-error subspaces and flags model outputs corresponding to inputs originating from these subspaces, indicating to model users that the results might be unreliable. An extension to this problem that we also address is whether migration of data points to more error-prone subspaces of the input space over time can be indicative of macroeconomic regime shifts and signal a need to re-estimate the model. This will aid in the prevention of declining model efficacy over time.

Due to the high-dimensional nature of the input spaces of many financial models, traditional statistical methods of partitioning data may prove inadequate. Using machine learning techniques, we have developed a more robust method of high-error subspace identification. We develop the algorithm using loan performance model data, but the method is adaptable to generic models.

Data Selection and Preparation

The dataset we use for our analysis is a random sample of the publicly available Freddie Mac Loan-Level Dataset. The entire dataset covers the monthly loan performance for loans originated from 1999 to 2016 (25.4 million fixed-rate mortgages). From this set, one million loans were randomly sampled. Features of this dataset include loan-to-value ratio, borrower debt-to-income ratio, borrower credit score, interest rate, and loan status, among others. We aggregate the monthly status vectors for each loan into a single vector which contains a loan status time series over the life of the loan within the historical period. This aggregated status vector is mapped to a value of 1 if the time series indicates the loan was ever 90 days delinquent within the first three years after its origination, representing a default, and 0 otherwise. This procedure results in 914,802 total records.
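The mapping from a monthly status series to a default flag can be sketched as below. We assume, for illustration only, that each monthly status is an integer count of missed payments, so that 3 or more corresponds to 90 days delinquent. The loan identifiers and series are invented, and non-numeric status codes would need separate handling.

```python
def default_flag(monthly_status, window_months=36, threshold=3):
    """Return 1 if any status in the first `window_months` reaches
    `threshold` missed payments (read here as 90 days delinquent), else 0."""
    return int(any(s >= threshold for s in monthly_status[:window_months]))

# Hypothetical status vectors, one entry per month since origination.
loans = {
    "L001": [0] * 40,                  # always current
    "L002": [0, 1, 2, 3, 2, 1, 0],     # hits 90 days in month 4
    "L003": [0] * 36 + [1, 2, 3],      # hits 90 days only after year three
}
targets = {loan_id: default_flag(status) for loan_id, status in loans.items()}
print(targets)  # {'L001': 0, 'L002': 1, 'L003': 0}
```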

Algorithm Framework

Using the prepared loan dataset, we estimate a logistic regression loan performance model. The data is sampled and partitioned into training and test datasets for clustering analysis. The model estimation and training data is taken from loans originating in the period from 1999 to 2007, while loans originating in the period from 2008 to 2016 are used for testing. Once the data has been partitioned into training and test sets, a clustering algorithm is run on the training data.

Two-Dimensional Visualization of Select Clusters

The clustering is evaluated based upon its ability to stratify the loan data into clusters that meaningfully identify regions of the input for which the model performs poorly. This requires the average model performance error associated with certain clusters to be substantially higher than the mean. After the training data is assigned to clusters, cluster-level error is computed for each cluster using the logistic regression model. Clusters with high error are flagged based upon a scoring scheme. Each loan in the test set is assigned to a cluster based upon its proximity to the training cluster centers. Loans in the test set that are assigned to flagged clusters are flagged, indicating that the loan comes from a region for which loan performance model predictions exhibit lower accuracy.
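A stripped-down sketch of this flagging logic follows. The cluster centers, data points, and predictions are invented. The scoring scheme shown (flag a cluster when its error exceeds the mean cluster error by more than one standard deviation) is an assumption made for illustration, since the actual scheme is not spelled out here.

```python
import math
from statistics import mean, pstdev

def assign(point, centers):
    """Index of the nearest cluster center."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def cluster_error(points, actual, predicted, centers):
    """Misclassification rate of the model within each cluster."""
    misses = [[] for _ in centers]
    for p, y, y_hat in zip(points, actual, predicted):
        misses[assign(p, centers)].append(int(y != y_hat))
    return [mean(m) if m else 0.0 for m in misses]

def flag_clusters(errors, n_std=1.0):
    """Assumed scoring scheme: error above mean + n_std * std deviation."""
    cutoff = mean(errors) + n_std * pstdev(errors)
    return {i for i, e in enumerate(errors) if e > cutoff}

# Hypothetical training data: three clusters in a two-feature input space.
centers = [(0, 0), (5, 5), (10, 0)]
train_points = [(0, 0.5), (0.5, 0), (-0.2, 0.1), (0.3, 0.3),
                (5, 5.2), (4.8, 5), (5.1, 4.9), (5.3, 5.1),
                (10, 0.2), (9.8, 0), (10.1, -0.1), (10.3, 0.1)]
actual    = [0, 0, 0, 0,  0, 1, 0, 0,  1, 0, 0, 0]
predicted = [0, 0, 0, 0,  0, 0, 0, 0,  0, 1, 1, 0]

errors = cluster_error(train_points, actual, predicted, centers)
flagged = flag_clusters(errors)

# Test loans inherit the flag of the nearest training cluster center.
test_points = [(9.5, 0.4), (0.1, 0.2)]
test_flags = [assign(p, centers) in flagged for p in test_points]
print(errors, flagged, test_flags)
```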

Algorithm Performance Analysis

The clustering algorithm successfully flagged high-error regions of the input space, with flagged test clusters exhibiting accuracy more than one standard deviation below the mean. The high errors associated with clusters flagged during model training were persistent over time, with flagged clusters in the test set having a model accuracy of just 38.7%, compared to an accuracy of 92.1% for unflagged clusters. Failure to address observed high-error clusters in the training set and migration of data to high-error subspaces led to substantially diminished model accuracy, with overall model accuracy dropping from 93.9% in the earlier period to 84.1% in the later period.

Training/Test Cluster Error Comparison

The nature of default misclassifications and the variables with the greatest impact on misclassification were also determined. Cluster FICO scores proved to be a strong indicator of cluster model prediction accuracy. While a relatively large proportion of loans in low-FICO clusters defaulted, the logistic regression model substantially overpredicted the number of defaults for these clusters, leading to a large number of Type I errors (inaccurate default predictions) for these clusters. Type II errors (inaccurate non-default predictions) constituted a smaller proportion of overall model error, and their impact was diminished even further when considering their magnitude relative to the number of true negative predictions (accurate non-default predictions), which far outnumber true positive predictions (accurate default predictions).
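For readers who want to reproduce this kind of breakdown, the four outcome counts can be tallied as in the sketch below, treating default as the positive class. The sample labels are invented. False positives correspond to Type I errors and false negatives to Type II errors.

```python
def confusion_counts(actual, predicted):
    """Tally outcomes with default (1) as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for y, y_hat in zip(actual, predicted):
        if y_hat == 1:
            counts["TP" if y == 1 else "FP"] += 1  # FP = Type I error
        else:
            counts["FN" if y == 1 else "TN"] += 1  # FN = Type II error
    return counts

# Hypothetical outcomes for eight loans (1 = default, 0 = no default).
actual    = [1, 0, 0, 0, 1, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 1, 'TN': 4, 'FP': 2, 'FN': 1}
```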

FICO vs. Cluster Accuracy

Conclusion

Our application of the subspace error identification algorithm to a loan performance model illustrates the dangers of using high-level summary statistics as the sole determinant of model efficacy and failure to consistently monitor the statistical profile of model input data over time. Often, more advanced statistical analysis is required to comprehensively understand model performance. The algorithm identified sets of loans for which the model was systematically misclassifying default status. These large-scale errors come at a high cost to financial institutions employing such models.

As an extension to this research into high error subspace detection, RiskSpan is currently developing machine learning analytics tools that can detect the root cause of systematic model errors and suggest ways to enhance predictive model performance by alleviating these errors.

Hands-On Machine Learning–Predicting Loan Delinquency

The ability of machine learning models to predict loan performance makes them particularly interesting to lenders and fixed-income investors. This expanded post provides an example of applying the machine learning process to a loan-level dataset in order to predict delinquency. The process includes variable selection, model selection, model evaluation, and model tuning.

The data used in this example are from the first quarter of 2005 and come from the publicly available Fannie Mae performance dataset. The data are segmented into two different sets: acquisition and performance. The acquisition dataset contains 217,000 loans (rows) and 25 variables (columns) collected at origination (Q1 2005). The performance dataset contains the same set of 217,000 loans coupled with 31 variables that are updated each month over the life of the loan. Because there are multiple records for each loan, the performance dataset contains approximately 16 million rows.

For this exercise, the problem is to build a model capable of predicting which loans will become severely delinquent, defined as falling behind six or more months on payments. This delinquency variable was calculated from the performance dataset for all loans and merged with the acquisition data based on the loan’s unique identifier. This brings the total number of variables to 26. Plenty of other hypotheses can be tested, but this analysis focuses on just this one.
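The derivation and merge can be sketched as follows. `LOAN_IDENTIFIER` comes from the dataset overview, while the `MONTHS_DELINQUENT` and `DELINQUENT` field names and the sample rows are hypothetical stand-ins for whatever the performance file actually provides.

```python
def severe_delinquency(performance_rows, threshold=6):
    """Map loan id -> 1 if the loan was ever `threshold`+ months delinquent."""
    flags = {}
    for row in performance_rows:
        loan_id = row["LOAN_IDENTIFIER"]
        hit = row["MONTHS_DELINQUENT"] >= threshold
        flags[loan_id] = int(flags.get(loan_id, 0) or hit)
    return flags

def merge_target(acquisition_rows, flags):
    """Attach the target to each acquisition record by loan identifier."""
    return [dict(row, DELINQUENT=flags.get(row["LOAN_IDENTIFIER"], 0))
            for row in acquisition_rows]

# Hypothetical monthly performance records (several rows per loan).
performance = [
    {"LOAN_IDENTIFIER": "A", "MONTHS_DELINQUENT": 0},
    {"LOAN_IDENTIFIER": "A", "MONTHS_DELINQUENT": 1},
    {"LOAN_IDENTIFIER": "B", "MONTHS_DELINQUENT": 5},
    {"LOAN_IDENTIFIER": "B", "MONTHS_DELINQUENT": 6},
    {"LOAN_IDENTIFIER": "B", "MONTHS_DELINQUENT": 4},
]
acquisition = [
    {"LOAN_IDENTIFIER": "A", "BORROWER_CREDIT_SCORE": 760},
    {"LOAN_IDENTIFIER": "B", "BORROWER_CREDIT_SCORE": 610},
]
model_data = merge_target(acquisition, severe_delinquency(performance))
print(model_data)
```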

1 Variable Selection

An overview of the dataset can be found below, showing the name of each variable as well as the number of observations available.

                                            Count
LOAN_IDENTIFIER                             217088
CHANNEL                                     217088
SELLER_NAME                                 217088
ORIGINAL_INTEREST_RATE                      217088
ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB)     217088
ORIGINAL_LOAN_TERM                          217088
ORIGINATION_DATE                            217088
FIRST_PAYMENT_DATE                          217088
ORIGINAL_LOAN-TO-VALUE_(LTV)                217088
ORIGINAL_COMBINED_LOAN-TO-VALUE_(CLTV)      217074
NUMBER_OF_BORROWERS                         217082
DEBT-TO-INCOME_RATIO_(DTI)                  201580
BORROWER_CREDIT_SCORE                       215114
FIRST-TIME_HOME_BUYER_INDICATOR             217088
LOAN_PURPOSE                                217088
PROPERTY_TYPE                               217088
NUMBER_OF_UNITS                             217088
OCCUPANCY_STATUS                            217088
PROPERTY_STATE                              217088
ZIP_(3-DIGIT)                               217088
MORTGAGE_INSURANCE_PERCENTAGE                34432
PRODUCT_TYPE                                217088
CO-BORROWER_CREDIT_SCORE                    100734
MORTGAGE_INSURANCE_TYPE                      34432
RELOCATION_MORTGAGE_INDICATOR               217088

Most of the variables in the dataset are fully populated, with the exception of DTI, MI Percentage, MI Type, and Co-Borrower Credit Score. Many options exist for dealing with missing variables, including dropping the rows that are missing, eliminating the variable, substituting with a value such as 0 or the mean, or using a model to fill the most likely value.

The following chart plots the frequency of the 34,000 MI Percentage values.

The distribution suggests a decent amount of variability. Most loans that have mortgage insurance are covered at 25%, but there are sizeable populations both above and below. Mortgage insurance is not required for the majority of borrowers, so it makes sense that this value would be missing for most loans. In this context, it makes the most sense to substitute the missing values with 0, since 0% mortgage insurance is an accurate representation of the state of the loan. An alternative that could be considered is to turn the variable into a binary yes/no variable indicating if the loan has mortgage insurance, though this would result in a loss of information.

The next variable with a large number of missing values is Mortgage Insurance Type. Querying the dataset reveals that of the 34,400 loans that have mortgage insurance, 33,000 have type 1 borrower paid insurance and the remaining 1,400 have type 2 lender paid insurance. Like the mortgage insurance variable, the blank values can be filled. This will change the variable to indicate if the loan has no insurance, type 1, or type 2.
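The two fills described so far might look like the sketch below. It assumes missing values arrive as `None`, and it uses 0 as the code for "no insurance," which is our own choice for the example. The sample rows are invented.

```python
def fill_mortgage_insurance(rows):
    """Return copies of rows with missing MI fields filled (None = missing)."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input records are left untouched
        if row.get("MORTGAGE_INSURANCE_PERCENTAGE") is None:
            row["MORTGAGE_INSURANCE_PERCENTAGE"] = 0.0  # no MI means 0% coverage
        if row.get("MORTGAGE_INSURANCE_TYPE") is None:
            row["MORTGAGE_INSURANCE_TYPE"] = 0  # assumed code for "no insurance"
        filled.append(row)
    return filled

# Hypothetical records: loan A has insurance, loan B does not.
rows = [
    {"LOAN_IDENTIFIER": "A", "MORTGAGE_INSURANCE_PERCENTAGE": 25.0,
     "MORTGAGE_INSURANCE_TYPE": 1},
    {"LOAN_IDENTIFIER": "B", "MORTGAGE_INSURANCE_PERCENTAGE": None,
     "MORTGAGE_INSURANCE_TYPE": None},
]
cleaned = fill_mortgage_insurance(rows)
print(cleaned[1])
```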

The remaining variable with a significant number of missing values is Co-Borrower Credit Score, with approximately half of its values missing. Unlike MI Percentage, the context does not allow us to substitute missing values with zeroes. The distribution of both borrower and co-borrower credit score as well as their relationship can be found below.

As the plot demonstrates, borrower and co-borrower credit scores are correlated. Because of this, the removal of co-borrower credit score would only result in a minimal loss of information (especially within the context of this example). Most of the variance captured by co-borrower credit score is also captured in borrower credit score. Turning the co-borrower credit score into a binary yes/no ‘has co-borrower’ variable would not be of much use in this scenario as it would not differ significantly from the Number of Borrowers variable. Alternate strategies such as averaging borrower/co-borrower credit score might work, but for this example we will simply drop the variable.

In summary, the dataset is now smaller: Co-Borrower Credit Score has been dropped. Additionally, missing values for MI Percentage and MI Type have been filled in. Now that the data have been cleaned up, the values and distributions of the remaining variables can be examined to determine what additional preprocessing steps are required before model building. Scatter matrices of pairs of variables and distribution plots of individual variables along the diagonal can be found below. The scatter plots are helpful for identifying multicollinearity between pairs of variables, and the distributions can show if a variable lacks enough variance that it won’t contribute to model performance.

The third row of scatterplots, above, reflects a lack of variability in the distribution of Original Loan Term. The variance of 3.01 (calculated separately) is very small, and as a result the variable can be removed: it will not contribute to any model because there is very little information to learn from. This process of inspecting scatterplots and distributions is repeated for the remaining pairs of variables. The Number of Units variable suffers from the same issue and can also be removed.
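A numeric screen can complement the visual inspection. Because raw variance depends on a variable's units, the sketch below swaps in the coefficient of variation (standard deviation divided by the mean) as a scale-free alternative. The 0.01 cutoff and the sample values are illustrative assumptions, and the measure only makes sense for variables with a positive mean.

```python
from statistics import mean, pstdev

def low_variation_columns(columns, cutoff=0.01):
    """Names of numeric columns whose coefficient of variation is below cutoff."""
    return [name for name, values in columns.items()
            if pstdev(values) / mean(values) < cutoff]

# Hypothetical sample: nearly every loan has a 360-month term.
columns = {
    "ORIGINAL_LOAN_TERM": [360, 360, 360, 360, 360, 359, 360, 360],
    "ORIGINAL_INTEREST_RATE": [5.25, 5.875, 6.0, 5.5, 6.25, 5.75, 5.625, 6.125],
}
print(low_variation_columns(columns))  # ['ORIGINAL_LOAN_TERM']
```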

2 Heatmaps and Pairwise Grids

Matrices of scatterplots are useful for looking at the relationships between variables. Another useful plot is a heatmap and pairwise grid of correlation coefficients. In the plot below a very strong correlation between Original LTV and Original CLTV is identified.

This multicollinearity can be problematic for both the interpretation of the relationship between the variables and delinquency as well as the actual performance of some models. To combat this problem, we remove Original CLTV because Original LTV is a more accurate representation of the loan at origination. Loans in this population that were not refinanced kept their original LTV value as CLTV. If CLTV were included in the model it would introduce information not available at origination to the model. The problem of allowing unexpected additional information in a dataset introduces an issue known as leakage, which will bias the model.
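The correlation coefficients behind such a heatmap can be computed directly. The sketch below uses the Pearson coefficient and an assumed cutoff of 0.9 for "very strong." The six sample loans are invented so that LTV and CLTV differ for only two of them.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)

def correlated_pairs(columns, cutoff=0.9):
    """Pairs of columns whose absolute correlation meets the cutoff."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= cutoff:
                pairs.append((a, b, r))
    return pairs

# Hypothetical values for six loans.
columns = {
    "ORIGINAL_LOAN-TO-VALUE_(LTV)": [80, 75, 95, 60, 90, 70],
    "ORIGINAL_COMBINED_LOAN-TO-VALUE_(CLTV)": [80, 75, 95, 65, 90, 75],
    "ORIGINAL_INTEREST_RATE": [5.5, 6.0, 5.75, 5.25, 6.25, 5.875],
}
pairs = correlated_pairs(columns)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```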

Now that the numeric variables have been inspected, the remaining categorical variables must be analyzed to ensure that the classes are not significantly unbalanced. Count plots and simple descriptive statistics can be used to identify categorical variables that are problematic. Two examples below show the count of loans by state and by seller.

Inspecting the remaining variables uncovers that Relocation Indicator (indicating a mortgage issued when an employer moves an employee) and Product Type (fixed vs. adjustable rate) must be removed as they are extremely unbalanced and do not contain any information that will help the models learn. We also removed first payment date and origination date, which were largely redundant. The final cleanup results in a dataset that contains the following columns:

LOAN_IDENTIFIER 
CHANNEL 
SELLER_NAME
ORIGINAL_INTEREST_RATE
ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB) 
ORIGINAL_LOAN-TO-VALUE_(LTV) 
NUMBER_OF_BORROWERS
DEBT-TO-INCOME_RATIO_(DTI) 
BORROWER_CREDIT_SCORE
FIRST-TIME_HOME_BUYER_INDICATOR 
LOAN_PURPOSE
PROPERTY_TYPE 
OCCUPANCY_STATUS 
PROPERTY_STATE
MORTGAGE_INSURANCE_PERCENTAGE 
MORTGAGE_INSURANCE_TYPE 
ZIP_(3-DIGIT)

The final two steps before model building are to standardize each of the numeric variables and turn each categorical variable into a series of dummy or indicator variables. Numeric variables are scaled to mean 0 and standard deviation 1 so that it is easier to compare variables that have a different scale (e.g. interest rate vs. LTV). Standardizing is also a requirement for many algorithms (e.g. principal component analysis).

Categorical variables are transformed by turning each of the n values of the variable into its own yes/no feature. For example, Property State originally has 50 possible values, so it will be turned into 50 variables (e.g. Alabama yes/no, Alaska yes/no). For categorical variables with many values, this transformation will significantly increase the number of variables in the model.
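The two preparation steps can be sketched as follows, assuming pandas is available. The column names follow the list above, but the four rows of values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy loans; the values are illustrative only.
loans = pd.DataFrame({
    "ORIGINAL_INTEREST_RATE": [3.5, 4.0, 4.5, 5.0],
    "ORIGINAL_LOAN-TO-VALUE_(LTV)": [60.0, 80.0, 95.0, 75.0],
    "PROPERTY_STATE": ["AL", "AK", "AL", "VA"],
})
numeric_cols = ["ORIGINAL_INTEREST_RATE", "ORIGINAL_LOAN-TO-VALUE_(LTV)"]

# Standardize: subtract the column mean, divide by the population standard deviation.
numeric = loans[numeric_cols]
scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

# One yes/no indicator column per distinct category value.
dummies = pd.get_dummies(loans["PROPERTY_STATE"], prefix="PROPERTY_STATE")

prepared = pd.concat([scaled, dummies], axis=1)
print(prepared.shape)
```

Three distinct states become three indicator columns here, which is the same mechanism that expands the full dataset to 106 columns.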

After scaling and transforming the dataset, the final shape is 199,716 rows and 106 columns. The target variable—loan delinquency—has 186,094 ‘no’ values and 13,622 ‘yes’ values. The data are now ready to be used to build, evaluate, and tune machine learning models.

3 Model Selection

Because the target variable loan delinquency is binary (yes/no) the methods available will be classification machine learning models. There are many classification models, including but not limited to: neural networks, logistic regression, support vector machines, decision trees and nearest neighbors. It is always beneficial to seek out domain expertise when tackling a problem to learn best practices and reduce the number of model builds. For this example, two approaches will be tried—nearest neighbors and decision tree.

The first step is to split the dataset into two segments: training and testing. For this example, 40% of the data will be partitioned into the test set, and 60% will remain as the training set. The resulting segmentations are as follows:

1. 60% of the observations (the training set): X_train

2. The associated target (loan delinquency) for each observation in X_train: y_train

3. 40% of the observations (the test set): X_test

4. The targets associated with the test set: y_test

Data should be randomly shuffled before they are split, as datasets are often in some type of meaningful order. Once the data are segmented the model will first be exposed to the training data to begin learning.
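A shuffled 60/40 split of this kind might look like the sketch below. It assumes scikit-learn's `train_test_split` (the article does not show its own code), and the data are a synthetic stand-in rather than the loan dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset: 1,000 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.07).astype(int)  # assumed minority share, for illustration

# shuffle=True is the default; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, shuffle=True, random_state=42
)
print(X_train.shape, X_test.shape)
```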

4 K-Nearest Neighbors Classifier

Training a K-neighbors model requires the fitting of the model on X_train (variables) and y_train (target) training observations. Once the model is fit, a summary of the model hyperparameters is returned. Hyperparameters are model parameters not learned automatically but rather are selected by the model creator.

The K-neighbors algorithm searches for the K closest (i.e., most similar) training examples for each test observation using a metric that calculates the distance between observations in high-dimensional space. Once the nearest neighbors are identified, a predicted class label is generated as the class that is most prevalent in the neighbors. The biggest challenge with a K-neighbors classifier is choosing the number of K neighbors to use. Another significant consideration is the type of distance metric to use.

To see more clearly how this method works, the 6 nearest neighbors of two random observations from the training set were selected: one non-delinquent observation (label 0) and one delinquent observation (label 1).

Random delinquent observation: 28919 
Random non delinquent observation: 59504

The indices of the 6 nearest neighbors of the two random observations, and the Minkowski distances to them, are found below. Unsurprisingly, the first nearest neighbor is always the observation itself, and the first distance is 0.

Indices of the 6 nearest neighbors of obs. 28919: [28919 112677 88645 103919 27218 15512]
Distances to the 6 nearest neighbors of obs. 28919: [0 0.703 0.842 0.883 0.973 1.011]

Indices of the 6 nearest neighbors of obs. 59504: [59504 87483 25903 22212 96220 118043]
Distances to the 6 nearest neighbors of obs. 59504: [0 0.873 1.185 1.186 1.464 1.488]
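A neighbor lookup of this kind can be sketched with scikit-learn's `KNeighborsClassifier`, assuming that is the library in use. The data here are synthetic, and row 10 is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 4))            # synthetic standardized features
y_train = (rng.random(200) < 0.3).astype(int)  # synthetic 0/1 labels

# Default metric is Minkowski with p=2, which is Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Query a training row for its 6 nearest neighbors; the row itself comes back first.
distances, indices = knn.kneighbors(X_train[[10]], n_neighbors=6)
print(indices[0])
print(distances[0].round(3))
```

Because the queried row is part of the training data, it is its own closest neighbor, which is why the first distance is 0.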

Recall that in order to make a classification prediction, the K-neighbors algorithm finds the K nearest neighbors of each observation. Each neighbor is given a ‘vote’ via its class label, and the majority vote wins. Below are the labels (or votes) of either 0 (non-delinquent) or 1 (delinquent) for the 6 nearest neighbors of the random observations. Based on the voting below, the delinquent observation would be classified correctly, as 3 of the 5 nearest neighbors (excluding itself) are also delinquent. The non-delinquent observation would also be classified correctly, with 4 of 5 neighbors voting non-delinquent.

Delinquency label of nearest neighbors- non delinquent observation: [0 1 0 0 0 0]
Delinquency label of nearest neighbors- delinquent observation: [1 0 1 1 0 1]
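The vote count can be reproduced directly from the labels above. The helper `majority_vote` is a hypothetical name used only for this sketch.

```python
from collections import Counter

# Neighbor labels copied from the output above; the first entry is the observation itself.
non_delinquent_neighbors = [0, 1, 0, 0, 0, 0]
delinquent_neighbors = [1, 0, 1, 1, 0, 1]

def majority_vote(labels):
    """Most common label among the neighbors, excluding the observation itself."""
    votes = Counter(labels[1:])
    return votes.most_common(1)[0][0]

print(majority_vote(non_delinquent_neighbors))  # 4 of 5 neighbors vote 0
print(majority_vote(delinquent_neighbors))      # 3 of 5 neighbors vote 1
```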

5 Tree-Based Classifier

Tree-based classifiers learn by segmenting the variable space into a number of distinct regions or nodes. This is accomplished via a process called recursive binary splitting. During this process, observations are repeatedly split into two groups by selecting the variable and cutoff value that result in the highest node purity, where purity is defined as the measure of variance across the two classes. The two most popular purity metrics are the Gini index and cross entropy. A low value for these metrics indicates that the resulting node is pure and contains predominantly observations from the same class. Just like the nearest neighbor classifier, the decision tree classifier makes classification decisions by ‘votes’ from observations within each final node (known as the leaf node).
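To make the purity metric concrete, the sketch below computes the Gini index of a node as 1 minus the sum of squared class proportions, and scores a candidate split by the size-weighted Gini of its two child nodes. The function names are illustrative, not part of any library.

```python
def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in proportions)

def split_gini(left, right):
    """Size-weighted Gini index of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

mixed = [0, 0, 1, 1]  # evenly mixed node
pure = [0, 0, 0, 0]   # pure node

print(gini(mixed), gini(pure))
print(split_gini([0, 0], [1, 1]))  # a split that separates the classes perfectly
```

Recursive binary splitting picks, at each node, the variable and cutoff whose split yields the lowest weighted value.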

To illustrate how this works, a decision tree was created with the number of splitting rules (max depth) limited to 5. An excerpt of this tree can be found below. All of the roughly 120,000 training examples start together in the top box. From top to bottom, each box shows the variable and splitting rule applied to the observations, the value of the Gini metric, the number of observations the rule was applied to, and the current segmentation of the target variable. The first box indicates that the 6th variable, Borrower Credit Score (represented by index 5, ‘X[5]’), was used to split the training examples. Observations where the value of Borrower Credit Score was less than or equal to -0.4413 follow the line to the box on the left. This box shows that 40,262 samples met the criterion. This box also holds the next splitting rule, also applied to the Borrower Credit Score variable. This process continues with X[2] (Original LTV) and so on until the tree is finished growing to its depth of 5. The final segments at the bottom of the tree are the aforementioned leaf nodes, which are used to make classification decisions. When making a prediction on new observations, the same splitting rules are applied and the observation receives the label of the most commonly occurring class in its leaf node.
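A depth-limited tree and its first splitting rule can be inspected as sketched below, assuming scikit-learn's `DecisionTreeClassifier`. The data are synthetic, with a target deliberately driven by the feature at index 1, so the root split is easy to read.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 1] > -0.4).astype(int)  # synthetic target driven by the feature at index 1

tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)
tree.fit(X, y)

root_feature = tree.tree_.feature[0]      # index of the variable used in the first split
root_threshold = tree.tree_.threshold[0]  # observations with value <= threshold go left
print(root_feature, round(root_threshold, 3), tree.get_depth())
```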

A more advanced tree-based classifier is the Random Forest Classifier. The Random Forest works by generating many individual trees, often hundreds or thousands. However, for each tree, the number of variables considered at each split is limited to a random subset. This helps reduce model variance and de-correlate the trees (since each tree will have a different set of available splitting choices). In our example, we fit a random forest classifier on the training data. The resulting hyperparameters and model documentation indicate that by default the model generates 10 trees, considers a random subset of variables the size of the square root of all variables (approximately 10 in this case), has no depth limitation, and only requires each leaf node to have 1 observation.

Since the random forest contains many trees and does not have a depth limitation, it is incredibly difficult to visualize. In order to better understand the model, a plot showing which variables were selected and resulted in the largest drop in the purity metric (Gini index) can be useful. Below are the top 10 most important variables in the model, ranked by the total (normalized) reduction in the Gini index. Intuitively, this plot can be described as showing which variables can be used to best segment the observations into groups that are predominantly one class, either delinquent or non-delinquent.
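An importance ranking of this kind can be sketched with scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute, assuming that library. The data are synthetic, and the target depends only on the first feature so the expected ranking is obvious.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)  # synthetic target that depends only on the first feature

forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Normalized mean decrease in impurity per feature; the values sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
print(ranking, importances.round(3))
```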

6 Model Evaluation

Now that the models have been fitted, their performance must be evaluated. To do this, the fitted model will first be used to generate predictions on the test set (X_test). Next, the predicted class labels are compared to the actual observed class label (y_test). Three of the most popular classification metrics that can be used to compare the predicted and actual values are recall, precision, and the f1-score. These metrics are calculated for each class, delinquent and not-delinquent.

Recall is calculated for each class as the share of actual events that were correctly predicted. More precisely, it is defined as the number of true positive predictions divided by the number of true positive predictions plus false negative predictions. For example, if the data had 10 delinquent observations and 7 were correctly predicted, recall for delinquent observations would be 7/10 or 70%.

Precision is the number of true positives divided by the number of true positives plus false positives. Precision can be thought of as the ratio of events correctly predicted to the total number of events predicted. In the hypothetical example above, assume that the model made a total of 14 predictions for the label delinquent. If so, then the precision for delinquent predictions would be 7/14 or 50%.

The f1 score is calculated as the harmonic mean of recall and precision: 2 * (Precision * Recall) / (Precision + Recall).
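Applying the three definitions to the hypothetical example above (10 delinquent loans, 14 predicted delinquent, 7 of them correct) gives 7 true positives, 7 false positives and 3 false negatives. The function name below is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and f1 score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the hypothetical example in the text.
precision, recall, f1 = precision_recall_f1(tp=7, fp=7, fn=3)
print(precision, recall, round(f1, 3))
```

The f1 score works out to 7/12, or roughly 58%, which sits between the 50% precision and the 70% recall.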

The classification reports for the K-neighbors and decision tree below show the precision, recall, and f1 scores for label 0 (non-delinquent) and 1 (delinquent).

There is no silver bullet for choosing a model; often it comes down to the goals of implementation. In this situation, the tradeoff between identifying more delinquent loans and misclassifying more non-delinquent loans can be analyzed with a specific tool called a ROC curve. When the model predicts a class label, a probability threshold is used to make the decision. This threshold is set by default at 50%, so that observations with more than a 50% chance of membership are assigned to one class and the rest to the other.

The majority vote (of the neighbor observations or the leaf node observations) determines the predicted label. ROC curves allow us to see the impact of varying this voting threshold by plotting the true positive prediction rate against the false positive prediction rate for each threshold value between 0% and 100%.

The area under the ROC curve (AUC) quantifies the model’s ability to distinguish between delinquent and non-delinquent observations. A completely useless model will have an AUC of .5 as the probability for each event is equal. A perfect model will have an AUC of 1 as it is able to perfectly predict each class.
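The sketch below shows how a single point on the ROC curve is obtained for a given threshold, and computes AUC as the share of (delinquent, non-delinquent) pairs in which the delinquent observation receives the higher score, which is an equivalent way of expressing the area. The scores are invented for illustration.

```python
def roc_point(y_true, scores, threshold):
    """True positive rate and false positive rate when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities of delinquency

print(roc_point(y_true, scores, 0.5))  # default 50% threshold
print(roc_point(y_true, scores, 0.3))  # lower threshold: more delinquents caught, more false alarms
print(auc(y_true, scores))
```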

To better illustrate, the ROC curves below plot the true positive rate against the false positive rate on the held-out test set as the threshold is varied.

7 Model Tuning

Up to this point the models have been built and evaluated using a single train/test split of the data. In practice this is often insufficient because a single split does not always provide the most robust estimate of the error on the test set. Additionally, there are more steps required for model tuning. To solve both of these problems it is common to train multiple instances of a model using cross validation. In K-fold cross validation, a portion of the training data that was first created is held out as a third dataset called the validation set. The model is trained on the remaining training data and then evaluated on the validation set. This process is repeated K times, each time holding out a different portion of the training set to validate against. Once the model has been tuned using the train/validation splits, it is tested against the held-out test set just as before. As a general rule, once data have been used to make a decision about the model they should never be used for evaluation.
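The fold mechanics can be sketched without any library. The generator below is a hypothetical helper that assumes the training rows have already been shuffled; it holds out each row for validation exactly once.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each index is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

The test set never appears in these folds, which keeps it untouched for the final evaluation.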

8 K-Nearest Neighbors Tuning

Below, a grid search approach is used to tune the K-nearest neighbors model. The first step is to define all of the possible hyperparameters to try in the model. For the KNN model, the list nk = [10, 50, 100, 150, 200, 250] specifies the number of nearest neighbors to try in each model. The list is used by the function GridSearchCV to build a series of models, each using a different value of nk. By default, GridSearchCV uses 3-fold cross validation. This means that the model will evaluate 3 train/validate splits of the data for each value of nk. Also specified in GridSearchCV is the scoring parameter used to evaluate each model. In this instance it is set to the metric discussed earlier, the area under the ROC curve. GridSearchCV will return the best performing model by default, which can then be used to generate predictions on the test set as before. Many more values of K could be specified to search through, and the default Minkowski distance could be replaced with a series of metrics to try. However, this comes at a cost of computation time that increases significantly with each added hyperparameter.
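A grid search of this shape might be set up as in the sketch below, assuming scikit-learn. The data are synthetic and the list `nk` is shortened to suit the toy dataset. The number of folds is passed explicitly because the `GridSearchCV` default has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))  # synthetic features
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

nk = [5, 15, 25]  # shorter than the article's list, to fit the small synthetic sample

cvknc = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": nk},
    scoring="roc_auc",        # area under the ROC curve
    cv=3,                     # 3-fold cross validation, set explicitly
    return_train_score=True,  # keep training scores for comparison with validation scores
)
cvknc.fit(X_train, y_train)

best_k = cvknc.best_params_["n_neighbors"]
print(best_k, round(cvknc.best_score_, 3))
```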

In the plot below, the mean training and validation scores of the 3 cross-validated splits are plotted for each value of K. The plot indicates that for the lower values of K the model was overfitting the training data, resulting in lower validation scores. As K increases, the training score falls but the validation score rises because the model gets better at generalizing to unseen data.

9 Random Forest Tuning

There are many hyperparameters that can be adjusted to tune the random forest model. We use three in our example: n_estimators, max_features, and min_samples_leaf. The n_estimators parameter refers to the number of trees to be created. This value can be increased substantially, so the search space is set to the list estimators. Random Forests are generally very robust to overfitting, and it is not uncommon to train a classifier with more than 1,000 trees. Second, the number of variables to be randomly considered at each split can be tuned via max_features. Having a smaller value for the number of random features is helpful for decorrelating the trees in the forest, which is especially useful when multicollinearity is present. We tried a number of different values for max_features, which can be found in the list features. Finally, the number of observations required in each leaf node is tuned via the min_samples_leaf parameter and the list samples.
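The three-parameter search can be sketched in the same way, again assuming scikit-learn. The contents of the lists `estimators`, `features` and `samples` are not shown in the article, so the values below are placeholders chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))  # synthetic features
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Placeholder search lists; the article's actual values are not shown.
estimators = [10, 20]
features = [1, 2]
samples = [1, 5]

cvrfc = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": estimators,
        "max_features": features,
        "min_samples_leaf": samples,
    },
    scoring="roc_auc",
    cv=3,
)
cvrfc.fit(X_train, y_train)  # fits every combination: 2 x 2 x 2 = 8 candidates

# Predicted probability of the positive class, usable for a ROC curve.
probabilities = cvrfc.predict_proba(X_train)[:, 1]
print(cvrfc.best_params_)
```

Each added list multiplies the number of candidates, which is why the computation time grows quickly as the grid widens.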

The resulting plot, below, shows a subset of the grid search results. Specifically, it shows the mean test score for each number of trees and leaf size when the number of random features considered at each split is limited to 5. The plot demonstrates that the best performance occurs with 500 trees and a requirement of at least 5 observations per leaf. To see the best performing model from the entire grid space, the best estimator attribute (best_estimator_) can be used.

By default, parameters of the best estimator are assigned to the GridSearch object (cvknc and cvrfc). This object can now be used to generate future predictions or predicted probabilities. In our example, the tuned models are used to generate predicted probabilities on the held-out test set. The resulting ROC curves show an improvement in the KNN model from an AUC of .62 to .75. Likewise, the tuned Random Forest AUC improves from .64 to .77.

Predicting loan delinquency using only origination data is not an easy task. Presumably, if significant signal existed in the data it would trigger a change in strategy by MBS investors and ultimately origination practices. Nevertheless, this exercise demonstrates the capability of a machine learning approach to deconstruct such an intricate problem and suggests the appropriateness of using machine learning models to tackle these and other risk management data challenges relating to mortgages and a potentially wide range of asset classes.
