Articles Tagged with: Data Management

Is Free Public Data Worth the Cost?

No such thing as a free lunch.

The world is full of free (and semi-free) datasets ripe for the picking. If it’s not going to cost you anything, why not supercharge your data and achieve clarity where once there was only darkness?

But is it really not going to cost you anything? What is the total cost of ownership for a public dataset, and what does it take to distill truly valuable insights from publicly available data? Setting aside the reliability of the public source (a topic for another blog post), free data is anything but free. Let us discuss both the power and the cost of working with public data.

To illustrate the point, we borrow from a classic RiskSpan example: anticipating losses to a portfolio of mortgage loans due to a hurricane—a salient example as we are in the early days of the 2020 hurricane season (and the National Oceanic and Atmospheric Administration (NOAA) predicts a busy one). In this example, you own a portfolio of loans and would like to understand the possible impacts to that portfolio (in terms of delinquencies, defaults, and losses) of a recent hurricane. You know this will likely require an external data source because you do not work for NOAA, your firm is new to owning loans in coastal areas, and you currently have no internal data for loans impacted by hurricanes.

Know the Data.

The first step in using external data is understanding your own data. This may seem like a simple task. But data, its source, its lineage, and its nuanced meaning can be difficult to communicate inside an organization. Unless you work with a dataset regularly (i.e., often), you should approach your own data as if it were provided by an external source. The goal is a full understanding of the data, the data’s meaning, and the data’s limitations, all of which should have a direct impact on the types of analysis you attempt.

Understanding the structure of your data and the limitations it puts on your analysis involves questions like:

  • What objects does your data track?
  • Do you have time series records for these objects?
  • Do you only have the most recent record? The most recent 12 records?
  • Do you have one record that tries to capture life-to-date information?

Understanding the meaning of each attribute captured in your data involves questions like:

  • What attributes are we tracking?
  • Which attributes are updated (monthly or quarterly) and which remain static?
  • What are the nuances in our categorical variables? How exactly did we assign the zero-balance code?
  • Is original balance the loan’s balance at mortgage origination, or the balance when we purchased the loan/pool?
  • Do our loss numbers include forgone interest?

These same types of questions also apply to understanding external data sources, but the answers are not always as readily available. Depending on the quality and availability of the documentation for a public dataset, this exercise may be as simple as just reading the data dictionary, or as labor intensive as generating analytics for individual attributes, such as mean, standard deviation, mode, or even histograms, to attempt to derive an attribute’s meaning directly from the delivered data. This is the not-free part of “free” data, and skipping this step can have negative consequences for the quality of analysis you can perform later.
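When the data dictionary is thin, the attribute-level analytics described above can be sketched in a few lines. The example below is only an illustration: the `profile_numeric` helper, the `ltv` column name, and the sample values are all hypothetical and do not come from any particular public dataset.

```python
import statistics
from collections import Counter

def profile_numeric(values, bins=5):
    """Summarize one numeric attribute: counts, mean, stdev, mode, histogram."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    # Assign each value to an equal-width bin; the maximum lands in the last bin.
    hist = Counter(min(int((v - lo) / width), bins - 1) for v in present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
        "mode": statistics.mode(present),
        "histogram": [hist.get(i, 0) for i in range(bins)],
    }

# Hypothetical values for an undocumented "ltv" column.
ltv = [80, 75, 80, 95, None, 60, 80, 999]
summary = profile_numeric(ltv)
print(summary["mode"], summary["missing"], summary["histogram"])
```

A lone value sitting far away in the last histogram bin (999 here) is the kind of thing that often turns out to be a "missing" sentinel rather than a real observation, which is exactly what this step is meant to surface.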

Returning to our example, we require at least two external data sets:  

  1. where and when hurricanes have struck, and
  2. loan performance data for mortgages active in those areas at those times.

The obvious choice for loan performance data is the historical performance datasets from the GSEs (Fannie Mae and Freddie Mac). These two datasets provide monthly performance information, plus loss information for defaulted loans, for a huge sample of mortgage loans over a 20-year period, which makes them well suited to our analysis. For hurricanes, some manual effort is required to extract date, severity, and location from NOAA storm maps. You could get really fancy and gather the zip codes covered in the landfall area, which, by leaving out homes hundreds of miles away from expected landfall, would likely give you a much better view of what happens to loans actually impacted by a hurricane. We will stick to the state level in this simple example.

Make new data your own.

So you’ve downloaded the historical datasets, you’ve read the data dictionaries cover-to-cover, you’ve studied historical NOAA maps, and you’ve interrogated your own data teams for the meaning of internal loan data. Now what? This is yet another cost of “free” data: after all your effort to understand and ingest the new data, all you have is another dataset. A clean, well-understood, well-documented (you’ve thoroughly documented it, haven’t you?) dataset, but a dataset nonetheless. Getting the insights you seek requires a separate effort to merge the old with the new. Let us look at a simplified flow for our hurricane example:

  • Subset the GSE data for active loans in hurricane-related states in the month prior to landfall. Extract information for these loans for 12 months after landfall.
  • Bucket the historical loans by the characteristics you use to bucket your own loans (LTV, FICO, delinquency status before landfall, etc.).
  • Derive delinquency and loss information for the buckets for the 12 months after the hurricane.
  • Apply the observed delinquency and loss information to your loan portfolio (bucketed using the same scheme you used for the historical loans).
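The bucketing, derivation, and application steps above can be sketched roughly as follows. This is a simplified illustration, not the method used in any actual analysis. It assumes the historical loans have already been subset to hurricane-affected states and summarized over the 12 months after landfall. The field names (`fico`, `ltv`, `upb`, `delinquent_12m`, `loss_12m`) and the bucket cutoffs are hypothetical.

```python
from collections import defaultdict

def bucket(loan):
    """Assign a loan to a (FICO band, LTV band) bucket. Cutoffs are illustrative."""
    fico_band = "fico<680" if loan["fico"] < 680 else "fico>=680"
    ltv_band = "ltv<=80" if loan["ltv"] <= 80 else "ltv>80"
    return (fico_band, ltv_band)

def observed_rates(historical):
    """Per bucket: share of balance that went delinquent, and loss per dollar of balance."""
    totals = defaultdict(lambda: {"upb": 0.0, "dq_upb": 0.0, "loss": 0.0})
    for loan in historical:
        t = totals[bucket(loan)]
        t["upb"] += loan["upb"]
        t["dq_upb"] += loan["upb"] if loan["delinquent_12m"] else 0.0
        t["loss"] += loan["loss_12m"]
    return {
        b: {"dq_rate": t["dq_upb"] / t["upb"], "loss_rate": t["loss"] / t["upb"]}
        for b, t in totals.items()
    }

def expected_loss(portfolio, rates):
    """Apply historical loss rates to our loans; collect loans whose bucket has no history."""
    total, unmatched = 0.0, []
    for loan in portfolio:
        r = rates.get(bucket(loan))
        if r is None:
            unmatched.append(loan)
        else:
            total += loan["upb"] * r["loss_rate"]
    return total, unmatched

# Made-up historical loans (already subset and summarized) and a made-up portfolio.
historical = [
    {"fico": 650, "ltv": 90, "upb": 100000.0, "delinquent_12m": True, "loss_12m": 5000.0},
    {"fico": 660, "ltv": 85, "upb": 100000.0, "delinquent_12m": False, "loss_12m": 0.0},
    {"fico": 740, "ltv": 70, "upb": 200000.0, "delinquent_12m": False, "loss_12m": 0.0},
]
portfolio = [
    {"fico": 640, "ltv": 95, "upb": 400000.0},
    {"fico": 700, "ltv": 90, "upb": 100000.0},
]
rates = observed_rates(historical)
total, unmatched = expected_loss(portfolio, rates)
```

Returning the unmatched loans separately is a deliberate choice in this sketch: a portfolio bucket with no historical counterpart is a gap in the analysis, and silently treating it as zero loss would hide that.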

And there you have it—not a model, but a grounded expectation of loan performance following a hurricane. You have stepped out of the darkness and into the data-driven light. And all using free (or “free”) data!

Hyperbole aside, nothing about our example analysis is easy, but it plainly illustrates the power and cost of publicly available data. The power is obvious in our example: without the external data, we have no basis for generating an expectation of losses after a hurricane. While we should be wary of the impacts of factors not captured by our datasets (like the amount and effectiveness of government intervention after each storm – which does vary widely), the historical precedent we find by averaging many storms can form the basis for a robust and defensible expectation. Even if your firm has had experience with loans in hurricane-impacted areas, expanding the sample size through this exercise bolsters confidence in the outcomes. Generally speaking, the use of public data can provide grounded expectations where there had been only anecdotes.

But this power does come at a price—a price that should be appreciated and factored into the decision whether to use external data in the first place. What is worse than not knowing what to expect after a hurricane? Having an expectation based on bad or misunderstood data. Failing to account for the effort required to ingest and use free data can lead to bad analysis and the temptation to cut corners. The effort required in our example is significant: the GSE data is huge, complicated, and will melt your laptop’s RAM if you are not careful. Turning NOAA PDF maps into usable data is not a trivial task, especially if you want to go deeper than the state level. Understanding your own data can be a challenge. Applying an appropriate bucketing to the loans can make or break the analysis. Not all public datasets present these same challenges, but all public datasets present costs. There simply is no such thing as a free lunch. The returns on free data frequently justify these costs. But they should be understood before unwittingly incurring them.


Webinar: Data Analytics and Modeling in the Cloud – June 24th

On Wednesday, June 24th, at 1:00 PM EDT, join Suhrud Dagli, RiskSpan’s co-founder and chief innovator, and Gary Maier, managing principal of Fintova, for a free RiskSpan webinar.

Suhrud and Gary will contrast the pros and cons of analytic solutions native to leading cloud platforms and share tips for ensuring data security and managing costs.


Webinar: Using Machine Learning in Whole Loan Data Prep

Tackle one of your biggest obstacles: Curating and normalizing multiple, disparate data sets.

Learn from RiskSpan experts:

  • How to leverage machine learning to help streamline whole loan data prep
  • Innovative ways to manage the differences in large data sets
  • How to automate ‘the boring stuff’


About The Hosts

LC Yarnelle

Director – RiskSpan

LC Yarnelle is a Director with experience in financial modeling, business operations, requirements gathering and process design. At RiskSpan, LC has worked on model validation and business process improvement/documentation projects. He also led the development of one of RiskSpan’s software offerings, and has led multiple development projects for clients, utilizing both Waterfall and Agile frameworks. Prior to RiskSpan, LC was an analyst at NVR Mortgage in the secondary marketing group in Reston, VA, where he was responsible for daily pricing, as well as ongoing process improvement activities. Before a career move into finance, LC was the director of operations and a minority owner of a small business in Fort Wayne, IN. He holds a BA from Wittenberg University, as well as an MBA from Ohio State University.

Matt Steele

Senior Analyst – RiskSpan


Residential Mortgage REIT: End to End Loan Data Management and Analytics

An inflexible, locally installed risk management system with dated technology required a large IT staff to support it and was incurring high internal maintenance costs.

Absent a single solution, the use of multiple vendors for pricing and risk analytics, prepay/credit models and data storage created inefficiencies in workflow and an administrative burden to maintain.

Inconsistent data and QC across the various sources was also creating a number of data integrity issues.

The Solution

An end-to-end data and risk management solution. The REIT implemented RiskSpan’s Edge Platform, which provides value, cost and operational efficiencies.

  • Scalable, cloud-native technology
  • Increased flexibility to run analytics at loan level; additional interactive / ad-hoc analytics
  • Reliable, accurate data with more frequent updates

Deliverables 

Consolidating from five vendors down to a single platform enabled the REIT to streamline workflows and automate processes, resulting in a 32% annual cost savings and 46% fewer resources required for maintenance.


GSE: Earnings Forecasting Framework Development

A $100+ billion government-sponsored enterprise with more than $3 trillion in assets sought to develop an end-to-end earnings forecast framework to project and stress-test the future performance of its loan portfolio. The comprehensive framework needed to draw data from a combination of unintegrated systems to compute earnings, capital management requirements and other ad hoc reporting under a variety of internal and regulatory (i.e., DFAST) stress scenarios. 

Computing these metrics required cross-functional team coordination, proper data governance, and a reliable audit trail, all of which were posing a challenge.

The Solution

RiskSpan addressed these needs via three interdependent workstreams: 

Data Preparation

RiskSpan consolidated multiple data sources required by the earnings forecast framework. These included: 

  • Macroeconomic drivers, including interest rates and unemployment rate 
  • Book profile, including up-to-date snapshots of the portfolio’s performance data 
  • Modeling assumptions, including portfolio performance history and other asset characteristics 

Model Simulation

Because the portfolio in question consisted principally of mortgage assets, RiskSpan incorporated more than 20 models into the framework, including (among others): 

  • Prepayment Model 
  • Default Model 
  • Delinquency Model 
  • Acquisition Model: Future loans 
  • Severity Model  
  • Cash Flow Model 

Business Calculations and Reporting

Using the data and models above, RiskSpan incorporated the following outputs into the earnings forecast framework: 

  • Non-performing asset treatment 
  • When to charge-off delinquent loans 
  • Projected loan losses under FAS114/CECL  
  • Revenue Forecasts 
  • Capital Forecast 

Client Benefits

The earnings forecast framework RiskSpan developed represented a significant improvement over the client’s previous system of disconnected data, unintegrated models, and error-prone workarounds. Benefits of the new system included:  

  • User Interface – Improved process for managing loan lifecycles and GUI-based process execution  
  • Data Lineage – Implemented the constraints necessary to ensure forecasting processes are executed in sequence and are repeatable. Created a predefined, dynamic output lineage tree (UI-accessible) to build a robust data flow sequence that facilitates what-if scenario analysis.
  • Run Management – Assigned a unique run ID to every execution to ensure individual users across the institution can track and reuse execution results 
  • Audit Trail – Designed logging of forecasting run details to trace attributes such as version changes (Version control system – GIT, SVN), timestamp, run owner, and inputs used (MySQL/Oracle Databases for logging)  
  • Identity Access Management – User IDs and access are now managed administratively. Metadata is captured via user actions through the framework for audit purposes. Role-based restrictions now ensure data and forecasting features are limited to only those who require such permissions
  • Golden Configuration – Implemented execution-specific parameters passed to models during runtime. These parameters are stored, enabling any past model result to be reproduced if needed 
  • Data Masking – Encrypted personally identifiable information at-rest and in transit 
  • Data Management – Execution logs and model/report outputs are stored to the database and file systems 
  • Comprehensive User and Technical Documentation – RiskSpan created audit-ready documentation tied to logic changes and execution. This included source-to-target mapping documentation and enterprise-grade catalogs and data dictionaries. Documentation also included: 
      • Vision Document 
      • User Guides 
      • Testing Evidence 
      • Feature Traceability Matrix 
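To make the run management, audit trail, and golden configuration ideas above concrete, here is a minimal sketch. It is not the framework RiskSpan built. The `record_run` and `params_for` helpers are hypothetical, and an in-memory list stands in for the database logging the case study mentions.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def record_run(owner, model_version, params, inputs, log):
    """Append an audit record with a unique run ID and the exact parameters used."""
    canonical = json.dumps(params, sort_keys=True)  # stable text form of the parameters
    entry = {
        "run_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "owner": owner,
        "model_version": model_version,
        "params": json.loads(canonical),  # stored copy, detached from the caller's dict
        "params_hash": hashlib.sha256(canonical.encode()).hexdigest(),
        "inputs": list(inputs),
    }
    log.append(entry)
    return entry["run_id"]

def params_for(run_id, log):
    """Look up the stored parameters so a past run can be reproduced."""
    for entry in log:
        if entry["run_id"] == run_id:
            return entry["params"]
    raise KeyError(run_id)

# Hypothetical usage: the list stands in for a logging table.
audit_log = []
run_a = record_run("analyst1", "v1.2", {"scenario": "severe", "horizon": 9}, ["book.csv"], audit_log)
run_b = record_run("analyst2", "v1.2", {"scenario": "base", "horizon": 9}, ["book.csv"], audit_log)
```

Storing a hash alongside the parameters gives a cheap way to check whether two runs used identical settings without comparing every field.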


Automate Your Data Normalization and Validation Processes

Robotic Process Automation (RPA) is the solution for automating mundane, business-rule-based processes so that organizations’ high-value business users can be deployed to more valuable work.

McKinsey defines RPA as “software that performs redundant tasks on a timed basis and ensures that they are completed quickly, efficiently, and without error.” RPA has enormous savings potential. In RiskSpan’s experience, RPA reduces staff time spent on the target-state process by an average of 95 percent. On recent projects, RiskSpan RPA clients on average saved more than 500 staff hours per year through simple automation. That calculation does not include the potential additional savings gained from the improved accuracy of source data and downstream data-driven processes, which greatly reduces the need for rework. 

The tedious, error-ridden, and time-consuming process of data normalization is familiar to almost all organizations. Complex data systems and downstream analytics are ubiquitous in today’s workplace. Staff that are tasked with data onboarding must verify that source data is complete and mappable to the target system. For example, they might ensure that original balance is expressed as dollar currency figures or that interest rates are expressed as percentages with three decimal places. 
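As a rough illustration of the two checks just mentioned, the sketch below normalizes a balance to a dollar figure and a rate to a percentage with three decimal places. The function names are hypothetical. The rule that unlabeled values below 1 are decimal fractions is an assumption made for this example, and a real onboarding process would confirm it against the source documentation.

```python
from decimal import Decimal, ROUND_HALF_UP

def normalize_balance(raw):
    """Parse a balance such as '$250,000' into a Decimal dollar amount with two places."""
    cleaned = str(raw).replace("$", "").replace(",", "").strip()
    return Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def normalize_rate(raw):
    """Express an interest rate as a percentage with three decimal places.

    Assumption: a value below 1 with no '%' sign is a decimal fraction
    (0.04375 means 4.375 percent).
    """
    text = str(raw)
    value = Decimal(text.replace("%", "").strip())
    if value < 1 and "%" not in text:
        value *= 100
    return value.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

print(normalize_balance("$250,000"))  # 250000.00
print(normalize_rate("4.375%"))       # 4.375
print(normalize_rate(0.04375))        # 4.375
print(normalize_rate("3.5"))          # 3.500
```

`Decimal` is used rather than `float` so that the stored number of decimal places is exact and survives round trips through text files.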

Effective data visualizations sometimes require additional steps, such as adding calculated columns or re-sorting data according to custom criteria. Staff must match the data formatting requirements with the requirements of the analytics engine and verify that the normalization allows the engine to interact with the dataset. When completed manually, all of these steps are susceptible to human error or oversight. This often results in a need for rework downstream and even more staff hours.

Recently, a client with a proprietary datastore approached RiskSpan with the challenge of normalizing and integrating irregular datasets to comply with their data engine. The non-standard original format and the size of the data made normalization difficult and time consuming. 

After ensuring that the normalization process was optimized for automation, RiskSpan set to work automating data normalization and validation. Expert data consultants automated the process of restructuring data in the required format so that it could be easily ingested by the proprietary engine.  

Our consultants built an automated process that normalized and merged disparate datasets, compared internal and external datasets, and added calculated columns to the data. The processed dataset contained more than 100 million loans and more than 4 billion records. To optimize for speed, our team programmed a highly resilient validation process that included automated validation checks, error logging (for client staff review), and data correction routines for post-processing and post-validation.

This custom solution reduced time spent onboarding data from one month of staff work down to two days of staff work. The end result is a fully functional, normalized dataset that can be trusted for use with downstream applications.

RiskSpan’s experience automating routine business processes reduced redundancies, eliminated errors, and saved staff time. This solution reduced resources wasted on rework and its associated operational risk and key-person dependencies. Routine tasks were automated with customized validations. This customization effectively eliminated the need for staff intervention until certain error thresholds were breached. The client determined and set these thresholds during the design process. 
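The idea of logging validation failures and involving staff only once an error threshold is breached might look something like the sketch below. It is an illustration under assumptions, not the client solution: the `validate_batch` helper, the two checks, the field names, and the threshold values are all hypothetical.

```python
def validate_batch(rows, checks, max_error_rate=0.01):
    """Run every check on every row; log failures and flag the batch for
    staff review only when the share of failing rows exceeds the threshold."""
    errors = []
    for i, row in enumerate(rows):
        for name, check in checks.items():
            if not check(row):
                errors.append({"row": i, "check": name})
    failed_rows = {e["row"] for e in errors}
    error_rate = len(failed_rows) / len(rows) if rows else 0.0
    return {
        "errors": errors,  # error log for client staff review
        "error_rate": error_rate,
        "needs_review": error_rate > max_error_rate,
    }

# Hypothetical checks and rows.
checks = {
    "balance_positive": lambda r: r["orig_balance"] > 0,
    "rate_in_range": lambda r: 0 < r["rate"] < 20,
}
rows = [
    {"orig_balance": 250000, "rate": 4.375},
    {"orig_balance": -5, "rate": 4.0},
    {"orig_balance": 100000, "rate": 3.5},
    {"orig_balance": 90000, "rate": 45.0},
]
lenient = validate_batch(rows, checks, max_error_rate=0.5)
strict = validate_batch(rows, checks, max_error_rate=0.25)
```

In this made-up batch, half the rows fail a check, so the batch passes untouched under the lenient threshold and is flagged under the strict one.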

RiskSpan data and analytics consultants are experienced in helping clients develop robotic process automation solutions for normalizing and aggregating data, creating routine, reliable data outputs, executing business rules, and automating quality control testing. Automating these processes addresses a wide range of business challenges and is particularly useful in routine reporting and analysis.

Talk to RiskSpan today about how custom solutions in robotic process automation can save time and money in your organization. 


GSE: Datamart Design and Build

The Problem

A government-sponsored enterprise needed a centralized data solution for its forecasting process, which involved cross-functional teams from different business lines.

The firm also sought a cloud-based data warehouse to host forecasting outputs for reporting purposes with faster querying and processing speeds.

The firm also needed assistance migrating data from legacy data sources to new datamarts. The input and output files and datasets had different sources and were often in different formats. Analysis and transformation were required prior to designing, developing and loading tables.  

The Solution

RiskSpan built and now maintains a new centralized datamart (in both Oracle and Amazon Web Services) for the client’s revenue and loss forecasting processes. This includes data modeling, historical data upload, and the monthly recurring data process.

The Deliverables

  • Analyzed the end-to-end data flow and data elements
  • Designed data models satisfying business requirements
  • Processed and mapped forecasting input and output files
  • Migrated data from legacy databases to the new sources
  • Built an Oracle datamart and a cloud-based data warehouse (Amazon Web Services)
  • Led the development team in building schemas, tables, and views, along with process scripts to maintain data updates and table partitioning logic
  • Resolved data issues with the source and assisted in reconciliation of results


GSE: ETL Solutions

The Problem

The client needed ETL solutions for handling data of any complexity or size in a variety of formats and/or from different upstream sources.

The client’s data management team extracted and processed data from different sources and different types of databases (e.g., Oracle, Netezza, Excel files, SAS datasets) and needed to load it into its Oracle and AWS datamarts for its revenue and loss forecasting processes.

The client’s forecasting process used very complex, large-scale datasets in different formats that needed to be consumed and loaded in an automated and timely manner.

The Solution

RiskSpan was engaged to design, develop and implement ETL (Extract, Transform and Load) solutions for handling input and output data for the client’s revenue and loss forecasting processes. This included dealing with large volumes of data and multiple source systems, and transforming and loading data to and from data marts and data warehouses.

The Deliverables

  • Analyzed data sources and developed ETL strategies for different data types and sources
  • Performed source target mapping in support of report and warehouse technical designs
  • Implemented business-driven requirements using Informatica
  • Collaborated with cross-functional business and development teams to document ETL requirements and turn them into ETL jobs
  • Optimized, developed, and maintained integration solutions as necessary to connect legacy data stores and the data warehouses


Case Study: Web Based Data Application Build

The Client

Government Sponsored Enterprise (GSE)

The Problem

The Structured Transactions group of a GSE needed to offer broker-dealers a simpler way to create new restructured securities (improved ease of use), one that provided the flexibility to do business at any hour and reduced dependence on Structured Transactions team members’ availability.

The Solution

RiskSpan led the development of a customer-facing web-based application for a GSE. Their structured transactions clients use the application to independently create pools of pools and re-combinable REMIC exchanges (RCRs) with existing pooling and pricing requirements.

RiskSpan delivered the complete end-to-end technical implementation of the new portal.

The Deliverables

  • Development included self-service web portal that provides RCR, pool-of-pool exchange capabilities, reporting features
  • Managed data flows from various internal sources to the portal, providing real-time calculations
  • Latest technology stack included Angular 2.0, Java for web services
  • Development, testing, and config control methodology featured DevOps practices, CI/CD pipeline, 100% automated testing with Cucumber, Selenium
  • GIT, JIRA, Gherkin, Jenkins, Fisheye/Crucible, SauceLabs, for config control, testing, deployment
