Articles Tagged with: Data Governance

Is Free Public Data Worth the Cost?

No such thing as a free lunch.

The world is full of free (and semi-free) datasets ripe for the picking. If it’s not going to cost you anything, why not supercharge your data and achieve clarity where once there was only darkness?

But is it really not going to cost you anything? What is the total cost of ownership for a public dataset, and what does it take to distill truly valuable insights from publicly available data? Setting aside the reliability of the public source (a topic for another blog post), free data is anything but free. Let us discuss both the power and the cost of working with public data.

To illustrate the point, we borrow from a classic RiskSpan example: anticipating losses to a portfolio of mortgage loans due to a hurricane—a salient example as we are in the early days of the 2020 hurricane season (and the National Oceanic and Atmospheric Administration (NOAA) predicts a busy one). In this example, you own a portfolio of loans and would like to understand the possible impacts to that portfolio (in terms of delinquencies, defaults, and losses) of a recent hurricane. You know this will likely require an external data source because you do not work for NOAA, your firm is new to owning loans in coastal areas, and you currently have no internal data for loans impacted by hurricanes.

Know the Data.

The first step in using external data is understanding your own data. This may seem like a simple task. But data, its source, its lineage, and its nuanced meaning can be difficult to communicate inside an organization. Unless you work with a dataset regularly (i.e., often), you should approach your own data as if it were provided by an external source. The goal is a full understanding of the data, the data’s meaning, and the data’s limitations, all of which should have a direct impact on the types of analysis you attempt.

Understanding the structure of your data and the limitations it puts on your analysis involves questions like:

  • What objects does your data track?
  • Do you have time series records for these objects?
  • Do you only have the most recent record? The most recent 12 records?
  • Do you have one record that tries to capture life-to-date information?
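One quick way to answer these structural questions is to count records per tracked object. A minimal pandas check, using hypothetical file and column names, might look like this:

```python
import pandas as pd

# Hypothetical internal servicing file and column names, for illustration only
loans = pd.read_csv("internal_loan_data.csv")

# How many records exist per loan? One row per loan suggests a snapshot or
# life-to-date view; many rows per loan suggests a monthly time series.
records_per_loan = loans.groupby("loan_id").size()
print(records_per_loan.describe())

# If there is a reporting-period column, check how far back the history goes
print(loans["reporting_period"].min(), loans["reporting_period"].max())
```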

Understanding the meaning of each attribute captured in your data involves questions like:

  • What attributes are we tracking?
  • Which attributes are updated (monthly or quarterly) and which remain static?
  • What are the nuances in our categorical variables? How exactly did we assign the zero-balance code?
  • Is original balance the loan’s balance at mortgage origination, or the balance when we purchased the loan/pool?
  • Do our loss numbers include forgone interest?

These same types of questions also apply to understanding external data sources, but the answers are not always as readily available. Depending on the quality and availability of the documentation for a public dataset, this exercise may be as simple as just reading the data dictionary, or as labor intensive as generating analytics for individual attributes, such as mean, standard deviation, mode, or even histograms, to attempt to derive an attribute’s meaning directly from the delivered data. This is the not-free part of “free” data, and skipping this step can have negative consequences for the quality of analysis you can perform later.
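When documentation is thin, a few lines of pandas can surface this kind of attribute-level profile. The file and column names below are hypothetical stand-ins for whatever the external source actually delivers.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical external file and column names, for illustration only
loans = pd.read_csv("external_dataset.csv")

# Mean, standard deviation, and range reveal an attribute's units and scale
print(loans.describe())

# Value counts (including NaN) expose the coding scheme of a categorical field
print(loans["zero_balance_code"].value_counts(dropna=False))

# A histogram can reveal caps, truncation, or unexpected spikes in a numeric field
loans["original_balance"].hist(bins=50)
plt.show()
```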

Returning to our example, we require at least two external data sets:  

  1. where and when hurricanes have struck, and
  2. loan performance data for mortgages active in those areas at those times.

The obvious choice for loan performance data is the historical performance datasets from the GSEs (Fannie Mae and Freddie Mac). Providing monthly performance information and loss information for defaulted loans for a huge sample of mortgage loans over a 20-year period, these two datasets are perfect for our analysis. For hurricanes, some manual effort is required to extract date, severity, and location from NOAA maps. (You could get really fancy and gather the zip codes covered by the landfall area, which, by leaving out homes hundreds of miles away from expected landfall, would likely give you a much better view of what happens to loans actually impacted by a hurricane, but we will stick to the state level in this simple example.)
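Manually extracted hurricane information can be captured in a small structured table, roughly like the sketch below. The values shown are illustrative only, not actual NOAA data.

```python
import pandas as pd

# Hand-extracted from NOAA maps (illustrative values, not actual NOAA data):
# one row per impacted state, with the landfall month and storm severity
hurricanes = pd.DataFrame(
    [
        ("Irma",    "FL", "2017-09", 4),
        ("Irma",    "GA", "2017-09", 4),
        ("Irma",    "SC", "2017-09", 4),
        ("Michael", "FL", "2018-10", 5),
        ("Michael", "GA", "2018-10", 5),
    ],
    columns=["name", "state", "landfall_month", "category"],
)
hurricanes["landfall_month"] = pd.PeriodIndex(hurricanes["landfall_month"], freq="M")
```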

Make new data your own.

So you’ve downloaded the historical datasets, you’ve read the data dictionaries cover-to-cover, you’ve studied historical NOAA maps, and you’ve interrogated your own data teams for the meaning of internal loan data. Now what? This is yet another cost of “free” data: after all your effort to understand and ingest the new data, all you have is another dataset. A clean, well-understood, well-documented (you’ve thoroughly documented it, haven’t you?) dataset, but a dataset nonetheless. Getting the insights you seek requires a separate effort to merge the old with the new. Let us look at a simplified flow for our hurricane example:

  • Subset the GSE data for active loans in hurricane-related states in the month prior to landfall. Extract information for these loans for 12 months after landfall.
  • Bucket the historical loans by the characteristics you use to bucket your own loans (LTV, FICO, delinquency status before landfall, etc.).
  • Derive delinquency and loss information for the buckets for the 12 months after the hurricane.
  • Apply the observed delinquency and loss information to your loan portfolio (bucketed using the same scheme you used for the historical loans).
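A minimal pandas sketch of this flow might look like the following. The file name, column names, delinquency coding, and bucket boundaries are all hypothetical, and the real GSE files require considerably more preparation than shown here.

```python
import pandas as pd

# Hypothetical monthly GSE performance data with columns: loan_id, state, period
# (a monthly pandas Period), dlq_status, loss_amount, fico, ltv
perf = pd.read_parquet("gse_performance.parquet")

landfall = pd.Period("2017-09", freq="M")   # e.g., Hurricane Irma
hit_states = ["FL", "GA", "SC"]             # state-level proxy for impact

# 1. Loans active in hurricane states in the month prior to landfall
active = perf[(perf["period"] == landfall - 1) & (perf["state"].isin(hit_states))]
ids = active["loan_id"].unique()

# 2. Bucket by the same characteristics you use for your own portfolio
active = active.assign(
    fico_bucket=pd.cut(active["fico"], [0, 660, 720, 850]),
    ltv_bucket=pd.cut(active["ltv"], [0, 80, 95, 200]),
)

# 3. Pull the landfall month plus the following 12 months and derive
#    delinquency and loss information per bucket
window = perf[perf["loan_id"].isin(ids) &
              perf["period"].between(landfall, landfall + 12)]
window = window.merge(active[["loan_id", "fico_bucket", "ltv_bucket"]], on="loan_id")
bucket_stats = window.groupby(["fico_bucket", "ltv_bucket"], observed=True).agg(
    dlq_rate=("dlq_status", lambda s: (s >= 3).mean()),   # 90+ days delinquent
    avg_loss=("loss_amount", "mean"),
)

# 4. Apply the observed rates to your own portfolio, bucketed the same way
print(bucket_stats)
```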

And there you have it—not a model, but a grounded expectation of loan performance following a hurricane. You have stepped out of the darkness and into the data-driven light. And all using free (or “free”) data!

Hyperbole aside, nothing about our example analysis is easy, but it plainly illustrates the power and cost of publicly available data. The power is obvious in our example: without the external data, we have no basis for generating an expectation of losses after a hurricane. While we should be wary of the impacts of factors not captured by our datasets (like the amount and effectiveness of government intervention after each storm – which does vary widely), the historical precedent we find by averaging many storms can form the basis for a robust and defensible expectation. Even if your firm has had experience with loans in hurricane-impacted areas, expanding the sample size through this exercise bolsters confidence in the outcomes. Generally speaking, the use of public data can provide grounded expectations where there had been only anecdotes.

But this power does come at a price—a price that should be appreciated and factored into the decision whether to use external data in the first place. What is worse than not knowing what to expect after a hurricane? Having an expectation based on bad or misunderstood data. Failing to account for the effort required to ingest and use free data can lead to bad analysis and the temptation to cut corners. The effort required in our example is significant: the GSE data is huge, complicated, and will melt your laptop’s RAM if you are not careful. Turning NOAA PDF maps into usable data is not a trivial task, especially if you want to go deeper than the state level. Understanding your own data can be a challenge. Applying an appropriate bucketing to the loans can make or break the analysis. Not all public datasets present these same challenges, but all public datasets present costs. There simply is no such thing as a free lunch. The returns on free data frequently justify these costs. But they should be understood before unwittingly incurring them.


Webinar: Data Analytics and Modeling in the Cloud – June 24th

On Wednesday, June 24th, at 1:00 PM EDT, join Suhrud Dagli, RiskSpan’s co-founder and chief innovator, and Gary Maier, managing principal of Fintova, for a free RiskSpan webinar.

Suhrud and Gary will contrast the pros and cons of analytic solutions native to leading cloud platforms and share tips for ensuring data security and managing costs.



Webinar: Using Machine Learning in Whole Loan Data Prep


Tackle one of your biggest obstacles: Curating and normalizing multiple, disparate data sets.

Learn from RiskSpan experts:

  • How to leverage machine learning to help streamline whole loan data prep
  • Innovative ways to manage the differences in large data sets
  • How to automate ‘the boring stuff’


About The Hosts

LC Yarnelle

Director – RiskSpan

LC Yarnelle is a Director with experience in financial modeling, business operations, requirements gathering, and process design. At RiskSpan, LC has worked on model validation and business process improvement/documentation projects. He also led the development of one of RiskSpan’s software offerings and has led multiple development projects for clients, utilizing both Waterfall and Agile frameworks. Prior to RiskSpan, LC was an analyst at NVR Mortgage in the secondary marketing group in Reston, VA, where he was responsible for daily pricing as well as ongoing process improvement activities. Before a career move into finance, LC was the director of operations and a minority owner of a small business in Fort Wayne, IN. He holds a BA from Wittenberg University, as well as an MBA from Ohio State University.

Matt Steele

Senior Analyst – RiskSpan



Automate Your Data Normalization and Validation Processes

Robotic Process Automation (RPA) is the solution for automating mundane, business-rule-based processes so that organizations’ high-value business users can be redeployed to more valuable work.

McKinsey defines RPA as “software that performs redundant tasks on a timed basis and ensures that they are completed quickly, efficiently, and without error.” RPA has enormous savings potential. In RiskSpan’s experience, RPA reduces staff time spent on the target-state process by an average of 95 percent. On recent projects, RiskSpan RPA clients on average saved more than 500 staff hours per year through simple automation. That calculation does not include the potential additional savings gained from the improved accuracy of source data and downstream data-driven processes, which greatly reduces the need for rework. 

The tedious, error-ridden, and time-consuming process of data normalization is familiar to almost all organizations. Complex data systems and downstream analytics are ubiquitous in today’s workplace. Staff that are tasked with data onboarding must verify that source data is complete and mappable to the target system. For example, they might ensure that original balance is expressed as dollar currency figures or that interest rates are expressed as percentages with three decimal places. 

Effective data visualizations sometimes require additional steps, such as adding calculated columns or resorting data according to custom criteria. Staff must match the data formatting requirements with the requirements of the analytics engine and verify that the normalization allows the engine to interact with the dataset. When completed manually, all of these steps are susceptible to human error or oversight. This often results in a need for rework downstream and even more staff hours. 
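The sketch below illustrates, in pandas, the kind of normalization and calculated-column steps described above. The source file, column names, and the rule for detecting decimal-quoted rates are hypothetical examples rather than any client’s actual specification.

```python
import pandas as pd

# Hypothetical source extract and column names, for illustration only
raw = pd.read_csv("source_extract.csv")

# Ensure original balance is expressed as a dollar currency figure
raw["original_balance"] = raw["original_balance"].astype(float)

# If the source quotes interest rates as decimals (e.g., 0.035), convert to
# percentages and round to three decimal places
raw["interest_rate"] = raw["interest_rate"].astype(float)
decimal_quoted = raw["interest_rate"] < 1
raw.loc[decimal_quoted, "interest_rate"] *= 100
raw["interest_rate"] = raw["interest_rate"].round(3)

# Add a calculated column required by the downstream analytics engine
raw["current_ltv"] = (raw["current_balance"] / raw["property_value"] * 100).round(2)

# Re-sort the data according to custom criteria expected by the target system
normalized = raw.sort_values(["state", "original_balance"], ascending=[True, False])
```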

Recently, a client with a proprietary datastore approached RiskSpan with the challenge of normalizing and integrating irregular datasets to comply with their data engine. The non-standard original format and the size of the data made normalization difficult and time consuming. 

After ensuring that the normalization process was optimized for automation, RiskSpan set to work automating data normalization and validation. Expert data consultants automated the process of restructuring data in the required format so that it could be easily ingested by the proprietary engine.  

Our consultants built an automated process that normalized and merged disparate datasets, compared internal and external datasets, and added calculated columns to the data. The processed dataset comprised more than 100 million loans and more than 4 billion records. To optimize for speed, our team programmed a highly resilient validation process that included automated validation checks, error logging (for client staff review), and data correction routines for post-processing and post-validation.

This custom solution reduced the time spent onboarding data from one month of staff work to two days. The end result is a fully functional, normalized dataset that can be trusted for use with downstream applications.

Automating these routine business processes reduced redundancies, eliminated errors, and saved staff time. The solution also cut the resources wasted on rework, along with the associated operational risk and key-person dependencies. Routine tasks were automated with customized validations, which effectively eliminated the need for staff intervention until error thresholds (determined and set by the client during the design process) were breached.
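As a rough illustration of threshold-gated validation with error logging, consider the following sketch. The business rules, column names, and one-percent threshold are illustrative assumptions, not the client’s configuration.

```python
import logging
import pandas as pd

logging.basicConfig(filename="validation_errors.log", level=logging.WARNING)

ERROR_THRESHOLD = 0.01  # illustrative: halt for staff review if >1% of rows fail


def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows that violate business rules and log them for staff review."""
    rules = {
        "negative_balance": df["current_balance"] < 0,
        "missing_state": df["state"].isna(),
        "rate_out_of_range": ~df["interest_rate"].between(0, 25),
    }

    failed = pd.Series(False, index=df.index)
    for name, mask in rules.items():
        if mask.any():
            logging.warning("%s: %d rows failed", name, mask.sum())
        failed |= mask

    error_rate = failed.mean()
    if error_rate > ERROR_THRESHOLD:
        raise RuntimeError(
            f"Error rate {error_rate:.2%} exceeds threshold; manual review required"
        )
    # Below the threshold, drop (or route to correction routines) the failing rows
    return df[~failed]
```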

RiskSpan data and analytics consultants are experienced in helping clients develop robotic process automation solutions for normalizing and aggregating data, creating routine, reliable data outputs, executing business rules, and automating quality control testing. Automating these processes addresses a wide range of business challenges and is particularly useful in routine reporting and analysis.

Talk to RiskSpan today about how custom solutions in robotic process automation can save time and money in your organization. 


Case Study: Datamart Design and Build

The Client

Government Sponsored Enterprise (GSE)

The Problem

A GSE needed a centralized data solution for its forecasting process, which involved cross-functional teams from different business lines (Single Family, Multi Family, Capital Markets).

The client also needed a cloud-based data warehouse to host forecasting outputs for reporting purposes, with faster querying and processing speed.

The input and output files and datasets came from different sources and in different formats. Analysis and transformation were required prior to designing, developing, and loading tables. The client was also migrating data from legacy data sources to new datamarts.

The Solution

RiskSpan was engaged to build and maintain a new centralized datamart (in both Oracle and Amazon Web Services) for the client’s revenue and loss forecasting processes. This included data modeling, historical data upload, and the monthly recurring data process.

The Deliverables

  • Analyzed the end-to-end data flow and data elements
  • Designed data models satisfying business requirements
  • Processed and mapped forecasting input and output files
  • Migrated data from legacy databases to the new sources
  • Built an Oracle datamart and a cloud-based data warehouse (Amazon Web Services)
  • Led the development team in building schemas, tables, and views; process scripts to maintain data updates; and table partitioning logic
  • Resolved data issues with the source and assisted in reconciliation of results

Case Study: ETL Solutions

The Client

Government Sponsored Enterprise (GSE)

The Problem

The client needed ETL solutions for handling data of any complexity or size, in a variety of formats, from different upstream sources.

The client’s data management team extracted and processed data from different sources and different types of databases (e.g., Oracle, Netezza, Excel files, SAS datasets) and needed to load it into the client’s Oracle and AWS datamarts for its revenue and loss forecasting processes.

The client’s forecasting process used very complex, large-scale datasets in different formats that needed to be consumed and loaded in an automated and timely manner.

The Solution

RiskSpan was engaged to design, develop, and implement ETL (Extract, Transform and Load) solutions for handling input and output data for the client’s revenue and loss forecasting processes. This included dealing with large volumes of data and multiple source systems, and transforming and loading data to and from datamarts and data warehouses.

The Deliverables

  • Analyzed data sources and developed ETL strategies for different data types and sources
  • Performed source-to-target mapping in support of report and warehouse technical designs
  • Implemented business-driven requirements using Informatica
  • Collaborated with cross-functional business and development teams to document ETL requirements and turn them into ETL jobs
  • Optimized, developed, and maintained integration solutions as necessary to connect legacy data stores and the data warehouses
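For readers unfamiliar with the pattern, the sketch below shows a generic extract-transform-load job in Python. It is illustrative only: the engagement itself used Informatica, and the connection strings, table names, and file names here are hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings and table names, for illustration only
source = create_engine("oracle+cx_oracle://user:pwd@source-db")
target = create_engine("postgresql://user:pwd@aws-datamart/forecasting")

# Extract: pull from a relational source and a flat-file source
db_frame = pd.read_sql("SELECT * FROM forecast_inputs", source)
file_frame = pd.read_excel("capital_markets_inputs.xlsx")

# Transform: align column names and stamp the load date so the sources can be merged
file_frame = file_frame.rename(columns=str.lower)
combined = pd.concat([db_frame, file_frame], ignore_index=True)
combined["load_date"] = pd.Timestamp.today().normalize()

# Load: write to a staging table in the target datamart
combined.to_sql("forecast_inputs_stg", target, if_exists="append", index=False)
```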


Case Study: Web Based Data Application Build

The Client

Government Sponsored Enterprise (GSE)

The Problem

The Structured Transactions group of a GSE needed to offer broker-dealers a simpler way to create new restructured securities, one that improved ease of use, provided the flexibility to do business at any hour, and reduced dependence on Structured Transactions team members’ availability.

The Solution

RiskSpan led the development of a customer-facing web-based application for a GSE. The GSE’s structured transactions clients use the application to independently create pools of pools and re-combinable REMIC exchanges (RCRs) in line with existing pooling and pricing requirements.

RiskSpan delivered the complete end-to-end technical implementation of the new portal.

The Deliverables

  • Developed a self-service web portal that provides RCR and pool-of-pool exchange capabilities and reporting features
  • Managed data flows from various internal sources to the portal, providing real-time calculations
  • Technology stack included Angular 2.0 and Java for web services
  • Development, testing, and configuration control methodology featured DevOps practices, a CI/CD pipeline, and 100% automated testing with Cucumber and Selenium
  • Git, JIRA, Gherkin, Jenkins, Fisheye/Crucible, and SauceLabs for configuration control, testing, and deployment

Case Study: Loan-Level Capital Reporting Environment

The Client

Government Sponsored Enterprise (GSE)

The Problem

A GSE and large mortgage securitizer maintained data from multiple work streams in several disparate systems, provided at different frequencies. Quarterly and ad-hoc data aggregation, consolidation, reporting and analytics required a significant amount of time and personnel hours.

The client desired configurable integration with source systems, automated acquisition of over 375 million records and performance improvements in report development.

 

The Solution

The client engaged RiskSpan Consulting Services to develop a reporting environment backed by an ETL Engine to automate data acquisition from multiple sources. 

The Deliverables

  • Reviewed system architecture, security protocol, user requirements, and data dictionaries to determine feasibility and approach.
  • Developed a user-configurable ETL Engine in Python to load data from different sources into a PostgreSQL data repository hosted on a Linux server. The engine provides real-time status updates and error tracking.
  • Developed the reporting module of the ETL Engine in Python to automatically generate client-defined Excel reports, reducing report development time from days to minutes.
  • Made raw and aggregated data available for internal users to connect virtually any reporting tool, including Python, R, Tableau, and Excel.
  • Developed a user interface, leveraging the API exposed by the ETL Engine, allowing users to create and schedule jobs as well as stand up user-controlled reporting environments.
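The following is a greatly simplified sketch of the load-and-report pattern such an ETL Engine implements. The connection string, table names, and report query are hypothetical and do not reflect the client’s actual schema.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; the actual engine is user-configurable
repo = create_engine("postgresql://user:pwd@linux-host/capital_reporting")

# Acquire a source extract and load it into the repository
extract = pd.read_csv("source_system_extract.csv")
extract.to_sql("loan_level_staging", repo, if_exists="append", index=False)

# Generate a client-defined Excel report from aggregated data
summary = pd.read_sql(
    "SELECT business_line, SUM(upb) AS total_upb, COUNT(*) AS loan_count "
    "FROM loan_level_staging GROUP BY business_line",
    repo,
)
summary.to_excel("quarterly_capital_report.xlsx", index=False)
```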


RiskSpan Edge Platform API

The RiskSpan Edge Platform API enables direct access to all data from the RS Edge Platform, including both aggregate analytics and loan- and pool-level data. Standard licensed users may build queries in our browser-based graphical interface, but our API provides a channel for power users with programming skills (Python, R, even Excel) and for production systems incorporating RS Edge Platform components as part of their Service Oriented Architecture (SOA).
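For power users, access generally follows a standard REST pattern along the lines of the hypothetical sketch below. The endpoint, parameters, and authentication scheme shown are placeholders only; consult the Edge API documentation for the actual interface.

```python
import requests

BASE_URL = "https://api.riskspan.example/edge"  # placeholder, not the real endpoint
API_KEY = "YOUR_API_KEY"

# Hypothetical query for pool-level analytics as of a given date
response = requests.get(
    f"{BASE_URL}/pools",
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"as_of": "2020-05-31"},
)
response.raise_for_status()
pools = response.json()
```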

Watch RiskSpan Director LC Yarnelle explain the Edge API in this video!


