Articles Tagged with: Innovation and Alternative Data

December 2 Workshop: Structured Data Extraction from Image with Google Document AI

Recorded: Dec. 2nd | 1:00 p.m. EDT

RiskSpan Director Steven Sun shares a procedural approach to tackling the difficulties of efficiently extracting structured data from images, scanned documents, and handwritten documents using Google’s latest Document AI Solution. This approach greatly improves:

  • The effectiveness and accuracy of extracting data that would otherwise be difficult or impossible to obtain, and
  • The automation and streamlining of the process of feeding extracted data into a data analytics framework

Steven Sun

Director, RiskSpan


Executive Interview: Inside the OCC

Watch RiskSpan CEO Bernadette Kogler’s interview with Acting Comptroller of the Currency Brian Brooks.

They discuss many topics, including the OCC’s Project REACh, machine learning models to expand the credit box, blockchain’s role in housing finance, and the expanding definition of a chartered institution.


Workshop: Measuring and Visualizing Feature Impact & Machine Learning Model Materiality

Recorded: Oct. 28th | 1:00 p.m. EDT

RiskSpan CIO Suhrud Dagli, who discussed how ML is being incorporated into model risk management during our Sep. 30 webinar: Machine Learning in Model Validation, demonstrates in greater detail how machine learning can be used:

  • In input data validations,
  • To measure feature impact, and
  • To visualize how multiple features interact with each other

Suhrud Dagli


Co-Founder & Fintech Lead, RiskSpan


Managing Machine Learning Model Risk

Though the terms are often used interchangeably in casual conversation, machine learning is a subset of artificial intelligence. Simply put, ML is the process of getting a computer to learn the properties of one dataset and generalize this “knowledge” to other datasets.


ML Financial Models

ML models have crept into virtually every corner of banking and finance — from fraud and money-laundering prevention to credit and prepayment forecasting, trading, servicing, and even marketing. These models take various forms (see Table 1, below). Modelers base their selection of a particular ML technique on a model’s objective and data availability.   

Table 1. ML Models and Application in Finance

Model | Application
Linear Regression | Credit Risk; Forecasting
Logistic Regression | Credit Risk
Monte Carlo Simulation | Capital Market; (ALM)
Artificial Neural Networks | Score Card and AML
Decision Trees Regression Models (Random Forest, Bagging) | Score Card
Multinomial Logistic Regression | Prepayment Projection
Deep Learning | Prepayment Projection
Time Series Model | Capital Forecasting; Macroeconomics Forecasting Model
Linear Regression with ARIMA Errors | Capital Forecasting
Factor Models | Short Rate Evolution
Fuzzy Matching | AML; OFAC
Linear Discriminant Analysis (LDA) | AML; OFAC
K Means Clustering | AML; OFAC

 

ML models require large datasets relative to conventional models as well as more sophisticated computer programming and econometric/statistical skills. ML model developers are required to have deep knowledge about the ML model they want to use, its assumptions and limitations, and alternative approaches.

 

ML Model Risk

ML models present many of the same risks that accompany conventional models. As with any model, errors in design or application can lead to performance issues resulting in financial losses, poor decisions, and damage to reputation.

ML is all about algorithms. Failing to understand the mathematical aspects of these algorithms can lead to adopting inefficient optimization algorithms without understanding the nature or interpretation of the optimization problem being solved. Making decisions under these circumstances increases model risk and can lead to unreliable outputs.

As sometimes befalls conventional regression models, ML models may perform well on the training data but not on the test data. Their complexity and high dimensionality make them especially susceptible to overfitting. The poor performance of some ML models when applied beyond the training dataset can translate into a huge source of risk.
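The train/test gap described above can be sketched with a deliberately extreme case. This is only an illustration under stated assumptions: the data are synthetic (y = 2x plus noise), and the "model" is a one-nearest-neighbor predictor chosen because it memorizes its training set. None of the names or values below come from the article.

```python
import random

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def nearest_neighbor_predict(train_x, train_y, x):
    """Predict with the label of the closest training point (1-NN)."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

def make_data(n, rng):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

# On training points the nearest neighbor is the point itself, so the
# model reproduces the noise it was trained on.
train_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in train_x]
test_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in test_x]

train_mse = mse(train_y, train_pred)
test_mse = mse(test_y, test_pred)
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

The training error is zero because every training point is its own nearest neighbor, while the test error is not. That gap, rather than either number alone, is the signal a validator would look for.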

Finally, ML models can give rise to unintended consequences when used inappropriately or incorrectly. Model risk is magnified when the goal of an ML model’s algorithm is not aligned with the business problem or does not account for all relevant aspects of that problem. Model risk also arises when an ML model is used outside the environment for which it was designed. These risks include overstated/understated model outputs and lack of fairness. Table 2, below, presents a more comprehensive list of these risks.

Table 2. Potential risk from ML models

Overfitting
Underfitting
Bias toward protected groups
Interpretability
Complexity
Use of poor-quality data
Job displacement
Models may produce socially unacceptable results
Automation may create model governance issues

 

Managing ML Model Risk

It may seem self-evident, but the first step in managing ML model risk consists of reliably identifying every model in the inventory that relies on machine learning. This exercise is not always as straightforward as it might seem. Successfully identifying all ML models requires MRM departments to incorporate the right information requests into their model determination or model assessment forms. These should include questions designed to identify specific considerations of ML model techniques, algorithms, platforms and capabilities. MRM departments need to adopt a consistent but flexible definition of what constitutes an ML model across the institution. Model developers, owners and users should be trained in identifying ML models and those features that need to be reported in the model identification assessment form.

MRM’s next step involves risk assessing ML models in the inventory. As with traditional models, ML models should be risk assessed based on their complexity, materiality and frequency of use. Because of their complexity, however, ML models require an additional level of screening in order to account for data structure, level of algorithm sophistication, number of hyper-parameters, and how the models are calibrated. The questionnaire MRM uses to assess the risk of its conventional models often needs to be enhanced in order to adequately capture the additional risk dimensions introduced by ML models.

Managing ML model risk also involves ensuring not only that a clear model development and implementation process is in place but also that it is consistent with the business objective and the intended use of the models. Thorough documentation is important for any model, but the need to describe model theory, methodology, design and logic takes on added importance when it comes to ML models. This includes specifying the methodology (regression or classification), the type of model (linear regression, logistic regression, natural language processing, etc.), the resampling method (cross-validation, bootstrap) and the subset selection method, such as backward, forward or stepwise selection. Obviously, simply stating that the model “relies on a variety of machine learning techniques” is not going to pass muster.

As with traditional models, developers must document the data source, quality and any transformations that are performed. This includes listing the data sources, normalization and sampling techniques, training and test data size, the data dimension reduction technique (principal component, partial least squares, etc.) as well as controls around them. The risk around the use of certain data should also be assessed.

A model implementation plan and controls around the model should also be developed.

Finally, all model performance testing should be clearly stated, and the results documented. This helps assess whether the model is performing as intended and in line with its design and business objective. Limitations and calibrations around the models should also be documented.

Like traditional models, ML models require independent validation to ensure they are sound and performing as intended and to identify potential limitations. All components of ML models should be subject to validation, including conceptual soundness, outcomes analysis and ongoing monitoring.

Validators can assess the conceptual soundness of an ML model by evaluating its design and construction, focusing on the theory, methodology, assumptions and limitations, data quality and integrity, hyper-parameter calibration and overlays, bias and interpretability.

Validators can perform outcomes analysis by checking whether the model outputs are appropriate and in line with a priori expectations. Results of the performance metrics should also be assessed for accuracy and degree of precision. Performance metrics for ML models vary by model type. As with traditional predictive models, common performance metrics for ML models include the mean-squared-error (MSE), Gini coefficient, entropy, the confusion matrix, and the receiver operating characteristic (ROC) curve.
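Several of the metrics named above can be computed in a few lines for a binary classifier. The sketch below is illustrative only: the labels and scores are made up, the 0.5 threshold is arbitrary, and the Gini coefficient is assumed to mean the accuracy-ratio form common in credit scoring (2 × AUC − 1); the article may instead have the Gini impurity used in decision trees in mind.

```python
def confusion_matrix(actual, scores, threshold=0.5):
    """Count TP, FP, TN, FN for binary labels at a score threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(actual, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

def auc(actual, scores):
    """Area under the ROC curve, computed as the share of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(actual, scores) if y == 1]
    neg = [s for y, s in zip(actual, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mse(actual, scores):
    """Mean squared error of the predicted probabilities."""
    return sum((y - s) ** 2 for y, s in zip(actual, scores)) / len(actual)

# Hypothetical labels (1 = event) and model scores.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]

matrix = confusion_matrix(actual, scores)
area = auc(actual, scores)
gini = 2 * area - 1  # assumed accuracy-ratio definition

print(matrix)                       # {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1}
print(f"AUC={area}, Gini={gini}")   # AUC=0.9375, Gini=0.875
print(f"MSE={mse(actual, scores):.3f}")
```

The pairwise AUC is quadratic in the number of observations, which is fine for a sketch but not for large validation samples.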

Outcomes analysis should also include out-of-sample testing, which can be conducted using cross-validation techniques. Finally, ongoing monitoring should be reviewed as a core element of the validation process. Validators should evaluate whether model use is appropriate given changes in products, exposures and market conditions. Validators should also ensure performance metrics are being monitored regularly based on the inherent risk of the model and frequency of use. Validators should ensure that a continuous performance monitoring plan exists and captures the most important metrics. Also, a change control document and access control document should be available.  
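Out-of-sample testing by cross-validation, as mentioned above, might look like the following minimal sketch. It assumes a simple one-variable least-squares line as a stand-in for whatever model is being validated, and uses synthetic data; the function names and the choice of five folds are illustrative.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def cross_validated_mse(xs, ys, k=5):
    """Fit on k-1 folds, score on the held-out fold, repeat k times."""
    fold_scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_scores.append(sum(errors) / len(errors))
    return fold_scores

# Hypothetical data: a noisy linear relationship.
rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [3.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

fold_mse = cross_validated_mse(xs, ys, k=5)
print("per-fold MSE:", [round(m, 3) for m in fold_mse])
print("average out-of-sample MSE:", round(sum(fold_mse) / len(fold_mse), 3))
```

Reporting the per-fold values alongside the average is a deliberate choice: a wide spread across folds is itself evidence of instability that an average would hide.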

The principles outlined above will sound familiar to any experienced model validator, even one with no ML training or experience. ML models do not upend the framework of MRM best practices but rather add a layer of complexity to their implementation. This complexity requires MRM departments in many cases to adjust their existing procedures to properly identify ML models and suitably capture the risk emerging from them. As is almost always the case, aggressive staff training is needed to ensure that these well-considered process enhancements are faithfully executed and have their desired effect.


September 17 Webinar: Using Alternative Data to Widen the Credit Box

Recorded: Sep. 17th | 1:00 p.m. EDT

RiskSpan’s Bernadette Kogler led a panel of industry experts in a review of the U.S. economy and how mortgage companies can employ alternative data to responsibly extend mortgage credit more broadly to current and potential homeowners.

Participants include 

  • Bernadette Kogler, Co-Founder & CEO, RiskSpan
  • Amy Crews Cutts, President, AC Cutts and Associates
  • Janet Jozwik, Managing Director, RiskSpan
  • Laurie Goodman, Director, Housing Finance Policy Center, The Urban Institute


September 30 Webinar: Machine Learning in Model Validation

Recorded: September 30th | 1:00 p.m. EDT

Join our panel of experts as they share their latest work using machine learning to identify and validate model inputs.

  • Suhrud Dagli, Co-Founder & Fintech Lead, RiskSpan
  • Jacob Kosoff, Head of Model Risk Management & Validation, Regions Bank
  • Nick Young, Head of Model Validation, RiskSpan
  • Sanjukta Dhar, Consulting Partner, Risk and Regulatory Compliance Strategic Initiative, TCS Canada


Featured Speakers

Suhrud Dagli

Co-Founder & Fintech Lead, RiskSpan

Jacob Kosoff

Head of Model Risk Management & Validation, Regions Bank

Nick Young

Head of Model Validation, RiskSpan

Sanjukta Dhar

Consulting Partner, Risk and Regulatory Compliance Strategic Initiative, Tata Consulting


Machine Learning Models: Benefits and Challenges

Having good Prepayment and Credit Models is critical in the analysis of Residential Mortgage-Backed Securities. Prepays and Defaults are the two biggest risk factors that traders, portfolio managers and originators have to deal with. Traditionally, regression-based Behavioral Models have been used to accurately predict human behavior. Since prepayments and defaults are not just complex human decisions but also competing risks, accurately modeling them has been challenging. With the exponential growth in computing power (GPUs, parallel processing), storage (Cloud), “Big Data” (tremendous amount of detailed historical data) and connectivity (high speed internet), Artificial Intelligence (AI) has gained significant importance over the last few years. Machine Learning (ML) is a subset of AI and Deep Learning (DL) is a further subset of ML. The diagram below illustrates this relationship:

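[Diagram: Deep Learning (DL) nested within Machine Learning (ML), nested within Artificial Intelligence (AI)]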

Due to the technological advancements mentioned above, ML based prepayment and credit models are now a reality. They can achieve better predictive power than traditional models and can deal effectively with high-dimensionality (more input variables) and non-linear relationships. The major drawback which has kept them from being universally adopted is their “black box” nature which leads to validation and interpretation issues. Let’s do a quick comparison between traditional and ML models:

[Table: behavioral models versus machine learning models]

There are two ways to train ML models:

  • Supervised Learning  (used for ML Prepay and Credit Models)
    • Regression based
    • Classification based
  • Unsupervised Learning
    • Clustering
    • Association

Let’s compare the major differences between Supervised and Unsupervised Learning:

[Table: supervised learning versus unsupervised learning]

The large amount of loan-level time series data available for RMBS (agency and non-agency) lends itself well to the construction of ML models, and early adopters have reported higher accuracy. Besides the obvious objections mentioned above (black box, lack of control, interpretation), ML models are also susceptible to overfitting (like all other models). Overfitting occurs when a model does very well on the training data but less well on unseen data (the validation set). The model ends up “memorizing” the noise and outliers in the input data and is not able to generalize accurately. The non-parametric and non-linear nature of ML models accentuates this problem. Several techniques have been developed to address it: reducing the complexity of decision trees, expanding the training dataset, adding weak learners, dropout, regularization, reducing the training time, cross-validation, etc.

The interpretation problem is a bit more challenging since users demand both predictive accuracy and some form of interpretability. Several interpretation methods are currently in use, such as PDP (partial dependence plots), ALE (accumulated local effects), PFI (permutation feature importance) and ICE (individual conditional expectation), but each has its shortcomings. Some of the challenges with the interpretability methods are:

  • Isolating Cause and Effect – This is often not possible with supervised ML models since they only exploit associations and do not explicitly model cause/effect relationships.
  • Mistaking Correlation for Dependence – Independent variables have a correlation coefficient of zero, but a zero correlation coefficient does not necessarily imply independence. The correlation coefficient tracks only linear relationships, and the non-linear nature of the models makes dependence difficult to detect this way.
  • Feature Interaction and Dependence – An incorrect conclusion can be drawn about the features’ influence on the target when there are interactions and dependencies between them.
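The point about correlation and dependence in the list above can be checked with a small worked example. The data are hypothetical: y is completely determined by x (y = x²), yet the Pearson correlation coefficient is exactly zero because the relationship is symmetric rather than linear.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y depends entirely on x

print(pearson(xs, ys))  # 0.0
```

A feature screened out for "no correlation" with the target could therefore still carry signal that a non-linear model would use.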

While ML-based prepay and credit models offer better predictive accuracy and automatically capture feature interactions and non-linear effects, they are still a few years away from gaining widespread acceptance. A good use for such models, at this stage, would be in conjunction with traditional models, where they can serve as a benchmark against which to test the traditional models.


Note: Some of the information on this post was obtained from publicly available sources on the internet. The author wishes to thank  Lei Zhao and Du Tang of the modeling group for proofreading this post.


Is Free Public Data Worth the Cost?

No such thing as a free lunch.

The world is full of free (and semi-free) datasets ripe for the picking. If it’s not going to cost you anything, why not supercharge your data and achieve clarity where once there was only darkness?

But is it really not going to cost you anything? What is the total cost of ownership for a public dataset, and what does it take to distill truly valuable insights from publicly available data? Setting aside the reliability of the public source (a topic for another blog post), free data is anything but free. Let us discuss both the power and the cost of working with public data.

To illustrate the point, we borrow from a classic RiskSpan example: anticipating losses to a portfolio of mortgage loans due to a hurricane—a salient example as we are in the early days of the 2020 hurricane season (and the National Oceanic and Atmospheric Administration (NOAA) predicts a busy one). In this example, you own a portfolio of loans and would like to understand the possible impacts to that portfolio (in terms of delinquencies, defaults, and losses) of a recent hurricane. You know this will likely require an external data source because you do not work for NOAA, your firm is new to owning loans in coastal areas, and you currently have no internal data for loans impacted by hurricanes.

Know the Data.

The first step in using external data is understanding your own data. This may seem like a simple task. But data, its source, its lineage, and its nuanced meaning can be difficult to communicate inside an organization. Unless you work with a dataset regularly (i.e., often), you should approach your own data as if it were provided by an external source. The goal is a full understanding of the data, the data’s meaning, and the data’s limitations, all of which should have a direct impact on the types of analysis you attempt.

Understanding the structure of your data and the limitations it puts on your analysis involves questions like:

  • What objects does your data track?
  • Do you have time series records for these objects?
  • Do you only have the most recent record? The most recent 12 records?
  • Do you have one record that tries to capture life-to-date information?

Understanding the meaning of each attribute captured in your data involves questions like:

  • What attributes are we tracking?
  • Which attributes are updated (monthly or quarterly) and which remain static?
  • What are the nuances in our categorical variables? How exactly did we assign the zero-balance code?
  • Is original balance the loan’s balance at mortgage origination, or the balance when we purchased the loan/pool?
  • Do our loss numbers include forgone interest?

These same types of questions also apply to understanding external data sources, but the answers are not always as readily available. Depending on the quality and availability of the documentation for a public dataset, this exercise may be as simple as just reading the data dictionary, or as labor intensive as generating analytics for individual attributes, such as mean, standard deviation, mode, or even histograms, to attempt to derive an attribute’s meaning directly from the delivered data. This is the not-free part of “free” data, and skipping this step can have negative consequences for the quality of analysis you can perform later.
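When documentation is thin, the per-attribute analytics mentioned above can be generated with a short profiling routine. The sketch below uses only the Python standard library; the attribute values (a column of credit scores with one missing entry) and the bin width are hypothetical.

```python
import statistics
from collections import Counter

def profile_attribute(values, bin_width):
    """Summarize one numeric attribute: missing count, mean, standard
    deviation, mode, and a histogram keyed by each bin's lower edge."""
    present = [v for v in values if v is not None]
    histogram = Counter((v // bin_width) * bin_width for v in present)
    return {
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
        "mode": statistics.mode(present),
        "histogram": dict(sorted(histogram.items())),
    }

# Hypothetical values for a single attribute; None marks a missing entry.
credit_scores = [680, 720, None, 720, 700, 760, 720, 640, 700]

profile = profile_attribute(credit_scores, bin_width=50)
for name, value in profile.items():
    print(name, value)
```

Running a routine like this over every column is rarely conclusive on its own, but an unexpected mode or a suspicious spike in one histogram bin is often the first hint that an attribute does not mean what its name suggests.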

Returning to our example, we require at least two external data sets:  

  1. where and when hurricanes have struck, and
  2. loan performance data for mortgages active in those areas at those times.

The obvious choice for loan performance data is the historical performance datasets from the GSEs (Fannie Mae and Freddie Mac). Providing monthly performance information and loss information for defaulted loans for a huge sample of mortgage loans over a 20-year period, these two datasets are perfect for our analysis. For hurricanes, some manual effort is required to extract date, severity, and location from NOAA maps like these (you could get really fancy and gather zip codes covered in the landfall area—which, by leaving out homes hundreds of miles away from expected landfall, would likely give you a much better view of what happens to loans actually impacted by a hurricane—but we will stick to state-level in this simple example).

Make new data your own.

So you’ve downloaded the historical datasets, you’ve read the data dictionaries cover-to-cover, you’ve studied historical NOAA maps, and you’ve interrogated your own data teams for the meaning of internal loan data. Now what? This is yet another cost of “free” data: after all your effort to understand and ingest the new data, all you have is another dataset. A clean, well-understood, well-documented (you’ve thoroughly documented it, haven’t you?) dataset, but a dataset nonetheless. Getting the insights you seek requires a separate effort to merge the old with the new. Let us look at a simplified flow for our hurricane example:

  • Subset the GSE data for active loans in hurricane-related states in the month prior to landfall. Extract information for these loans for 12 months after landfall.
  • Bucket the historical loans by the characteristics you use to bucket your own loans (LTV, FICO, delinquency status before landfall, etc.).
  • Derive delinquency and loss information for the buckets for the 12 months after the hurricane.
  • Apply the observed delinquency and loss information to your loan portfolio (bucketed using the same scheme you used for the historical loans).
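The bucketing and application steps above can be sketched as follows. Everything here is hypothetical: the loan records, the field names, and the two-way FICO/LTV bucketing are placeholders for illustration, and the sketch assumes the GSE data have already been subset to loans active in the affected states before landfall.

```python
from collections import defaultdict

def bucket(loan):
    """Assign a loan to a (FICO band, LTV band) bucket.
    The cut-offs are arbitrary and for illustration only."""
    fico_band = "fico<700" if loan["fico"] < 700 else "fico>=700"
    ltv_band = "ltv<=80" if loan["ltv"] <= 80 else "ltv>80"
    return (fico_band, ltv_band)

def observed_rates(historical_loans):
    """Delinquency rate per bucket over the 12 months after landfall."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [delinquent, total]
    for loan in historical_loans:
        tally = counts[bucket(loan)]
        tally[0] += loan["delinquent_12m"]
        tally[1] += 1
    return {b: delinquent / total for b, (delinquent, total) in counts.items()}

def expected_delinquent_balance(portfolio, rates):
    """Apply historical bucket rates to the current portfolio.
    Raises KeyError if a portfolio bucket has no historical loans."""
    return sum(loan["balance"] * rates[bucket(loan)] for loan in portfolio)

# Hypothetical historical loans (1 = went delinquent within 12 months).
historical = [
    {"fico": 650, "ltv": 90, "delinquent_12m": 1},
    {"fico": 660, "ltv": 85, "delinquent_12m": 0},
    {"fico": 740, "ltv": 70, "delinquent_12m": 0},
    {"fico": 720, "ltv": 75, "delinquent_12m": 0},
    {"fico": 710, "ltv": 60, "delinquent_12m": 1},
    {"fico": 780, "ltv": 80, "delinquent_12m": 0},
]

# Hypothetical current portfolio.
portfolio = [
    {"balance": 200_000, "fico": 640, "ltv": 95},
    {"balance": 400_000, "fico": 750, "ltv": 70},
]

rates = observed_rates(historical)
exposure = expected_delinquent_balance(portfolio, rates)
print(rates)
print(exposure)  # 200000.0
```

The same pattern extends to loss severity by tallying loss amounts instead of delinquency flags. In practice the hard part is the bucketing scheme itself, since buckets with few historical loans produce unstable rates.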

And there you have it—not a model, but a grounded expectation of loan performance following a hurricane. You have stepped out of the darkness and into the data-driven light. And all using free (or “free”) data!

Hyperbole aside, nothing about our example analysis is easy, but it plainly illustrates the power and cost of publicly available data. The power is obvious in our example: without the external data, we have no basis for generating an expectation of losses after a hurricane. While we should be wary of the impacts of factors not captured by our datasets (like the amount and effectiveness of government intervention after each storm – which does vary widely), the historical precedent we find by averaging many storms can form the basis for a robust and defensible expectation. Even if your firm has had experience with loans in hurricane-impacted areas, expanding the sample size through this exercise bolsters confidence in the outcomes. Generally speaking, the use of public data can provide grounded expectations where there had been only anecdotes.

But this power does come at a price—a price that should be appreciated and factored into the decision whether to use external data in the first place. What is worse than not knowing what to expect after a hurricane? Having an expectation based on bad or misunderstood data. Failing to account for the effort required to ingest and use free data can lead to bad analysis and the temptation to cut corners. The effort required in our example is significant: the GSE data is huge, complicated, and will melt your laptop’s RAM if you are not careful. Turning NOAA PDF maps into usable data is not a trivial task, especially if you want to go deeper than the state level. Understanding your own data can be a challenge. Applying an appropriate bucketing to the loans can make or break the analysis. Not all public datasets present these same challenges, but all public datasets present costs. There simply is no such thing as a free lunch. The returns on free data frequently justify these costs. But they should be understood before unwittingly incurring them.


COVID-19 and the Cloud

COVID-19 creates a need for analytics in real time

Regarding the COVID-19 pandemic, Warren Buffett has observed that “we haven’t faced anything that quite resembles this problem” and the fallout is “still hard to evaluate.”

The pandemic has created an unprecedented shock to economies and asset performance. The recent unemployment data, although encouraging, has only added to the uncertainty. Furthermore, impact and recovery are uneven, often varying considerably from county to county and city to city. Consider:

  1. COVID-19 cases and fatalities were initially concentrated in just a few cities and counties resulting in almost a total shutdown of these regions. 
  2. Certain sectors, such as travel and leisure, have been affected worse than others while other sectors such as oil and gas have additional issues. Regions with exposure to these sectors have higher unemployment rates even with fewer COVID-19 cases. 
  3. Timing of reopening and recoveries has also varied due to regional and political factors. 

Regional employment, business activity, consumer spending and several other macro factors are changing in real time. This information is available through several non-traditional data sources. 

Legacy models are not working, and several known correlations are broken. 

Determining value and risk in this environment is requiring unprecedented quantities of analytics and on-demand computational bandwidth. 

COVID-19 in the Cloud

Need for on-demand computation and storage across the organization 

“I don’t need a hard disk in my computer if I can get to the server faster… carrying around these non-connected computers is byzantine by comparison.” ~ Steve Jobs


Front office, risk management, quants and model risk management – every aspect of the analytics ecosystem requires the ability to run a large number of scenarios quickly.

Portfolio managers need to recalibrate asset valuation, manage hedges and answer questions from senior management, all while looking for opportunities to find cheap assets. Risk managers are working closely with quants and portfolio managers to better understand the impact of this unprecedented environment on assets. Quants must not only support existing risk and valuation processes but also be able to run new estimations and explain model behavior as data streams in from a variety of sources.

These activities require several processors and large storage units to be stood up on demand. Even in normal times, infrastructure teams require at least 10 to 12 weeks to procure and deploy additional hardware. With most of the financial services world now working remotely, this time lag is further exacerbated.

No individual firm maintains enough excess capacity to accommodate such a large and urgent need for data and computation. 

The work-from-home model has proven that we have sufficient internet bandwidth to enable the fast access required to host and use data on the cloud. 

Cloud is about how you do computing

“Cloud is about how you do computing, not where you do computing.” ~ Paul Maritz, CEO of VMware 


Cloud computing is now part of everyday vocabulary and powers even the most common consumer devices. However, financial services firms are still in early stages of evaluating and transitioning to a cloud-based computing environment. 

Cloud is the only way to procure the level of surge capacity required today. At RiskSpan we are computing an average of half-million additional scenarios per client on demand. Users don’t have the luxury to wait for an overnight batch process to react to changing market conditions. End users fire off a new scenario assuming that the hardware will scale up automagically. 

When searching Google’s large dataset or using Salesforce to run analytics we expect the hardware scaling to be limitless. Unfortunately, valuation and risk management software are typically built to run on a pre-defined hardware configuration.  

Cloud native applications, in contrast, are designed and built to leverage the on-demand scaling of a cloud platform. Valuation and risk management products offered as SaaS scale on-demand, managing the integration with cloud platforms. 

Financial services firms don’t need to take on the burden of rewriting their software to work on the cloud. Platforms such as RS Edge enable clients to plug their existing data, assumptions and models into a cloud-native platform. This enables them to get all the analytics they’ve always had, just faster and cheaper.

Serverless access can also help companies provide access to their quant groups without incurring additional IT resource expense. 

A recent survey from Flexera shows that 30% of enterprises have increased their cloud usage significantly due to COVID-19.

Cloud is cost effective 

“In 2000, when my partner Ben Horowitz was CEO of the first cloud computing company, Loudcloud, the cost of a customer running a basic Internet application was approximately $150,000 a month.” ~ Marc Andreessen, Co-founder of Netscape, Board Member of Facebook


Cloud hardware is cost effective, primarily due to the on-demand nature of the pricing model. One $250B asset manager uses RS Edge to run millions of scenarios for a 45-minute period every day. The analysis is performed over a thousand servers at a cost of $500 per month. The same hardware, if deployed for 24 hours a day, would cost $27,000 per month.

Cloud is not free and can be a two-edged sword. The same on-demand aspect that enables end users to spin up servers as needed can, if not monitored, cause the cost of such servers to accumulate to undesirable levels. One of the benefits of a cloud-native platform is built-in procedures to drop unused servers, which minimizes the risk of paying for unused bandwidth.

And yes, Mr. Andreessen’s basic application can be hosted today for less than $100 per month.

The same survey from Flexera shows that organizations plan to increase public cloud spending by 47% over the next 12 months. 

Alternate data analysis

“The temptation to form premature theories upon insufficient data is the bane of our profession.” ~ Sir Arthur Conan Doyle, Sherlock Holmes.


Alternate data sources are not always easily accessible and available within analytic applications. The effort and time required to integrate them can be wasted if the usefulness of the information cannot be determined upfront. Timing of analyzing and applying the data is key. 

Machine learning techniques offer quick and robust ways of analyzing data. Tools to run these algorithms are not readily available on a desktop computer.  

Every major cloud platform provides a wealth of tools, algorithms and pre-trained models to integrate and analyze large and messy alternate datasets. 

Join Fintova’s Gary Maier and me at 1 p.m. EDT on June 24th as we discuss other important factors to consider when performing analytics in the cloud. Register now.


Webinar: Data Analytics and Modeling in the Cloud – June 24th

On Wednesday, June 24th, at 1:00 PM EDT, join Suhrud Dagli, RiskSpan’s co-founder and chief innovator, and Gary Maier, managing principal of Fintova, for a free RiskSpan webinar.

Suhrud and Gary will contrast the pros and cons of analytic solutions native to leading cloud platforms, as well as tips for ensuring data security and managing costs.

Click here to register for the webinar.

