Articles Tagged with: Data Management

Big Companies; Big Data Issues

Data issues plague organizations of all sorts and sizes. But generally, the bigger the dataset, and the more transformations the data goes through, the greater the likelihood of problems. Organizations take in data from many different sources, including social media, third-party vendors and other structured and unstructured origins, resulting in massive and complex data storage and management challenges. This post presents ideas to keep in mind when seeking to address these challenges.

First, a couple of definitions:

Data quality generally refers to the fitness of a dataset for its purpose in a given context. Data quality encompasses many related aspects, including:

  • Accuracy,
  • Completeness,
  • Update status,
  • Relevance,
  • Consistency across data sources,
  • Reliability,
  • Appropriateness of presentation, and
  • Accessibility

Data lineage tracks data movement, including its origin and where it moves over time. Data lineage can be represented visually to depict how data flows from its source to its destination via various changes and hops.

The challenges facing many organizations relate to both data quality and data lineage issues, and a considerable amount of time and effort is spent both in tracing the source of data (i.e., its lineage) and correcting errors (i.e., ensuring its quality). Business intelligence and data visualization tools can do a magnificent job of teasing stories out of data, but these stories are only valuable when they are true. It is becoming increasingly vital to adopt best practices to ensure that the massive amounts of data feeding downstream processes and presentation engines are both reliable and properly understood.

Financial institutions must frequently deal with disparate systems either because of mergers and acquisitions or in order to support different product types—consumer lending, commercial banking and credit cards, for example. Disparate systems tend to result in data silos, and substantial time and effort must go into providing compliance reports and meeting the various regulatory requirements associated with analyzing data provenance (from source to destination). Understanding the workflow of data and access controls around security are also vital applications of data lineage and help ensure data quality.

In addition to the obvious need for financial reporting accuracy, maintaining data lineage and quality is vital to identifying redundant business rules and data and to ensuring that reliable, analyzable data is constantly available and accessible. It also helps to improve the data governance ecosystem, enabling data owners to focus on gleaning business insights from their data rather than on rectifying data issues.

Common Data Lineage Issues

A surprising number of data issues emerge simply from uncertainty surrounding a dataset’s provenance. Many of the most common data issues stem from one or more of the following categories:

  • Human error: “Fat fingering” is just the tip of the iceberg. Misinterpretation and other issues arising from human intervention are at the heart of virtually all data issues.
  • Incomplete Data: Whether it’s drawing conclusions based on incomplete data or relying on generalizations and judgment to fill in the gaps, many data issues are caused by missing data.
  • Data format: Systems expect to receive data in a certain format. Issues arise when the actual input data departs from these expectations.
  • Data consolidation: Migrating data from legacy systems or attempting to integrate newly acquired data (from a merger, for instance) frequently leads to post-consolidation issues.
  • Data processing: Calculation engines, data aggregators, or any other program designed to transform raw data into something more “usable” always run the risk of creating output data with quality issues.

Addressing Issues

Issues relating to data lineage and data quality are best addressed by employing some combination of the following approaches. The specific blend of approaches depends on the types of issues and data in question, but these principles are broadly applicable.

Employing a top-down discovery approach enables data analysts to understand the key business systems and business data models that drive an application. This approach is most effective when logical data models are linked to the physical data and systems.

Creating a rich metadata repository for all the data elements flowing from the source to destination can be an effective way of heading off potential data lineage issues. Because data lineage is dependent on the metadata information, creating a robust repository from the outset often helps preserve data lineage throughout the life cycle.
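To make the idea concrete, here is a minimal sketch of how lineage metadata for a single data element might be recorded in Python. The field names and example values are hypothetical and are meant only to illustrate the kind of source-to-destination detail worth capturing.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ElementLineage:
        """Lineage metadata for one data element as it flows from source to destination."""
        element: str                               # e.g., a column such as "orig_upb"
        source_system: str                         # system where the element originates
        transformations: List[str] = field(default_factory=list)
        destination: str = ""                      # report or system that ultimately consumes it

        def record_hop(self, step: str) -> None:
            """Append a transformation or hop so the full path is preserved."""
            self.transformations.append(step)

    # Hypothetical usage: trace a balance field from a servicing system to a risk report
    upb = ElementLineage(element="orig_upb", source_system="servicing_db")
    upb.record_hop("currency normalized to USD")
    upb.record_hop("aggregated to deal level")
    upb.destination = "monthly_risk_report"
    print(upb)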

Imposing useful data quality rules is an important element in establishing a framework in which data is always validated against a set of well-conceived business rules. Ensuring not only that data passes comprehensive rule sets but also that remediation factors are in place for appropriately dealing with data that fails quality control checks is crucial for ensuring end-to-end data quality.
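A minimal sketch of what rule-based validation might look like in practice appears below. The rules, field names, and threshold values are hypothetical; a production framework would also route failing records into the kind of remediation workflow described above.

    from typing import Callable, Dict, List, Tuple

    # Hypothetical quality rules: each maps a rule name to a check applied to one record
    RULES: Dict[str, Callable[[dict], bool]] = {
        "upb_positive": lambda r: r.get("orig_upb", 0) > 0,
        "state_present": lambda r: bool(r.get("state")),
        "ltv_in_range": lambda r: 0 < r.get("ltv", 0) <= 125,
    }

    def validate(records: List[dict]) -> Tuple[List[dict], List[dict]]:
        """Split records into those passing every rule and those needing remediation."""
        passed, failed = [], []
        for rec in records:
            violations = [name for name, check in RULES.items() if not check(rec)]
            if violations:
                failed.append({"record": rec, "violations": violations})
            else:
                passed.append(rec)
        return passed, failed

    clean, remediation_queue = validate([
        {"orig_upb": 250000, "state": "VA", "ltv": 80},
        {"orig_upb": -1, "state": "", "ltv": 140},
    ])
    print(len(clean), "passed;", len(remediation_queue), "sent to remediation")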

Data lineage and data quality both require continuous monitoring by a defined stewardship council to ensure that data owners are taking appropriate steps to understand and manage the idiosyncrasies of the datasets they oversee.

Our Data Lineage and Data Quality Background

RiskSpan’s diverse client base includes several large banks (which we define as banks with assets totaling in excess of $50 billion). Large banks are characterized by a complicated web of departments and sub-organizations, each offering multiple products, sometimes to the same base of customers. Different sub-organizations frequently rely on disparate systems (sometimes due to mergers and acquisitions; sometimes just because they develop their businesses independently of one another). Either way, data silos inevitably result.

RiskSpan has worked closely with chief data officers of large banks to help establish data stewardship teams charged with taking ownership of the various “areas” of data within the bank. This involves the identification of data “curators” within each line of business to coordinate with the CDO’s office and be the advocate (and ultimately the responsible party) for the data they “own.” In best practice scenarios, a “data curator” group is formed to facilitate collaboration and effective communication for data work across the line of business.

We have found that a combination of top-down and bottom-up data discovery approaches is most effective when working across stakeholders to understand existing systems and enterprise data assets. RiskSpan has helped create logical data flow diagrams (based on the top-down approach) and assisted with linking physical data models to the logical data models. We have found Informatica and Collibra tools to be particularly useful in creating data lineage, tracking data owners, and tracing data flow from source to destination.

Complementing our work with financial clients to devise LOB-based data quality rules, we have built data quality dashboards using these same tools to enable data owners and curators to rectify and monitor data quality issues. These projects typically include elements of the following components.

  • Initial assessment review of the current data landscape.
  • Establishment of a logical data flow model using both top-down and bottom-up data discovery approaches.
  • Coordination with the CDO / CIO office to set up a data governance stewardship team and to identify data owners and curators from all parts of the organization.
  • Delineation of data policies, data rules and controls associated with different consumers of the data.
  • Development of a target state model for data lineage and data quality by outlining the process changes from a business perspective.
  • Development of future-state data architecture and associated technology tools for implementing data lineage and data quality.
  • Invitation to client stakeholders to reach a consensus related to future-state model and technology architecture.
  • Creation of a project team to execute data lineage and data quality projects by incorporating the appropriate resources and client stakeholders.
  • Development of a change management and migration strategy to enable users and stakeholders to use data lineage and data quality tools.

Ensuring data quality and lineage is ultimately the responsibility of the business lines that own and use the data. Because “data management” is not the principal aim of most businesses, it often behooves them to leverage the principles outlined in this post (sometimes along with outside assistance) to implement tactics that will help ensure that the stories their data tell are reliable.


MDM to the Rescue for Financial Institutions

Data becomes an asset only when it is efficiently harnessed and managed. Because firms tend to evolve into silos, their data often gets organized that way as well, resulting in multiple references and unnecessary duplication of data that dilute its value. Master Data Management (MDM) architecture helps to avoid these and other pitfalls by applying best practices to maximize data efficiency, controls, and insights.

MDM has particular appeal to banks and other financial institutions where non-integrated systems often make it difficult to maintain a comprehensive, 360-degree view of a customer who simultaneously has, for example, multiple deposit accounts, a mortgage, and a credit card. MDM provides a single, common data reference across systems that traditionally have not communicated well with each other. Customer-level reports can point to one central database instead of searching for data across multiple sources.
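As a simplified, hypothetical sketch of this single-common-reference idea, the snippet below collapses customer records keyed differently in two siloed systems into one master record. A real MDM platform layers matching, survivorship rules, and governance on top of anything this simple.

    # Hypothetical records for the same customer held in two siloed systems
    deposits = {"cust_id": "D-1001", "name": "Jane Doe", "email": "jdoe@example.com"}
    mortgage = {"borrower_id": "M-77", "name": "Jane Doe", "phone": "555-0100"}

    def build_master_record(master_id: str, *sources: dict) -> dict:
        """Collapse attributes from several source systems into one golden record,
        keeping the first non-empty value seen for each attribute and preserving
        the original keys so the record can be traced back to each silo."""
        golden = {"master_id": master_id, "source_keys": []}
        for rec in sources:
            for key, value in rec.items():
                if key.endswith("_id"):
                    golden["source_keys"].append(value)
                elif value and key not in golden:
                    golden[key] = value
        return golden

    # Customer-level reports can now point at the golden record instead of both silos
    print(build_master_record("CUST-0001", deposits, mortgage))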

Financial institutions also derive considerable benefit from MDM when seeking to comply with regulatory reporting requirements and when generating reports for auditors and other examiners. Mobile banking and the growing number of new payment mechanisms make it increasingly important for financial institutions to have a central source of data intelligence. An MDM strategy enables financial institutions to harness their data and generate more meaningful insights from it by:

  • Eliminating data redundancy and providing one central repository for common data;
  • Cutting across data “silos” (and different versions of the same data) by providing a single source of truth;
  • Streamlining compliance reporting (through the use of a common data source);
  • Increasing operational and business efficiency;
  • Providing robust tools to secure and encrypt sensitive data;
  • Providing a comprehensive 360-degree view of customer data;
  • Fostering data quality and reducing the risks associated with stale or inaccurate data; and
  • Reducing operating costs associated with data management.

Not surprisingly, there’s a lot to think about when contemplating and implementing a new MDM solution. In this post, we lay out some of the most important things for financial institutions to keep in mind.

 

MDM Choice and Implementation Priorities

MDM is only as good as the data it can see. To this end, the first step is to ensure that all of the institution’s data owners are on board. Obtaining management buy-in to the process and involving all relevant stakeholders is critical to developing a viable solution. This includes ensuring that everyone is “speaking the same language”—that everyone understands the benefits related to MDM in the same way—and establishing shared goals across the different business units.

Once all the relevant parties are on board, it’s important to identify the scope of the business process within the organization that needs data refinement through MDM. Assess the current state of data quality (including any known data issues) within the process area. Then, identify all master data assets related to the process improvement. This generally involves identifying all necessary data integration for systems of record and the respective subscribing systems that would benefit from MDM’s consistent data. The selected MDM solution should be sufficiently flexible and versatile that it can govern and link any sharable enterprise data and connect to any business domain, including reference data, metadata and any hierarchies.

An MDM “stewardship team” can add value to the process by taking ownership of the various areas within the MDM implementation plan. MDM is not just about technology; it also involves business and analytical thinking around grouping data for efficient usage. Members of this team need the requisite business and technical acumen for MDM implementation to be successful. Ideally this team would be responsible for identifying data commonalities across groups and laying out a plan for consolidating them. Understanding the extent of these commonalities helps to optimize architecture-related decisions.

Architecture-related decisions are also a function of how the data is currently stored. Data stored in heterogeneous legacy systems calls for a different sort of MDM solution than does a modern data lake architecture housing big data. The solutions should be sufficiently flexible and scalable to support future growth. Many tools in the marketplace offer MDM solutions. Landing on the right tool requires a fair amount of due diligence and analysis. The following evaluation criteria are often helpful:

  • Enterprise Integration: Seamless integration into the existing enterprise set of tools and workflows is an important consideration for an MDM solution. Solutions that require large-scale customization efforts tend to carry additional hidden costs.
  • Support for Multiple Devices: Because modern enterprise data must be consumable by a variety of devices (e.g., desktop, tablet and mobile), the selected MDM architecture must support each of these platforms and have multi-device access capability.
  • Cloud and Scalability: With most of today’s technology moving to the cloud, an MDM solution must be able to support a hybrid environment (cloud as well as on-premise). The architecture should be sufficiently scalable to accommodate seasonal and future growth.
  • Security and Compliance: With cyber-attacks becoming more prevalent and compliance and regulatory requirements continuing to proliferate, the MDM architecture must demonstrate capabilities in these areas.

 

Start Small; Build Gradually; Measure Success

MDM implementation can be segmented into small, logical projects based on business units or departments within an organization. Ideally, these projects should be prioritized so that quick wins (with obvious ROI) are achieved in problem areas first, with the implementation then scaling outward to other parts of the organization. This stepwise approach may take longer overall but is ultimately more likely to be successful because it demonstrates success early and gives stakeholders confidence in MDM’s benefits.

The success of smaller implementations is easier to measure and see. A small-scale implementation also provides immediate feedback on the technology tool used for MDM—whether it’s fulfilling the needs as envisioned. The larger the implementation, the longer it takes to know whether the process is succeeding or failing and whether alternative tools should be pursued and adopted. The success of the implementation can be measured using the following criteria:

  • Savings on data storage—a result of eliminating data redundancy.
  • Increased ease of data access/search by downstream data consumers.
  • Enhanced data quality—a result of common data centralization.
  • More compact data lineage across the enterprise—a result of standardizing data in one place.

Practical Case Studies

RiskSpan has helped several large banks consolidate multiple data stores across different lines of business. Our MDM professionals work across heterogeneous data sets and teams to create a common reference data architecture that eliminates data duplication, thereby improving data efficiency and reducing redundant data. These professionals have accomplished this using a variety of technologies, including Informatica, Collibra and IBM Infosphere.

Any successful project begins with a survey of the current data landscape and an assessment of existing solutions. Working collaboratively to use this information to form the basis of an approach for implementing a best-practice MDM strategy is the most likely path to success.


Making Data Dictionaries Beautiful Using Graph Databases

Most analysts estimate that, for a given project, well over half of the time is spent collecting, transforming, and cleaning data in preparation for analysis. This task is generally regarded as one of the least appetizing portions of the data analysis process, and yet it is the most crucial, as trustworthy analyses are borne out of clean, reliable data.

Gathering and preparing data for analysis can be either enhanced or hindered by the data management practices in place at a firm. When data are readily available, clearly defined, and well documented, insights arrive faster and are of higher quality. As the size and variability of data grow, however, so too does the challenge of storing and managing them.

Like many firms, RiskSpan manages a multitude of large, complex datasets with varying degrees of similarity and connectedness. To streamline the analysis process and improve the quantity and quality of our insights, we have made our datasets, their attributes, and their relationships transparent and quickly accessible using graph database technology.

Graph databases differ significantly from traditional relational databases because data are not stored in tables. Instead, data are stored in either a node or a relationship (also called an edge), which is a connection between two nodes. The image below contains a grey node labeled as a dataset and a blue node labeled as a column. The line connecting these two nodes is a relationship which, in this instance, signifies that the dataset contains the column.

[Graph 1]

There are many advantages to this data structure, including decreased redundancy. Rather than storing the same “Column1” in multiple tables for each dataset that contains it (as you would in a relational database), you can simply create more relationships between the datasets, as demonstrated below:

[Graph 2]

With this flexible structure it is possible to create complex schemas that remain visually intuitive. In the image below, the same grey (dataset) -contains-> blue (column) format is displayed for a large collection of datasets and columns. Even at such a high level, the relationships between datasets and columns reveal patterns about the data. Here are three quick observations:

  1. In the top right corner there is a dataset with many unique columns.
  2. There are two datasets that share many columns between them and have limited connectivity to the other datasets.
  3. Many ubiquitous columns have been pulled to the center of the star pattern via the relationships to the multiple datasets on the outer rim.

[Graph 3]

In addition to containing labels, nodes can store data as key-value pairs. The image below displays the column “orig_upb” from dataset “FNMA_LLP”, one of Fannie Mae’s public datasets available on RiskSpan’s Edge Platform. Hovering over the column node displays some information about it, including the name of the field in the RiskSpan Edge Platform, its column type, format, and data type.

[Graph 4]

Relationships can also store data in the same key-value format. This is an incredibly useful property which, for the database in this example, can be used to store information specific to a dataset and its relationship to a column. One of the ways in which RiskSpan has utilized this capability is to hold information pertinent to data normalization in the relationships.

To make our datasets easier to analyze and combine, we have normalized the formats and values of columns found in multiple datasets. For example, the field “loan_channel” has been mapped from many unique inputs across datasets to a set of standardized values. In the images below, the relationships between two datasets and loan_channel are highlighted. The relationship key-value pairs contain a list of “mapped_values” identifying the initial values from the raw data that have been transformed. The dataset on the left contains the list [“BROKER”, “CORRESPONDENT”, “RETAIL”]:

[Graph 5]

while the dataset on the right contains [“R”, “B”, “C”, “T”, “9”]:

[Graph 6]

We can easily merge these lists with a node containing a map of all the recognized enumerations for the field. This central repository of truth allows us to deploy easy and robust changes to the ETL processes for all datasets. It also allows analysts to easily query information related to data availability, formats, and values.

[Graph 7]

In addition to queries specific to a column, this structure allows an analyst to answer questions about data availability across datasets with ease. Normally, comparing PDF data dictionaries, Excel worksheets, or database tables can be a painstaking process. Using the graph database, however, a simple query can return the intersection of three datasets, as shown below. The resulting graph is easy to analyze and can be used to define the steps required to obtain and manipulate the data.

[Graph 8]

Beyond these benefits for analysts and end users, graph database technology also supports data governance. Within the realm of data stewardship, ownership and accountability of datasets can be assigned and managed within a graph database like the one in this blog. The ability to store any attribute in a node and create any desired relationship makes it simple to add nodes representing data owners and curators connected to their respective datasets.

[Graph 9]

The ease and transparency with which any data-related information can be stored makes graph databases very attractive. Graph databases can also support a nearly infinite number of nodes and relationships while remaining fast. While every technology has a learning curve, the intuitive nature of graphs combined with their flexibility makes them an intriguing and viable option for data management.
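For readers curious what the dataset-intersection query described above might look like in practice, here is a rough sketch using the Neo4j Python driver and Cypher. The node labels, relationship type, connection details, and all dataset names other than FNMA_LLP are hypothetical placeholders, not RiskSpan's actual schema.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    CYPHER = """
    MATCH (d:Dataset)-[:CONTAINS]->(c:Column)
    WHERE d.name IN $datasets
    WITH c, collect(DISTINCT d.name) AS owners
    WHERE size(owners) = size($datasets)   // keep columns present in every requested dataset
    RETURN c.name AS shared_column
    ORDER BY shared_column
    """

    def shared_columns(dataset_names):
        """Return the columns common to all of the named datasets."""
        with driver.session() as session:
            result = session.run(CYPHER, datasets=dataset_names)
            return [record["shared_column"] for record in result]

    print(shared_columns(["FNMA_LLP", "DATASET_B", "DATASET_C"]))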


A Brief Introduction to Agile Philosophy

Reducing time to delivery by developing in smaller incremental chunks and incorporating an ability to pivot is the cornerstone of Agile software development methodology.

“Agile” software development is a rarity among business buzz words in that it is actually a fitting description of what it seeks to accomplish. Optimally implemented, it is capable of delivering value and efficiency to business-IT partnerships by incorporating flexibility and an ability to pivot rapidly when necessary.

As a technology company with a longstanding management consulting pedigree, RiskSpan values the combination of discipline and flexibility inherent to Agile development and regularly makes use of the philosophy in executing client engagements. Dynamic economic environments contribute to business priorities that are seemingly in a near-constant state of flux. In response to these ever-evolving needs, clients seek to implement applications and application feature changes quickly and efficiently to realize business benefits early.

This growing need for speed and “agility” makes Agile software development methods an increasingly appealing alternative to traditional “waterfall” methodologies. Waterfall approaches move in discrete phases—treating analysis, design, coding, and testing as individual, stand-alone components of a software project. Historically, when the cost of changing plans was high, such a discrete approach worked best. Nowadays, however, technological advances have made changing the plan more cost-feasible. In an environment where changes can be made inexpensively, rigid waterfall methodologies become unnecessarily counterproductive for at least four reasons:

  1. When a project runs out of time (or money), individual critical phases—often testing—must be compressed, and overall project quality suffers.
  2. Because working software isn’t produced until the very end of the project, it is difficult to know whether the project is really on track prior to project completion.
  3. Not knowing whether established deadlines will be met until relatively late in the game can lead to schedule risks.
  4. Most important, discrete phase waterfalls simply do not respond well to the various ripple effects created by change.

 

Continuous Activities vs. Discrete Project Phases

Agile software development methodologies resolve these traditional shortcomings by applying techniques that focus on reducing overhead and time to delivery. Instead of treating fixed development stages as discrete phases, Agile treats them as continuous activities. Doing things simultaneously and continuously—for example, incorporating testing into the development process from day one—improves quality and visibility, while reducing risk. Visibility improves because being halfway through a project means that half of a project’s features have been built and tested, rather than having many partially built features with no way of knowing how they will perform in testing. Risk is reduced because feedback comes in from the earliest stages of development, and changes can be made without paying exorbitant costs. This makes everybody happy.

 

Flexible but Controlled

Firms sometimes balk at Agile methods because of a tendency to equate “flexibility” and “agility” with a lack of organization and planning, weak governance and controls, and an abandonment of formal documentation. This, however, is a misconception. “Agile” does not mean uncontrolled—on the contrary, it is no more or less controlled than the existing organizational boundaries of standardized processes into which it is integrated. Most Agile methods do not advocate any particular methodology for project management or quality control. Rather, their intent is to simplify the software development approach, embrace changing business needs, and produce working software as quickly as possible. Thus, Agile frameworks are more like a shell that users have full flexibility to customize as necessary.

 

Frameworks and Integrated Teams

Agile methodologies can be implemented using a variety of frameworks, including Scrum, Kanban, and XP. Scrum is the most popular of these and is characterized by producing a potentially shippable set of functionalities at the end of every iteration in two-week time boxes called sprints. Delivering high-quality software at the conclusion of such short sprints requires supplementing team activities with additional best practices, such as automated testing, code cleanup and other refactoring, continuous integration, and test-driven or behavior-driven development.

Agile teams are built around motivated individuals subscribing to what is commonly referred to as a “lean Agile mindset.” Team members who embrace this mindset share a common vision and are motivated to contribute in ways beyond their defined roles to attain success. In this way, innovation and creativity are supported and encouraged. Perhaps most important, Agile promotes building relationships based on trust among team members and with the end-user customer in providing fast and high-quality delivery of software. When all is said and done, this is the aim of any worthwhile endeavor. When it comes to software development, Agile is showing itself to be an impressive means to this end.


Advantages and Disadvantages of Open Source Data Modeling Tools

Using open source data modeling tools has been a topic of debate as large organizations, including government agencies and financial institutions, are under increasing pressure to keep up with technological innovation to maintain competitiveness. Organizations must be flexible in development and identify cost-efficient gains to reach their organizational goals, and using the right tools is crucial. Organizations must often choose between open source software, i.e., software whose source code can be modified by anyone, and closed software, i.e., proprietary software with no permissions to alter or distribute the underlying code.

Mature institutions often have employees, systems, and proprietary models entrenched in closed source platforms. For example, SAS Analytics is a popular provider of proprietary data analysis and statistical software for enterprise data operations among financial institutions. But several core computations SAS performs can also be carried out using open source data modeling tools, such as Python and R. The data wrangling and statistical calculations are often fungible and, given the proper resources, will yield the same result across platforms.

Open source is not always a viable replacement for proprietary software, however. Factors such as cost, security, control, and flexibility must all be taken into consideration. The challenge for institutions is picking the right mix of platforms to streamline software development. This involves weighing benefits and drawbacks.

Advantages of Open Source Programs

The Cost of Open Source Software

The low cost of open source software is an obvious advantage. Compared to the upfront cost of purchasing a proprietary software license, using open source programs seems like a no-brainer. Open source programs can be distributed freely (with some possible restrictions to copyrighted work), resulting in virtually no direct costs. However, indirect costs can be difficult to quantify. Downloading open source programs and installing the necessary packages is easy and adopting this process can expedite development and lower costs. On the other hand, a proprietary software license may bundle setup and maintenance fees for the operational capacity of daily use, the support needed to solve unexpected issues, and a guarantee of full implementation of the promised capabilities. Enterprise applications, while accompanied by a high price tag, provide ongoing and in-depth support of their products. The comparable cost of managing and servicing open source programs that often have no dedicated support is difficult to determine.

Open Source Talent Considerations

Another advantage of open source is that it attracts talent drawn to the idea of shareable, community-driven code. Students and developers outside of large institutions are more likely to have experience with open source applications since access is widespread and easily available. Open source developers are free to experiment and innovate, gain experience, and create value outside of the conventional industry focus. This flexibility naturally leads to more broadly skilled, interdisciplinary developers. The chart below from Indeed’s Job Trend Analytics tool reflects strong growth in open source talent, especially Python developers.

From an organizational perspective, the pool of potential applicants with relevant programming experience widens significantly compared to the limited pool of developers with closed source experience. For example, one may be hard-pressed to find a new applicant with development experience in SAS since comparatively few have had the ability to work with the application. Key-person dependencies become increasingly problematic as the talent or knowledge of the proprietary software erodes down to a shrinking handful of developers.

Job Seekers Interests via Indeed

*Indeed searches millions of jobs from thousands of job sites. The jobseeker interest graph shows the percentage of jobseekers who have searched for SAS, R, and python jobs.


Support and Collaboration

The collaborative nature of open source facilitates learning and adapting to new programming languages. While open source programs are usually not accompanied by the extensive documentation and user guides typical of proprietary software, the constant peer review from the contributions of other developers can be more valuable than a user guide. In this regard, adopters of open source may have the talent to learn, experiment with, and become knowledgeable in the software without formal training.

Still, the lack of support can pose a challenge. In some cases, the documentation accompanying open source packages and the paucity of usage examples in forums do not offer a full picture. For example, RiskSpan built a model in R that was driven by the available packages for data infrastructure – a precursor to performing statistical analysis – and their functionality. R does not have an active support line, and receiving a response from a package’s author is unlikely. This required RiskSpan to thoroughly vet packages before relying on them.

Flexibility and Innovation

Another attractive feature of open source is its inherent flexibility. Python allows users to choose among integrated development environments (IDEs) with different characteristics and functions, whereas SAS Analytics provides only SAS Enterprise Guide or Base SAS. R makes possible web-based interfaces for server-based deployments. These functionalities grant more access to users at a lower cost, allowing broader firm-wide participation in development. The ability to change the underlying structure of open source makes it possible to mold it to the organization’s goals and improve efficiency.

Another advantage of open source is the sheer number of developers trying to improve the software by creating functionalities not found in closed source equivalents. For example, R and Python can perform many of the same functions available in SAS but also offer capabilities SAS lacks: specialized packages for industry-specific tasks, scraping the internet for data, and web development (in Python). These specialized packages are built by programmers seeking to address the inefficiencies of common problems. A proprietary software vendor has neither the expertise nor the incentive to build equivalent specialized packages, since its product aims to be broad enough to suit uses across multiple industries.

RiskSpan uses open source data modeling tools and operating systems for data management, modeling, and enterprise applications. R and Python have proven to be particularly cost effective in modeling. R provides several packages that serve specialized techniques. These include an archive of packages devoted to estimating the statistical relationship among variables using an array of techniques, which cuts down on development time. The ease of searching for these packages, downloading them, and researching their use incurs nearly no cost.

Open source makes it possible for RiskSpan to expand on the tools available in the financial services space. For example, a leading cash flow analytics software firm that offers several proprietary solutions for modeling structured finance transactions lacked the full functionality RiskSpan was seeking. To reduce licensing fees and gain flexibility in structuring deals, RiskSpan developed deal cashflow programs in Python for STACR, CAS, CIRT, and other consumer lending deals. The flexibility of Python allowed us to choose our own cashflow formats and build different functionalities into the software. Python, unlike closed source applications, allowed us to focus on innovating ways to interact with the cash flow waterfall.
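For illustration only, the snippet below shows a deliberately simplified sequential-pay waterfall of the general kind such programs interact with. It is a hypothetical sketch, not RiskSpan's deal engine, and it ignores real-deal features such as triggers, fees, and loss allocation.

    def run_waterfall(collections: float, tranches: list) -> list:
        """Distribute one period's collections: pay interest on every tranche first,
        then retire principal sequentially in order of seniority."""
        results = []
        for t in tranches:                                   # interest leg
            interest_due = t["balance"] * t["coupon"] / 12
            paid = min(collections, interest_due)
            collections -= paid
            results.append((t["name"], "interest", round(paid, 2)))
        for t in tranches:                                   # sequential principal leg
            principal_paid = min(collections, t["balance"])
            t["balance"] -= principal_paid
            collections -= principal_paid
            results.append((t["name"], "principal", round(principal_paid, 2)))
        return results

    # Hypothetical two-tranche structure and one month of collections
    deal = [
        {"name": "A", "balance": 80_000_000, "coupon": 0.030},
        {"name": "B", "balance": 20_000_000, "coupon": 0.045},
    ]
    for line in run_waterfall(1_500_000, deal):
        print(line)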

Disadvantages of Open Source Programs

Deploying open source solutions also carries intrinsic challenges. While users may have a conceptual understanding of the task at hand, knowing which tools yield correct results, whether derived from open or closed source, is another dimension to consider. Different parameters may be set as default, new limitations may arise during development, or code structures may be entirely different. Different challenges may arise from translating a closed source program to an open source platform. Introducing open source requires new controls, requirements, and development methods.

Redundant code is an issue that might arise if a firm does not use open source strategically. Across different departments, functionally equivalent tools may be derived from distinct packages or code libraries. There are several packages offering the ability to run a linear regression, for example, but nuanced differences in the initial setup or syntax of the function can propagate problems down the line. In addition to redundant code, users must be wary of “forking,” in which the development community splits over an open source project. The R community, for instance, offers multiple packages performing the same tasks or calculations, sometimes derived from the same code base, and users must confirm that the package they choose has not been abandoned by its developers.
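A concrete example of this kind of nuance: two widely used Python regression implementations treat the intercept differently by default, so seemingly equivalent code can yield different coefficients. The sketch below assumes NumPy, statsmodels, and scikit-learn are installed.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))
    y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

    # scikit-learn fits an intercept by default (fit_intercept=True)
    sk_model = LinearRegression().fit(X, y)

    # statsmodels OLS does NOT add a constant unless told to do so
    ols_no_const = sm.OLS(y, X).fit()                     # slope is distorted by the missing intercept
    ols_with_const = sm.OLS(y, sm.add_constant(X)).fit()  # now matches scikit-learn

    print("sklearn slope:        ", sk_model.coef_[0])
    print("OLS slope, no const:  ", ols_no_const.params[0])
    print("OLS slope, with const:", ols_with_const.params[1])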

Users must also take care to track the changes and evolution of open source programs. The core calculations of commonly used functions, or of those specific to regular tasks, can change. Maintaining a working understanding of these functions in the face of continual modification is crucial to ensuring consistent output. Open source documentation is frequently lacking, which can be problematic in financial services when seeking to demonstrate a clear audit trail for regulators. Verifying that a function is sourced from the intended package or repository of authored functions, rather than from another function with an identical name, imposes controls on the otherwise unfettered use of these functions within code. Proprietary software, on the other hand, provides a static set of tools, which allows analysts to more easily determine how legacy code has worked over time.

Using Open Source Data Modeling Tools

Deciding on whether to go with open source programs directly impacts financial services firms as they compete to deliver applications to the market. Open source data modeling tools are attractive because of their natural tendency to spur innovation, ingrain adaptability, and propagate flexibility throughout a firm. But proprietary software solutions are also attractive because they provide the support and hard-line uses that may neatly fit within an organization’s goals. The considerations offered here should be weighed appropriately when deciding between open source and proprietary data modeling tools.

Questions to consider before switching platforms include:

  • How does one quantify the management and service costs for using open source programs? Who would work on servicing it, and, once all-in expenses are considered, is it still more cost-effective than a vendor solution?
  • When might it be prudent to move away from proprietary software? In a scenario where moving to a newer open source technology appears to yield significant efficiency gains, when would it make sense to end terms with a vendor?
  • Does the institution have the resources to institute new controls, requirements, and development methods when introducing open source applications?
  • Does the open source application or function have the necessary documentation required for regulatory and audit purposes?

Open source is certainly on the rise as more professionals enter the space with the necessary technical skills and a new perspective on the goals financial institutions want to pursue. As competitive pressures mount, financial institutions are faced with a difficult yet critical decision of whether open source is appropriate for them. Open source may not be a viable solution for everyone—the considerations discussed above may block the adoption of open source for some organizations. However, often the pros outweigh the cons, and there are strategic precautions that can be taken to mitigate any potential risks.


References

 https://www.redhat.com/en/open-source/open-source-way

http://www.stackoverflow.blog/code-for-a-living/how-i-open-sourced-my-way-to-my-dream-job-mohamed-said

https://www.redhat.com/f/pdf/whitepapers/WHITEpapr2.pdf

http://www.forbes.com/sites/benkepes/2013/10/02/open-source-is-good-and-all-but-proprietary-is-still-winning/#7d4d544059e9

https://www.indeed.com/jobtrends/q-SAS-q-R-q-python.html


Open Source Software for Mortgage Data Analysis

While open source has been around for decades, using open source software for mortgage data analysis is a recent trend. Financial institutions have traditionally been slow to adopt the latest data and technology innovations due to the strict regulatory and risk-averse nature of the industry, and open source has been no exception. As open source becomes more mainstream, however, many of our clients have come to us with questions regarding its viability within the mortgage industry.

The short answer is simple: open source has a lot of potential for the financial services and mortgage industries, particularly for data modeling and data analysis. Within our own organization, we frequently use open source data modeling tools for our proprietary models as well as models built for clients. While a degree of risk is inherent, prudent steps can be taken to mitigate it and profit from the many worthwhile benefits of open source.


To address the common concerns that arise with open source, we’ll be publishing a series of blog posts aimed at alleviating these concerns and providing guidelines for utilizing open source software for data analysis within your organization. Some of the questions we’ll address include:

  • Can open source programming languages be applied to mortgage data modeling and data analysis?
  • What risks does open source expose me to and what can I do to mitigate them?
  • What are the pitfalls of open source and do the benefits outweigh them?
  • How does using open source software for mortgage data analysis affect the control and governance of my models?
  • What factors do I need to consider when deciding whether to use open source at my institution?

Throughout the series, we’ll also include examples of how RiskSpan has used open source software for mortgage data analysis, why we chose to use it, and what factors were considered. Before we dive in on the considerations for open source, we thought it would be helpful to offer an introduction to open source and provide some context around its birth and development within the financial industry.

What Is Open Source Software?

Software has conventionally been considered open source when the original code is made publicly available so that anyone can edit, enhance, or modify it freely. This original concept has recently been expanded to incorporate a larger movement built on values of collaboration, transparency, and community.

Open Source Software Vs Proprietary Software

Proprietary software refers to applications for which the source code is only accessible to those who created it. Thus, only the original author(s) has control over any updates or modifications. Outside players are barred from even viewing the code to protect the owners from copying and theft. To use proprietary software, users agree to a licensing agreement and typically pay a fee. The agreement legally binds the user to the owners’ terms and prevents the user from any actions the owners have not expressly permitted.

Open source software, on the other hand, gives any user free rein to view, copy, or modify it. The idea is to foster a community built on collaboration, allowing users to learn from each other and build on each other’s work. As with proprietary software, open source users must still agree to a licensing agreement, but the terms differ significantly from those of a proprietary license.[1]

History of Open Source Software

The idea of open source software first developed in the 1950s, when much of software development was done by computer scientists in higher education. In line with the value of sharing knowledge among academics, source code was openly accessible. By the 1960s, however, as the cost of software development increased, hardware companies were charging additional fees for software that used to be bundled with their products.

Change came again in the 1980s. At this point, it was clear that technology and software were important factors of the growing business economy. Technology leaders were frustrated with the increasing costs of software. In 1984, Richard Stallman launched the GNU Project with the purpose of creating a complete computer operating system with no limitations on the use of its source code. In 1991, the operating system now referred to as Linux was released.

The final tipping point came in 1997, when Eric Raymond published his book, The Cathedral and the Bazaar, in which he articulated the underlying principles behind open source software. His book was a driving factor in Netscape’s decision to release its source code to the public, inspired by the idea that allowing more people to find and fix bugs will improve the system for everyone. Following Netscape’s release, the term “open source software” was introduced in 1998.

In the data-driven economy of the past two decades, open source has played an ever-increasing role. The field of software development has evolved to embrace the values of open source. Open source has made it not only possible but easy for anyone to access and manipulate source code, improving our ability to create and share valuable software.[2]

Adoption of Open Source Software in Business

The growing relevance of open source software has also changed the way large organizations approach their software solutions. While open source software was at one point rare in an enterprise’s system, it’s now the norm. A survey conducted by Black Duck Software revealed that fewer than 3% of companies don’t rely on open source at all. Even the most conservative organizations are hopping on board the open source trend.[3]

In a blog post from June 2016, TechCrunch writes:

“Open software has already rooted itself deep within today’s Fortune 500, with many contributing back to the projects they adopt. We’re not just talking stalwarts like Google and Facebook; big companies like Walmart, GE, Merck, Goldman Sachs — even the federal government — are fleeing the safety of established tech vendors for the promises of greater control and capability with open software. These are real customers with real budgets demanding a new model of software.”[4]

The expected benefits of open source software are attracting all types of institutions, from small businesses to technology giants to governments. This shift away from proprietary software in favor of open source is streamlining business operations. As more companies make the switch, those that don’t will fall behind the times and likely be at a serious competitive disadvantage.

Open Source Software for Mortgage Data Analysis

Open source software is slowly finding its way into the financial services industry as well. We’ve observed that smaller entities that don’t have the budgets to buy expensive proprietary software have been turning to open source as a viable substitute. Smaller companies are either building software in house or turning to companies like RiskSpan to achieve a cost-effective solution. On the other hand, bigger companies with the resources to spare are also dabbling in open source. These companies have the technical expertise in house and give their skilled workers the freedom to experiment with open source software.

Within our own work, we see tremendous potential for open source software for mortgage data analysis. Open source data modeling tools like Python, R, and Julia are useful for analyzing mortgage loan and securitization data and identifying historical trends. We’ve used R to build models for our clients and we’re not the only ones: several of our clients are now building their DFAST challenger models using R.
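As a small, hypothetical illustration of this kind of analysis in Python, the sketch below computes a delinquency trend by origination vintage with pandas; the column names and figures are made up rather than drawn from any specific dataset.

    import pandas as pd

    loans = pd.DataFrame({
        "vintage":    [2014, 2014, 2015, 2015, 2016, 2016],
        "orig_upb":   [210_000, 180_000, 250_000, 300_000, 275_000, 225_000],
        "dq_90_plus": [0, 1, 0, 0, 1, 0],   # 1 = loan ever 90+ days delinquent
    })

    # Loan count, average balance, and serious-delinquency rate by vintage
    trend = loans.groupby("vintage").agg(
        loan_count=("orig_upb", "size"),
        avg_upb=("orig_upb", "mean"),
        dq_rate=("dq_90_plus", "mean"),
    )
    print(trend)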

Open source has grown enough in the past few years that more and more financial institutions will make the switch. While the risks associated with open source software will continue to give some organizations pause, the benefits of open source will soon outweigh those concerns. It seems open source is a trend that is here to stay, and luckily, it is a trend ripe with opportunity.


[1] https://opensource.com/resources/what-open-source

[2] https://www.longsight.com/learning-center/history-open-source

[3] https://techcrunch.com/2016/06/19/the-next-wave-in-software-is-open-adoption-software/

[4] https://techcrunch.com/2016/06/19/the-next-wave-in-software-is-open-adoption-software/


Balancing Internal and External Model Validation Resources

The question of “build versus buy” is every bit as applicable and challenging to model validation departments as it is to other areas of a financial institution. With no “one-size-fits-all” solution, banks are frequently faced with a balancing act between the use of internal and external model validation resources. This article is a guide for deciding between staffing a fully independent internal model validation department, outsourcing the entire operation, or a combination of the two.

Striking the appropriate balance is a function of at least five factors:

  1. control and independence
  2. hiring constraints
  3. cost
  4. financial risk
  5. external (regulatory, market, and other) considerations

Control and Independence

Internal validations bring a measure of control to the operation. Institutions understand the specific skill sets of their internal validation team beyond their resumes and can select the proper team for the needs of each model. Control also extends to the final report, its contents, and how findings are described and rated.

Further, the outcome and quality of internal validations may be more reliable. Because a bank must present and defend validation work to its regulators, low-quality work submitted by an external validator may need to be redone by yet another external validator, often on short notice, in order to bring the initial external model validation up to spec.

Elements of control, however, must sometimes be sacrificed in order to achieve independence. Institutions must be able to prove that the validator’s interests are independent from the model validation outcomes. While larger banks frequently have large, freestanding internal model validation departments whose organizational independence is clear and distinct, quantitative experts at smaller institutions must often wear multiple hats by necessity.

Ultimately the question of balancing control and independence can only be suitably addressed by determining whether internal personnel qualified to perform model validations are capable of operating without any stake in the outcome (and persuading examiners that this is, in fact, the case).

Hiring Constraints

Practically speaking, hiring constraints represent a major consideration. Hiring limitations may result from budgetary or other less obvious factors. Organizational limits aside, it is not always possible to hire employees with a needed skill set at a workable salary range at the time when they are needed. For smaller banks with limited bandwidth or larger banks that need to further expand, external model validation resources may be sought out of sheer necessity.

Cost

Cost is an important factor that can be tricky to quantify. Model validators tend to be highly specialized; many typically work on one type of model, for example, Basel models. If your bank is large enough and has enough Basel models to keep a Basel model validator busy with internal model validations all year, then it may be cost effective to have a Basel model validator on staff. But if your Basel model validator is only busy for six months of the year, then a full-time Basel validator is only efficient if you have other projects that are suited to that validator’s experience and cost. To complicate things further, an employee’s cost is typically housed in one department, making it difficult from a budget perspective to balance an employee’s time and cost across departments.

If we were building a cost model to determine how many internal validators we should hire, the input variables would include:

  1. the number of models in our inventory
  2. the skills required to validate each model
  3. the risk classification of each model (i.e., how often validations must be completed)
  4. the average fully loaded salary expense for a model validator with those specific skills

Only by comparing the cost of external validations to the year-round costs associated with hiring personnel with the specialized knowledge required to validate a given type of model (e.g., credit models, market risk models, operational risk models, ALM models, Basel models, and BSA/AML models) can a bank arrive at a true apples-to-apples comparison.
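A back-of-the-envelope version of that comparison might look like the sketch below. Every figure is a hypothetical placeholder, and a fuller model would also incorporate the off-schedule and remediation risks discussed in the next section.

    # Hypothetical inputs for one model family (e.g., Basel models)
    models_in_inventory = 6
    validations_per_model_per_year = 0.5      # revalidated every two years
    weeks_per_validation = 6
    fully_loaded_salary = 220_000             # annual cost of a specialist validator
    external_cost_per_validation = 90_000     # quoted vendor price

    validations_per_year = models_in_inventory * validations_per_model_per_year
    internal_fte_needed = validations_per_year * weeks_per_validation / 46   # working weeks per year

    internal_cost = max(internal_fte_needed, 1.0) * fully_loaded_salary      # cannot hire a fraction of a person
    external_cost = validations_per_year * external_cost_per_validation

    print(f"Internal FTEs needed: {internal_fte_needed:.2f}")
    print(f"Internal cost: ${internal_cost:,.0f} vs. external cost: ${external_cost:,.0f}")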

Financial Risk

While cost is the upfront expense of internal or external model validations, financial risk accounts for the probability of unforeseen circumstances. Assume that your bank is staffed with internal validators and your team of internal validators can handle the schedule of model validations (validation projects are equally spaced throughout the year). However, operations may need to deploy a new version of a model or a new model on a schedule that requires a validation at a previously unscheduled time with no flexibility. In this case, your bank may need to perform an external validation in addition to managing and paying a fully-staffed team of internal validators.

A cost model for determining whether to hire additional internal validators should include a factor for the probability that models will need to be validated off-schedule, resulting in unforeseen external validation costs. On the other hand, a cost model might also consider the probability that an external validator’s product will be inferior and incur costs associated with required remediation.

External Risks

External risks are typically financial risks caused by regulatory, market, and other factors outside an institution’s direct control. The risk of a changing regulatory environment under a new presidential administration is always real and uncertainty clearly abounds as market participants (and others) attempt to predict President Trump’s priorities. Changes may include exemptions for regional banks from certain Dodd-Frank requirements; the administration has clearly signaled its intent to loosen regulations generally. Even though model validation will always be a best practice, these possibilities may influence a bank’s decision to staff an internal model validation team.

Recent regulatory trends can also influence validator hiring decisions. For example, our work with various banks over the past 12-18 months has revealed that regulators are trending toward requiring larger sample sizes for benchmarking and back-testing. Given the significant effort already associated with these activities, larger sample sizes could ultimately lower the number of model validations internal resources can complete per year. Funding external validations may become more expensive, as well.

Another industry trend is the growing acceptance of limited-scope validations. If only minimal model changes have occurred since a prior validation, the scope of a scheduled validation may be limited to the impact of these changes. If remediation activities were required by a prior validation, the scope may be limited to confirming that these changes were effectively implemented. This seemingly common-sense approach to model validations by regulators is a welcome trend and could reduce the number of internal and external validations required.

Joint Validations

In addition to reduced-scope validations, some of our clients have sought to reduce costs by combining internal and external resources. This enables institutions to limit hiring to validators without model-specific or highly quantitative skills. Such internal validators can typically validate a large number of lower-risk, less technical models independently.

For higher-risk, more technical models, such as ALM models, the internal validator may review the controls and documentation sufficiently, leaving the more technical portions of the validation—conceptual soundness, process verification, benchmarking, and back-testing, for example—to an external validator with specific expertise. In these cases, reports are produced jointly with internal and external validators each contributing the sections pertaining to procedures that they performed.

The resulting report often has the dual benefit of being more economical than a report generated externally and more defensible than one that relies solely on internal resources who may lack the specific domain knowledge necessary.

Conclusion

Model risk managers have limited time, resources, and budgets and face unending pressure from management and regulators. Striking an efficient resource-balancing strategy is critically important to consistently producing high-quality model validation reports on schedule and within budgets. The question of using internal vs. external model validation resources is not an either/or proposition. In considering it, we recommend that model risk management (MRM) professionals

  • consider the points above and initiate risk tolerance and budget conversations within the MRM framework.
  • reach out to vendors who have the skills to assist with your high-risk models, even if there is not an immediate need. Some institutions like to try out a model validation provider on a few low- or moderate-risk models to get a sense of their capabilities.
  • consider internal staffing to meet basic model validation needs and external vendors (either for full validations or outsourced staff) to fill gaps in expertise.

RDARR: Principles for Effective Risk Data Aggregation and Risk Reporting

Background and Impetus for RDARR

The global financial crisis revealed that many banks had inadequate practices for the timely, complete, and accurate aggregation of risk exposures. These limitations impaired their ability to generate reliable information for managing risk, especially during times of economic stress, and resulted in severe consequences for individual banks and the entire financial system.


Responding to this pervasive systemic issue, the Basel Committee on Banking Supervision (BCBS) issued the “Principles for Effective Risk Data Aggregation and Risk Reporting” (RDARR).

Objectives of RDARR

The BCBS RDARR prescribes principles (the Principles) with the objective of strengthening risk data aggregation capabilities and internal risk reporting practices. Implementation of the Principles is expected to enhance risk management and decision-making processes in order to:

  • Enhance infrastructure for reporting key information, particularly that used by the board and senior management to identify, monitor and manage risks;
  • Improve decision-making processes;
  • Enhance the management of information across legal entities, while facilitating a comprehensive assessment of risk exposures at a consolidated level;
  • Reduce the probability and severity of losses resulting from risk management weaknesses;
  • Improve the speed at which information is available and hence decisions can be made; and
  • Improve the organization’s quality of strategic planning and the ability to manage the risk of new products and services.

The Principles of RDARR

Fourteen Principles are structured in four sections:

Overarching governance and infrastructure

1. Governance
2. Architecture/Infrastructure

Risk data aggregation capabilities

3. Data Accuracy and Integrity
4. Completeness
5. Timeliness
6. Adaptability

Risk reporting practices

7. Reports Accuracy
8. Comprehensiveness
9. Clarity and Usefulness
10. Frequency
11. Distribution

Supervisory review, tools and cooperation

12. Review
13. Remediation
14. Cooperation

For each Principle, the BCBS prescribes the requirements and practices that define compliance.
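
To make the aggregation principles more concrete, the sketch below shows how simple automated checks for completeness (Principle 4) and timeliness (Principle 5) might be run against a risk dataset. The dataset, column names, and reporting date are hypothetical assumptions, not BCBS-prescribed tests.

```python
# Illustrative data-quality checks loosely mapped to Principles 4 (Completeness)
# and 5 (Timeliness). The dataset and column names are hypothetical assumptions.
import pandas as pd

exposures = pd.DataFrame({
    "obligor_id": ["A1", "A2", "A3", "A4"],
    "exposure_usd": [1_000_000, 250_000, None, 400_000],
    "as_of_date": pd.to_datetime(["2016-01-29", "2016-01-29", "2016-01-29", "2015-12-31"]),
})
report_date = pd.Timestamp("2016-01-29")

# Completeness: flag records missing a material field.
missing_exposure = exposures["exposure_usd"].isna()

# Timeliness: flag records that are stale relative to the reporting date.
stale = exposures["as_of_date"] < report_date

print(f"Records with missing exposure amounts: {int(missing_exposure.sum())}")
print(f"Records stale as of {report_date.date()}: {int(stale.sum())}")
```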

Scope of RDARR

The Principles initially apply to systemically important banks (SIBs), as designated by the Financial Stability Board (FSB), which were expected to implement them fully by January 1, 2016.

The BCBS “strongly” suggests that supervisory bodies apply the Principles to a wider range of banks, proportionate to the size, nature, and complexity of these banks’ operations.

Consistent with other recent supervisory pronouncements, we expect these principles to eventually be applied by other regulators.

Progress in Adopting RDARR

The BCBS has conducted multiple self-assessment surveys of SIBs to measure preparedness for compliance with the Principles and identify common challenges, along with potential strategies for compliance.

The survey results indicate that many banks continue to encounter difficulties in establishing strong data aggregation governance, architecture, and processes, often relying on manual workarounds. Many banks also failed to recognize that strong governance and infrastructure practices are prerequisites for compliance with the Principles.

Many banks indicated that they would be unable to comply with at least one Principle by the January 2016 deadline.

Impact of the Principles

This guidance has raised the bar for the risk data aggregation and reporting capabilities banks must maintain to measure and report their risks.

The new paradigm for risk data aggregation and risk reporting imposes many new standards, most notably:

  • A bank’s senior management should be fully aware of and understand the limitations that prevent full risk data aggregation.
  • Controls surrounding risk data need to be as robust as those applicable to accounting data.
  • Risk data should be reconciled with source systems, including accounting data where appropriate, to ensure that the risk data is accurate (a minimal sketch of such a reconciliation control appears after this list).
  • A bank should strive toward a single authoritative source for risk data for each type of risk.
  • Supervisors expect banks to document and explain all of their risk data aggregation processes whether automated or manual.
  • Supervisors expect banks to consider accuracy requirements analogous to accounting materiality.

Due to the wide and comprehensive scope of the Principles, many SIBs have struggled to identify and implement the enhancements needed to achieve full compliance.
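
As an illustration of the reconciliation standard above, the sketch below compares portfolio-level risk exposure totals against general ledger balances within a materiality-style tolerance. The system extracts, portfolio names, figures, and tolerance are hypothetical assumptions, not a prescribed control design.

```python
# Hypothetical reconciliation of risk-exposure totals against the accounting general ledger.
# System extracts, portfolio names, figures, and the tolerance are illustrative assumptions.
import pandas as pd

# Totals by portfolio as reported by the risk datamart (assumed extract).
risk_totals = pd.DataFrame({
    "portfolio": ["consumer", "commercial", "cards"],
    "risk_exposure_usd": [152_300_000, 98_750_000, 41_020_000],
})

# Corresponding balances from the general ledger (assumed extract).
gl_balances = pd.DataFrame({
    "portfolio": ["consumer", "commercial", "cards"],
    "gl_balance_usd": [152_310_000, 98_750_000, 40_890_000],
})

TOLERANCE_PCT = 0.001  # materiality-style threshold of 0.1%, purely illustrative

recon = risk_totals.merge(gl_balances, on="portfolio", how="outer")
recon["difference_usd"] = recon["risk_exposure_usd"] - recon["gl_balance_usd"]
recon["pct_difference"] = recon["difference_usd"].abs() / recon["gl_balance_usd"]
recon["breaches_tolerance"] = recon["pct_difference"] > TOLERANCE_PCT

print(recon[["portfolio", "difference_usd", "pct_difference", "breaches_tolerance"]])
```

In practice, any breach of the tolerance would be investigated and either corrected in the risk data or documented and escalated, consistent with the expectation that controls around risk data be as robust as those applied to accounting data.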

Examples of RiskSpan RDARR Assistance Include:

  • Interpret Principles and Requirements – Interpret the Principles and how they apply to your existing risk data, risk reporting, IT infrastructure, data architecture, and data quality.
  • Assess Current Capabilities – Assess your existing risk data, risk reporting, IT infrastructure, data architecture, and data quality to identify gaps in the capabilities prescribed by the Principles.
  • Develop and Implement Remediation – Develop and implement remediation plans to eliminate gaps and facilitate compliance.
  • Develop and Implement Standard Risk Taxonomies – Develop and implement standard risk taxonomies that meet the needs of risk reporting and regulatory compliance.
  • Develop or Enhance Risk Reporting – Develop automated risk reporting dashboards for market, credit, and operational risk that are supported by reliable source data.
  • Document and Assess End State RDARR – Develop good documentation of the end state to demonstrate compliance to regulators.

RiskSpan RDARR Advisory Services

Whether or not your bank is designated as a SIB, recent trends indicate that your regulator may soon expect you to apply the Principles, and you will need to enhance your RDARR capabilities proactively.

The Basel Committee on Banking Supervision Principles for Effective Risk Data Aggregation and Risk Reporting guidance has increased the burden on you for measuring and reporting risks.  This new paradigm for risk data aggregation and risk reporting imposes many new standards.

RiskSpan’s RDARR Advisory Services team has decades of finance, accounting, data, and technology expertise to help banks meet these increasing supervisory expectations.


About The Author

Steve Sloan, Director, CPA, CIA, CISA, CIDA, has extensive experience in the professional practices of risk management and internal audit, collaborating with management and audit committees to design and implement the infrastructures to obtain the required assurances over risk and controls.

He prescribes a disciplined approach, aligning stakeholders’ expectations with leading practices, to maximize the return on investment in risk functions. Steve holds a Bachelor of Science from Pennsylvania State University and has multiple certifications.

