Data issues plague organizations of all sorts and sizes. Generally, though, the bigger the dataset, and the more transformations the data goes through, the greater the likelihood of problems. Organizations take in data from many different sources, including social media, third-party vendors, and other structured and unstructured origins, resulting in massive and complex data storage and management challenges. This post presents ideas to keep in mind when seeking to address these challenges.

First, a couple of definitions:

Data quality generally refers to the fitness of a dataset for its purpose in a given context. Data quality encompasses many related aspects, including:

- Accuracy
- Completeness
- Update status
- Relevance
- Consistency across data sources
- Reliability
- Appropriateness of presentation
- Accessibility

Data lineage tracks data movement, including its origin and where it moves over time. Data lineage can be represented visually to depict how data flows from its source to its destination via various changes and hops.

The challenges facing many organizations relate to both data quality and data lineage, and a considerable amount of time and effort is spent both in tracing the source of data (i.e., its lineage) and in correcting errors (i.e., ensuring its quality). Business intelligence and data visualization tools can do a magnificent job of teasing stories out of data, but these stories are only valuable when they are true. It is becoming increasingly vital to adopt best practices to ensure that the massive amounts of data feeding downstream processes and presentation engines are both reliable and properly understood.

Financial institutions must frequently deal with disparate systems, either because of mergers and acquisitions or in order to support different product types (consumer lending, commercial banking, and credit cards, for example).
Disparate systems tend to result in data silos, and substantial time and effort must go into providing compliance reports and meeting the various regulatory requirements associated with analyzing data provenance (from source to destination). Understanding the workflow of data and the access controls around security are also vital applications of data lineage and help ensure data quality.

In addition to the obvious need for financial reporting accuracy, maintaining data lineage and quality is vital to identifying redundant business rules and data and to ensuring that reliable, analyzable data is constantly available and accessible. It also helps improve the data governance ecosystem, enabling data owners to focus on gleaning business insights from their data rather than on rectifying data issues.

Common Data Lineage Issues

A surprising number of data issues emerge simply from uncertainty surrounding a dataset’s provenance. Many of the most common data issues stem from one or more of the following categories:

- Human error: “Fat fingering” is just the tip of the iceberg. Misinterpretation and other issues arising from human intervention are at the heart of virtually all data issues.
- Incomplete data: Whether it’s drawing conclusions based on incomplete data or relying on generalizations and judgment to fill in the gaps, many data issues are caused by missing data.
- Data format: Systems expect to receive data in a certain format. Issues arise when the actual input data departs from these expectations.
- Data consolidation: Migrating data from legacy systems or attempting to integrate newly acquired data (from a merger, for instance) frequently leads to post-consolidation issues.
- Data processing: Calculation engines, data aggregators, and other programs designed to transform raw data into something more “usable” always run the risk of creating output data with quality issues.
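The “data format” and “incomplete data” categories above are the easiest to catch mechanically at the point of intake. The sketch below illustrates the idea; the schema, field names, and patterns are hypothetical assumptions for illustration, not drawn from any particular system:

```python
import re

# Hypothetical intake schema: required fields and the formats the receiving
# system expects. Field names and patterns are illustrative assumptions.
SCHEMA = {
    "account_id": re.compile(r"^[A-Z]{2}\d{6}$"),
    "amount": re.compile(r"^\d+\.\d{2}$"),
    "post_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def validate(record):
    """Flag two common issue categories: incomplete data and format departures."""
    issues = []
    for field, pattern in SCHEMA.items():
        value = record.get(field)
        if value is None or value == "":
            issues.append(f"{field}: missing")                      # incomplete data
        elif not pattern.match(str(value)):
            issues.append(f"{field}: unexpected format {value!r}")  # data format
    return issues

# A well-formed record produces no issues; a malformed one is flagged per field.
print(validate({"account_id": "AB123456", "amount": "100.00", "post_date": "2023-06-30"}))  # []
print(validate({"account_id": "123", "amount": "1,000"}))  # flags all three fields
```

In practice, checks like these belong at the ingestion boundary, so that failing records can be set aside with their provenance intact rather than silently propagating downstream.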
Addressing Issues

Issues relating to data lineage and data quality are best addressed by employing some combination of the following approaches. The specific blend of approaches depends on the types of issues and data in question, but these principles are broadly applicable.

- Employing a top-down discovery approach enables data analysts to understand the key business systems and business data models that drive an application. This approach is most effective when logical data models are linked to the physical data and systems.
- Creating a rich metadata repository for all the data elements flowing from source to destination can be an effective way of heading off potential data lineage issues. Because data lineage depends on metadata, creating a robust repository from the outset often helps preserve data lineage throughout the life cycle.
- Imposing useful data quality rules is an important element in establishing a framework in which data is always validated against a set of well-conceived business rules. Ensuring not only that data passes comprehensive rule sets but also that remediation processes are in place for dealing appropriately with data that fails quality control checks is crucial for ensuring end-to-end data quality.
- Data lineage and data quality both require continuous monitoring by a defined stewardship council to ensure that data owners are taking appropriate steps to understand and manage the idiosyncrasies of the datasets they oversee.

Our Data Lineage and Data Quality Background

RiskSpan’s diverse client base includes several large banks (which we define as banks with assets totaling in excess of $50 billion). Large banks are characterized by a complicated web of departments and sub-organizations, each offering multiple products, sometimes to the same base of customers.
Different sub-organizations frequently rely on disparate systems (sometimes due to mergers and acquisitions; sometimes simply because they develop their businesses independently of one another). Either way, data silos inevitably result.

RiskSpan has worked closely with chief data officers of large banks to help establish data stewardship teams charged with taking ownership of the various “areas” of data within the bank. This involves identifying data “curators” within each line of business to coordinate with the CDO’s office and to act as the advocate (and ultimately the responsible party) for the data they “own.” In best-practice scenarios, a “data curator” group is formed to facilitate collaboration and effective communication on data work across lines of business.

We have found that a combination of top-down and bottom-up data discovery approaches is most effective when working across stakeholders to understand existing systems and enterprise data assets. RiskSpan has helped create logical data flow diagrams (based on the top-down approach) and assisted with linking physical data models to the logical data models. We have found Informatica and Collibra tools to be particularly useful in creating data lineage, tracking data owners, and tracing data flow from source to destination. Complementing our work with financial clients to devise LOB-based data quality rules, we have built data quality dashboards using these same tools to enable data owners and curators to rectify and monitor data quality issues.

These projects typically include elements of the following components:

- Initial assessment review of the current data landscape.
- Establishment of a logical data flow model using both top-down and bottom-up data discovery approaches.
- Coordination with the CDO / CIO office to set up a data governance stewardship team and to identify data owners and curators from all parts of the organization.
- Delineation of data policies, data rules, and controls associated with different consumers of the data.
- Development of a target-state model for data lineage and data quality by outlining the process changes from a business perspective.
- Development of a future-state data architecture and associated technology tools for implementing data lineage and data quality.
- Invitation to client stakeholders to reach a consensus on the future-state model and technology architecture.
- Creation of a project team to execute data lineage and data quality projects by incorporating the appropriate resources and client stakeholders.
- Development of a change management and migration strategy to enable users and stakeholders to adopt data lineage and data quality tools.

Ensuring data quality and lineage is ultimately the responsibility of the business lines that own and use the data. Because “data management” is not the principal aim of most businesses, it often behooves them to leverage the principles outlined in this post (sometimes along with outside assistance) to implement tactics that will help ensure that the stories their data tell are reliable.
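As a closing illustration, the rules-plus-remediation pattern described under Addressing Issues might look something like the following sketch. The rule names, thresholds, and quarantine structure are hypothetical assumptions, not a prescribed rule set:

```python
# Illustrative business rules: each returns True when a record passes.
# Rule names and thresholds are assumptions for illustration only.
RULES = {
    "balance_nonnegative": lambda r: r.get("balance", 0) >= 0,
    "rate_in_range": lambda r: r.get("rate") is not None and 0 < r["rate"] < 0.25,
    "owner_assigned": lambda r: bool(r.get("data_owner")),
}

def run_quality_checks(records):
    """Route passing records downstream; quarantine failures with the rules they broke."""
    passed, quarantined = [], []
    for record in records:
        failures = [name for name, rule in RULES.items() if not rule(record)]
        if failures:
            quarantined.append({"record": record, "failed_rules": failures})
        else:
            passed.append(record)
    return passed, quarantined

records = [
    {"balance": 1000, "rate": 0.05, "data_owner": "consumer_lending"},
    {"balance": -50, "rate": 0.30, "data_owner": ""},
]
passed, quarantined = run_quality_checks(records)
print(len(passed), len(quarantined))       # 1 1
print(quarantined[0]["failed_rules"])      # all three rules failed
```

The key point is the quarantine path: records that fail checks are neither discarded nor silently passed through; they are routed, with the failed rules attached, to the responsible data owners for remediation.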