Making Data Dictionaries Beautiful Using Graph Databases
Most analysts estimate that for a given project well over half of the time is spent on collecting, transforming, and cleaning data in preparation for analysis. This task is generally regarded as one of the least appetizing portions of the data analysis process and yet it is the most crucial, as trustworthy analyses are borne out of clean, reliable data. Gathering and preparing data for analysis can be either enhanced or hindered based on the data management practices in place at a firm. When data are readily available, clearly defined, and well documented it will lead to faster and higher-quality insights. As the size and variability of data grows, however, so too does the challenge of storing and managing it. Like many firms, RiskSpan manages a multitude of large, complex datasets with varying degrees of similarity and connectedness. To streamline the analysis process and improve the quantity and quality of our insights, we have made our datasets, their attributes, and relationships transparent and quickly accessible using graph database technology. Graph databases differ significantly from traditional relational databases because data are not stored in tables. Instead, data are stored in either a node or a relationship (also called an edge), which is a connection between two nodes. The image below contains a grey node labeled as a dataset and a blue node labeled as a column. The line connecting these two nodes is a relationship which, in this instance, signifies that the dataset contains the column. There are many advantages to this data structure including decreased redundancy. Rather than storing the same “Column1” in multiple tables for each dataset that contain it (as you would in a relational database), you can simply create more relationships between the datasets demonstrated below: With this flexible structure it is possible to create complex schema that remain visually intuitive. In the image below the same grey (dataset) -contains-> blue (column) format is displayed for a large collection of datasets and columns. Even at such a high level, the relationships between datasets and columns reveal patterns about the data. Here are three quick observations:
- In the top right corner there is a dataset with many unique columns.
- There are two datasets that share many columns between them and have limited connectivity to the other datasets.
- Many ubiquitous columns have been pulled to the center of the star pattern via the relationships to the multiple datasets on the outer rim.