Data Lineage has been part of Data Management practice for many years. It did not, however, gain real traction until the last decade. The surge in popularity is mainly attributed to the exponential growth of data volumes, the growing number of data sources and the need to combine data across them, and the increased reporting requirements imposed by data regulations, in both complexity and granularity. I would argue that without proper Data Lineage no Data Governance initiative can reach its true potential.
There are many definitions of Data Lineage, but almost all agree that it describes how data is created or acquired, how it is processed, and its general journey within the organisation. Data Lineage is how we ensure Data Quality in the way we create, handle and share data, build internal trust in data, and promote Data Literacy. In general, it solidifies how we can best use data to make better business decisions.
Data Lineage has two major dimensions:
- Business or Horizontal Data Lineage, which presents the Data Lifecycle from the business perspective as information is created, processed, and presented in different formats for various purposes. It provides a high-level overview of our data flows.
- Technical or Vertical Data Lineage, which shows the physical representation of data as it is created in different applications, stored in different systems, imported into different types of data stores, and transformed and enhanced before finally being presented to the end user. It provides a more detailed view of one specific part of our data flow.
Based on the introduction above, one can see the very close relationship between Data Lineage and Data Governance. The main Data Lineage techniques used today are the following:
- Pattern Based Lineage: This technique focuses on the data itself and tries to find patterns in it based on its metadata. It ignores all transformation code, which makes the approach technology agnostic, but it risks missing business logic that resides in the code.
- Tagging Based Lineage: This technique assumes that the organisation operates on one big system that consistently tags data as it is created and as it flows through the system. It is very effective as long as the data stays within that system, but since most organisations deal with multiple systems and data sources, it is not as popular.
- Parsing Based Lineage: This is the most advanced technique and the most demanding of the three. The parsing process relies on understanding the organisation's data flows end-to-end, by analysing all the algorithms and tools used to transform the data and building a holistic lineage from them. The challenge is understanding and combining the logic built with all these different tools and programming languages.
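To make the parsing-based idea a little more concrete, here is a minimal sketch in Python. It assumes the transformation code consists of very simple `INSERT INTO ... SELECT` statements and uses regular expressions to pull out source and target tables. The table names and statements are hypothetical, and a real implementation would need a proper parser for each language and tool involved, which is exactly the challenge mentioned above.

```python
import re

# Toy statements standing in for transformation code (hypothetical table names).
STATEMENTS = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders",
    "INSERT INTO mart.sales SELECT o.id, c.region "
    "FROM staging.orders o JOIN ref.customers c ON o.customer_id = c.id",
]

# Assumption: one target per statement, and sources only appear after FROM or JOIN.
TARGET_RE = re.compile(r"INSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(sql):
    """Return (source, target) pairs for one simplified INSERT ... SELECT statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return []
    return [(source, target.group(1)) for source in SOURCE_RE.findall(sql)]


# Collect table-level lineage edges across all statements.
edges = [edge for sql in STATEMENTS for edge in extract_edges(sql)]
for source, target in edges:
    print(f"{source} -> {target}")
```

The resulting list of edges is a table-level lineage graph. Subqueries, views, column-level mappings and logic in other languages are all out of scope for this sketch.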
Some of the benefits and use cases of Data Lineage are the following:
Regulatory reporting: As mentioned in the beginning, regulatory requirements are becoming more and more complex and demanding. Moreover, they require an ever-increasing level of data granularity, quite often even for historical or archived data. By securing proper Data Lineage in an ever-growing data ecosystem where new technologies and data architectures are taking over, organisations can establish data audit trails that help them navigate the imposed regulatory requirements.
Data Virtualization Implementations: Especially in the last few years, with highly accelerated cloud adoption and the rise of hybrid Data Architectures that try to combine the Data Warehouse and Data Lake worlds, we have observed a high need for quality data from different sources that can be combined to provide new insights for the business. Such data is often delivered in different formats and levels of structure. Data Lineage acts as a tracking mechanism for the data from all these different systems. It documents when the data is created, when it is transformed and by which process, how it is formatted and, most importantly for this case, for which period of time it is valid. In this way it creates a detailed data flow map from one end to the other.
Cost & Risk Management: As the data ecosystem of each organisation grows exponentially, so does the complexity of the data solutions being developed. As data shared between different parts of the organisation, or even combined across them, becomes more complex, it becomes increasingly difficult to identify the cause of the problems that are discovered. Most organisations do not realise the cost of remedying the data issues that are discovered on a daily basis. They spend a huge number of hours as their data personnel try to manually follow the data breadcrumbs from the Power BI report all the way to the database table where the data originates. In addition, potential changes in any part of this data flow path are impossible to communicate properly to all the affected technical or business users.
Improved Business and Operational performance: As data is increasingly used to support or enhance business decisions, it also becomes increasingly reliant on the business in terms of relevance and quality. Every organisation is evolving in one way or another, and with it so are its underlying business model and its overall goals and priorities. Data Lineage is thus a vital tool for the business to communicate ongoing changes both to other business domains and to the technical staff responsible for implementing these changes in the underlying systems. Data Lineage gives organisations the ability to be more agile and to adapt more quickly to both internal and external sources of change.
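The manual tracing described under Cost & Risk Management, and the change communication described above, both come down to walking a lineage graph. The following is a minimal sketch in Python, assuming lineage has already been captured as a simple mapping from each asset to its downstream assets. All asset names are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> downstream assets.
DOWNSTREAM = {
    "db.sales_table": ["etl.clean_sales"],
    "etl.clean_sales": ["dw.sales_fact"],
    "db.customer_table": ["dw.customer_dim"],
    "dw.sales_fact": ["powerbi.revenue_report"],
    "dw.customer_dim": ["powerbi.revenue_report", "powerbi.churn_report"],
}


def invert(graph):
    """Flip the edges so each asset maps to its direct upstream assets."""
    upstream = {}
    for source, targets in graph.items():
        for target in targets:
            upstream.setdefault(target, []).append(source)
    return upstream


def reachable(graph, start):
    """Breadth-first walk returning every asset reachable from start."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


UPSTREAM = invert(DOWNSTREAM)

# Root-cause tracing: everything a report depends on.
root_causes = reachable(UPSTREAM, "powerbi.revenue_report")
# Impact analysis: everything affected by a change to a source table.
impacted = reachable(DOWNSTREAM, "db.customer_table")

print("Upstream of revenue report:", root_causes)
print("Impacted by customer table change:", impacted)
```

Walking upstream answers "where does this report's data come from?", while walking downstream answers "who needs to know about this change?".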
For Data Lineage to make a difference, however, it is vital that it is automated to as high a degree as possible. Manual application of Data Lineage might be necessary to some small degree, but it is in general counterproductive and will not contribute positively. The key aspect of data required to automate and scale Data Lineage is Metadata. Metadata management is how we can build a semantic layer on top of the physical one, providing a more accessible and auditable way of understanding our data throughout its lifecycle and as it flows through our organisation.
Metadata exists in all parts of the organisation, but most often it is either not captured at all or, even when it is, it is not being used. Manual metadata entry is very resource intensive, and all organisations will reap greater benefits by using their resources for other tasks. There are several tools that can help organisations start capturing metadata efficiently and establish automated Data Lineage. Even relatively simple tools make it possible to automate tasks that generate metadata which would otherwise need to be entered manually. For greater needs there are more advanced options that offer even greater automation possibilities using machine learning. Machine learning algorithms can scan across multiple systems, discover common business terms and link them together automatically or, in more complicated cases, make suggestions that a person can evaluate, validate and approve. Metadata and Data Lineage tools are also very good at capturing changes to the current state of data and alerting us to potential problems. It is vital to remember to treat data as an ever-evolving organism that continues to evolve as the organisation does.
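As an illustration of how even a simple tool can capture metadata automatically and flag changes, here is a minimal sketch in Python. It assumes the data lives in a SQLite database, used here only because it ships with Python, and reads table and column metadata from the database catalog instead of having someone type it in. The table names are hypothetical, and a real metadata tool would cover many more systems and attributes.

```python
import sqlite3


def capture_metadata(conn):
    """Return {table: {column: declared_type}} read from the SQLite catalog."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    snapshot = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        snapshot[table] = {col[1]: col[2] for col in columns}
    return snapshot


def diff_snapshots(before, after):
    """List column-level changes between two metadata snapshots."""
    changes = []
    for table in sorted(set(before) | set(after)):
        old, new = before.get(table, {}), after.get(table, {})
        for column in sorted(set(old) | set(new)):
            if column not in old:
                changes.append(f"added {table}.{column}")
            elif column not in new:
                changes.append(f"removed {table}.{column}")
            elif old[column] != new[column]:
                changes.append(
                    f"retyped {table}.{column}: {old[column]} -> {new[column]}")
    return changes


# Hypothetical example: snapshot, change the schema, snapshot again.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
before = capture_metadata(conn)

conn.execute("ALTER TABLE customers ADD COLUMN region TEXT")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
after = capture_metadata(conn)

changes = diff_snapshots(before, after)
for change in changes:
    print(change)
```

Running such a comparison on a schedule is one simple way to be alerted when the current state of the data changes, without anyone maintaining the metadata by hand.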
To summarise, I would like to refer to the Data, Information, Knowledge, Wisdom pyramid. Data Lineage is one of the biggest catalysts for climbing this pyramid, for establishing solid Data Governance practices that are automated and scalable and, in general, for providing the framework for better business decisions.
The DAMA professional group (Data Management Association – Norway Chapter)
DAMA's vision is to standardise and formalise data management in Norway in order to raise the level of competence and knowledge in the field. We want to facilitate the sharing of experience around data management, and we wish to have a positive social impact on society. Find links to LinkedIn, the newsletter and more on the "Get involved" page. Our LinkedIn page contains all our latest updates, and our newsletter Data Nugget provides a monthly dose of data news.