December Nugget: Horsing around with data quality

There is hardly any data management professional who is surprised at the sight of a data quality issue. We all know that in large amounts of data there might be gold, but you will likely spend most of your time digging through dirt to get to it.

What we might not realise, though, is just how many “non-data” industries are actually facing data quality issues!

We dig through the dirt because we know that successful use of data gives a competitive advantage. Industries like the financial sector, retail, health and production are all amassing large amounts of data in their effort to become digitized and data-driven. Many of us are working very hard to manage it well. Few, if any, succeed in exploiting the data’s full value potential. Data quality is an area that is becoming increasingly important, and increasingly complex.

In my simple way of thinking, good data quality is when the data is fit for its intended use. To quote the DMBoK v2: “Data is of high quality to the degree that it meets the expectations and needs of the data consumers.” Data quality can be quantified and measured along a set of dimensions, for instance completeness, uniqueness, timeliness, validity, accuracy and consistency (DAMA UK 2013/DMBoK v2). Typically, we struggle to achieve and sustain high quality throughout the data’s lifecycle when we:

  • Collect the data
  • Store and process the data
  • Analyse and implement actions based on insights
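
To make the idea of measurable dimensions a little more concrete, here is a minimal sketch of how three of them (completeness, uniqueness and validity) could be scored as simple ratios. The records, the field names and the 1 to 10 score range are made up for illustration; they are not taken from any real registry.

```python
# Hypothetical records: field names and the 1-10 score range are assumptions.
records = [
    {"horse_id": "N-001", "breed": "fjordhest", "score": 8},
    {"horse_id": "N-002", "breed": "dølahest", "score": None},
    {"horse_id": "N-002", "breed": "dølahest", "score": 7},
    {"horse_id": "N-003", "breed": "", "score": 12},
]


def filled_values(rows, field):
    """Return the values of a field that are actually filled in."""
    return [r[field] for r in rows if r.get(field) not in (None, "")]


def completeness(rows, field):
    """Share of rows where the field is filled in."""
    return len(filled_values(rows, field)) / len(rows) if rows else 1.0


def uniqueness(rows, field):
    """Share of distinct values among the filled-in values."""
    values = filled_values(rows, field)
    return len(set(values)) / len(values) if values else 1.0


def validity(rows, field, is_valid):
    """Share of filled-in values that pass a business rule."""
    values = filled_values(rows, field)
    return sum(1 for v in values if is_valid(v)) / len(values) if values else 1.0


print(completeness(records, "score"))                        # 3 of 4 rows have a score
print(uniqueness(records, "horse_id"))                       # N-002 appears twice
print(validity(records, "score", lambda s: 1 <= s <= 10))    # 12 breaks the assumed range
```

Which dimensions matter, and what counts as "valid", depends entirely on the intended use of the data, so the rule passed to `validity` is the part a data consumer would have to define.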

An interesting observation I have made is how similar the challenges of managing data are, even across entirely different industries! When companies have data landscapes of similar size and complexity, many similar challenges seem to arise.

Now, just how far do these similarities stretch? I wouldn’t be too afraid to assume that the data quality issues often observed in the finance sector are also mirrored in insurance. But how about governmental bodies that do record keeping, like the social security number records? Yes! They are there too. So I really shouldn’t have been surprised, I guess, when I found out the same applies even if your business is dealing with horses.

Photo by: The Norwegian Horse Association/Therese Selle

Horses, you say? And how exactly is that related to data? These are not two topics you often combine in one sentence. That is exactly why it is a wonderful example in support of Norway’s ambition of becoming a data-driven economy, as stated in the Report to the Parliament/white paper (Melding til Stortinget) of 26 March 2021, “Message on data driven economy and innovation.” Data is truly embedding itself in all parts of society, and the need for data literacy is ever growing. One person helping to raise the bar is Therese Selle from the Norwegian Horse Association!

Therese hasn’t worked at the Norwegian Horse Association for long yet, but she brings with her much-needed competency in data management. She has studied the preservation of Norwegian horse breeds and works as a Breeding Advisor for the Native Horse Breeds. Her most valuable weapon? Data!

Therese explains why data is important to her work as a breeding advisor: “Data is important in order to surveil and develop a system for preserving the breeds.” Preserving the native horse breeds fjord horse (fjordhest), Northlands horse (nordlandshest/lyngshest) and dole horse (dølahest) requires a lot of data. The most impressive part is the collection and ingestion of data, which is heavily influenced by centuries-old traditions: showing horses and combining the best specimens for selective breeding. All over the country, horses are presented at shows specialised by breed and gender and scored by professionals. These scores go into a central registry, along with other key data. Not many people think much about the data collection itself, or are even aware that they are in fact data producers. This, too, is very recognizable: the customer representative entering the data needed to establish a loan in the bank might be just as unaware of being a data producer as the judge marking a horse at a horse show.

The Norwegian Horse Association is responsible for keeping the national registers of horses, at the request of the Ministry of Agriculture and Food. They keep track of the total number of horses, the breeds, the births and deaths, imports and exports, and all the family history of the horses. That is quite extensive, and impressive really, once you start to understand the scope of data in a business where you would not expect to hear words such as database or data quality. The data collection and analysis serve many important purposes, such as preventing inbreeding and preserving the Norwegian horse breeds with their unique qualities and ancient history. It’s also about food security, on a mission from the Norwegian Food Safety Authority. All good, worthy causes. Now where’s the catch?

The catch is that all this data can make a really big mess! It is exactly what someone trying to report on the exact state of transactions or anti-money laundering in a bank might experience. As mentioned, there are three main stages of the data life cycle where things get messy. Let’s explore them one by one:

1. Collect the data
The data from the shows is collected through a very manual process. Sound familiar? The data is actually created first by handwriting, as a secretary writes for the judge. This is then mailed (gasp) and later manually punched into a digitized format. Don’t get me started on the format of the forms, which in addition to nice quantitative scores contain a lot of free text fields for essential information. The forms and this whole setup make us data people scream internally: “Automation! Digitization!” However, here’s where the human factor comes in: these shows are the stuff of traditions, and those traditions are an important driver of the whole preservation work. This means that change might be long overdue, but it must be handled carefully and with respect for those long-standing practices.
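
Even when the paper forms themselves stay as they are, some issues can be caught at the moment a form is keyed in. The sketch below is only an illustration of that idea: the field names, the ISO date format and the 1 to 10 score range are assumptions, not the association's actual form layout.

```python
from datetime import date

# Hypothetical form layout: these field names are assumptions for illustration.
REQUIRED_FIELDS = ("horse_id", "show_date", "score")


def text_of(form, field):
    """Return the keyed-in value as stripped text ('' when missing)."""
    value = form.get(field)
    return "" if value is None else str(value).strip()


def check_form(form):
    """Return a list of issues found in one keyed-in form (empty if none)."""
    issues = [f"missing {f}" for f in REQUIRED_FIELDS if not text_of(form, f)]

    raw_date = text_of(form, "show_date")
    if raw_date:
        try:
            date.fromisoformat(raw_date)
        except ValueError:
            issues.append("show_date is not a valid YYYY-MM-DD date")

    raw_score = text_of(form, "score")
    if raw_score:
        try:
            score = int(raw_score)
        except ValueError:
            issues.append("score is not a whole number")
        else:
            if not 1 <= score <= 10:  # assumed scoring scale
                issues.append("score outside 1-10")
    return issues


print(check_form({"horse_id": "N-001", "show_date": "2021-06-05", "score": "8"}))
print(check_form({"horse_id": " ", "show_date": "05.06.2021", "score": "11"}))
```

A check like this does not touch the tradition of handwritten judging; it only gives the person punching in the form immediate feedback, so errors are corrected while the paper original is still at hand.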

2. Store and process the data
As many organizations could groan in unison, data storage and processing can be tricky stuff, especially when you have a little history. Messy data landscapes, accumulated over time from sort-of system integrations and kind-of data transfers, are unfortunately no rare commodity. The Norwegian Horse Association stands alongside many giants with far larger cash flows that operate in much more digitized industries. They all feel the strain of trying to keep up with the dizzyingly fast development in technology that impacts data. This means that data quality takes a hit, often manifesting itself in the inability to use historical records properly because they are inconsistent with more recent data. Hefty clean-ups (which usually cost more than can or will be funded) and some serious data scouring can in lucky cases help out, but the organization rarely has time to prioritize this. We are already sprinting to keep up with the newest tech development!

3. Analyse and implement actions based on insights
The quantitative data from the shows might not be as consistent or complete as one might wish, but at least it is easy to aggregate, compare and drill down into. Not so much when it comes to the qualitative data! The vast amount of unstructured data holds plenty of value, but that value is locked in pretty hard without specialized tools or the capacity to exploit the information in a large and systematic way. It is exactly like a lot of agreement and customer-related data in other industries: locked in unstructured documents stored deep in a dark corner.

In conclusion, it’s fascinating, and both depressing and uplifting, to see how similar data quality issues are across all kinds of vastly different industries. Depressing because a lot of us are struggling with some pretty challenging issues, uplifting because there is huge potential in us coming together to think up good solutions! Moreover, I am intrigued to learn that not only can the screaming need for more data-savvy professionals guarantee you pretty good job security over the next decade(s), but you might also be able to work with some pretty niche topics you can geek out on. Who knew that horse fans could combine their hobby with working as a data management professional? Most likely, you can find a data spin on your hobby too! All in all, we can safely say that the future holds lots of opportunity for those who are not horsing around with their data quality.

How many of the data quality challenges and strengths can you recognize in your organization?

Quick overview of data management at The Norwegian Horse Association:

  • Purpose: Manage the Norwegian horse breeds and report to the Ministry of Agriculture and Food. Pedigree records keeper for many breeds, records keeper for all horses in Norway.
  • Data: Pedigree data, horse show scores
  • Data consumers: Ministry of Agriculture and Food, The Norwegian Food Safety Authority, researchers, horse breed associations, horse breeders, horse owners.
  • Data producers: Horse show judges, horse breeders, horse owners, The Norwegian Horse Association
  • Data Quality challenges: Manual data entry with numerous steps before digitization, lacking data integrations, missing registrations, resistance among data producers to changing the processes that affect data collection, a complex data landscape that makes it difficult to agree on a target (to-be) blueprint
  • Data Quality strengths: Long traditions and strong subject matter expertise for the critical data elements, long history of data, large datasets with substantial analytical value potential