The quality of data is both the most common and the most complex data challenge facing organisations today. Virtually every organisation I encounter feels that its data needs to be ‘better’, and some of them are investing significant time and effort to achieve that. The term Data Quality is appearing in more and more job titles across the HE sector, and the Data Futures programme is – or at least should be – forcing HE providers to re-think their approach to managing data quality.
What is data quality and why are we never quite satisfied with it?
To understand data quality we first need to think about the nature of data itself. Data exists as a way of describing the world in order to undertake a particular task or function. The world is messy and complex, ever changing and prone to throwing up unexpected challenges and scenarios; data is rigid and relatively simple, and it usually requires things to be classified in relatively simple ways. Data needs to represent the world well enough for the given task or function to work successfully, so ‘data quality’ should mean fitness for purpose for a defined use.
Data failures are often complex and nuanced. Because data tends towards neat classifications we often think of data quality in a similarly neat way – it’s right or it’s wrong. Data items where quality can genuinely be described in these terms are relatively rare; date of birth is one such example. But even this example needs to be sensitive to the purpose to which the data is put. For example, a statistical analysis of students’ ages across the sector is unlikely to be significantly disrupted if one out of thousands of records has an incorrect date of birth – especially if the date is only wrong by a small amount. Conversely, an administrative process that depends on the student’s age could fail if the date of birth is not recorded correctly.
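The point can be made concrete with a small sketch. The records and dates below are entirely hypothetical, but they show how the same one-day error is invisible to a statistical use yet flips the outcome of an administrative one:

```python
from datetime import date

def age_on(dob: date, on: date) -> int:
    """Whole years between a date of birth and a reference date."""
    return on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))

census = date(2024, 9, 1)
true_dob = date(2006, 9, 1)   # this student turns 18 exactly on census day
wrong_dob = date(2006, 9, 2)  # the same student, recorded one day late

# Statistical use: a one-day slip in one record barely moves a cohort average.
# Administrative use: the same slip flips an 18-or-over eligibility check.
print(age_on(true_dob, census) >= 18)   # True
print(age_on(wrong_dob, census) >= 18)  # False
```

The data is identical in both cases; only the purpose changes, and with it the verdict on quality.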
How does data go wrong?
Data can go wrong – sorry, I mean ‘not be fit for purpose’ – in three broad ways.
Data can fail by specification when the design of the data structures does not allow the data to describe the world in a way that is fit for the intended use. This includes the design of data models (i.e. defining the entities, their relationships and their attributes correctly), the creation of code lists for individual fields, and the format of fields. It is often the case that the world turns out to be more complicated than we first imagine and the rare or unusual cases need to be accommodated somehow in the data; anybody with a surname of more than 255 characters will have a problem with the specification of many HE datasets.
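A specification failure looks something like this sketch. The length limit is a hypothetical (if common) choice; the point is that once the design is fixed, a real-world case outside it can only be truncated or rejected, and fidelity is lost either way:

```python
MAX_SURNAME_LENGTH = 255  # an arbitrary limit baked into the specification

def surname_problems(surname: str) -> list[str]:
    """Return the specification problems with a supplied surname, if any."""
    problems = []
    if len(surname) > MAX_SURNAME_LENGTH:
        # The world turned out to be bigger than the design anticipated:
        # the record cannot be stored faithfully at all.
        problems.append(f"surname exceeds {MAX_SURNAME_LENGTH} characters")
    return problems

print(surname_problems("Smith"))    # an ordinary case passes
print(surname_problems("A" * 300))  # the rare case the design never allowed for
```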
Data can fail by implementation when the way the system or process is delivered compromises the achievement of the defined purpose. This can be due to issues of scope or timeliness: records are missing, or the data arrives too late to achieve the stated purpose.
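Both kinds of implementation failure are mechanical to detect once you know what was expected. A minimal sketch, with a made-up deadline and register of arrivals:

```python
from datetime import date

def implementation_errors(expected_ids: set[str],
                          received: dict[str, date],
                          deadline: date) -> list[str]:
    """Report records that are missing, or that arrived too late to be useful."""
    errors = []
    for student_id in sorted(expected_ids - received.keys()):
        errors.append(f"{student_id}: record missing")
    for student_id, arrived in sorted(received.items()):
        if arrived > deadline:
            errors.append(f"{student_id}: arrived {arrived}, after the deadline")
    return errors

print(implementation_errors(
    expected_ids={"S1", "S2", "S3"},
    received={"S1": date(2024, 10, 1), "S3": date(2024, 12, 1)},
    deadline=date(2024, 10, 31),
))
```

Note that both checks depend on knowing the purpose: a ‘deadline’ or an ‘expected’ population only exists relative to the use the data is meant to serve.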
Even if specification and implementation are good, data can still be the victim of a process failure when the processes that interact with the data fail to operate correctly. This can lead to data that fails to conform to the data specification (which is often easy to spot) or data that conforms but is simply wrong (which is often impossible to spot).
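That asymmetry is worth dwelling on, and a sketch makes it plain. The code list and record below are hypothetical; the conformance checks catch anything outside the specification, but a mistyped value that is still valid sails straight through:

```python
from datetime import date

VALID_MODES = {"FT", "PT"}  # a hypothetical code list for mode of study

def conformance_errors(record: dict) -> list[str]:
    """Checks against the specification -- these failures are easy to spot."""
    errors = []
    if record.get("mode") not in VALID_MODES:
        errors.append(f"mode {record.get('mode')!r} not in code list")
    if not isinstance(record.get("date_of_birth"), date):
        errors.append("date_of_birth is not a valid date")
    return errors

# A record can pass every conformance check and still be wrong: a mistyped
# but perfectly valid date of birth is invisible to the specification.
plausible_but_wrong = {"mode": "FT", "date_of_birth": date(2001, 5, 14)}
print(conformance_errors(plausible_but_wrong))  # [] -- nothing to spot
```

No amount of validation against the specification can distinguish a correct record from a plausible-looking incorrect one; that takes a check against the world itself.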
Data re-use and the dangers of creep
Capturing and processing data can be expensive, and the desire to extract more value from data assets often leads to the re-use and re-purposing of data. On the face of it this seems like a laudable aim, but if the data has been designed, processed and quality-assured for one purpose we cannot assume it will be fit for other purposes. This kind of purpose-creep can give a dataset a bad name and is, in my experience, the source of a lot of data quality issues.
Data quality – science or art?
Although we tend to think of data as hard and scientific, data can fail in ways that are often complex, nuanced and soft. Understanding data quality and the ways in which data fails can feel a lot more like an art than a science, and we need to establish this understanding if we are to manage and improve data quality within our organisations.
Next time: Managing data quality
Andy Youell is a writer, speaker and strategic data advisor. Formerly the Director of Data Policy and Governance at HESA, Andy has been at the leading edge of data issues across higher education for over 25 years. His work has covered all aspects of the data and systems lifecycle and in recent years has focussed on improving the HE sector’s relationship with data. Follow him on Twitter @AndyYouell