Journal:Towards a contextual approach to data quality

From LIMSWiki
Revision as of 18:25, 11 October 2020 by Shawndouglas
Full article title Towards a contextual approach to data quality
Journal Data
Author(s) Canali, Stefano
Author affiliation(s) Leibniz University Hannover
Primary contact Email: stefano dot canali at philos dot uni-hannover dot de
Year published 2020
Volume and issue 5(4)
Article # 90
DOI 10.3390/data5040090
ISSN 2306-5729
Distribution license Creative Commons Attribution 4.0 International
Website https://www.mdpi.com/2306-5729/5/4/90/htm
Download https://www.mdpi.com/2306-5729/5/4/90/pdf (PDF)

Abstract

This essay delves into the need for a framework for approaching data quality in the context of scientific research. First, the concept of "quality" as a property of information, evidence, and data is presented, and research on the philosophy of information, science, and biomedicine is reviewed. Based on this review, the need for a more purpose-dependent and contextual approach to data quality in scientific research is argued, whereby the quality of a dataset depends on the context in which the dataset is used as much as on the dataset itself. The rationale for the approach is then exemplified by discussing current critiques of and debates on scientific quality, thus showcasing how data quality can be approached contextually.

Keywords: research data management, scientific epistemology, data quality, FAIR, reproducibility crisis

Introduction

Determining the quality of scientific data is a task of key importance for any research project and involves considerations at conceptual, practical, and methodological levels. The task has arguably become even more pressing in recent years, as a result of the ways in which the volume, variety, value, volatility, veracity, and validity of scientific data have changed with the rise of data-intensive methods in the sciences.[1] At the start of the last decade, many commentators argued that these changes would bring dramatic shifts to the scientific method and would per se make science better, thanks to fully automated reasoning, more data-driven methods, less theorizing, and more objectivity.[2] However, analyses of the use of data-intensive methods in the sciences have shown that the feasibility and benefits of these methods are not automatic results of these changes, but crucially rest upon the transparency, validity, and quality of data practices.[3] As a consequence, there are currently various attempts at implementing guidelines to maintain and promote the quality of datasets, developing ways and tools to measure it, and conceptualizing the notion of quality.[4][5][6]

This essay focuses on the last of these lines of research and discusses the following question: what are high-quality data? At the essay's core is a framework for data quality that suggests a contextual approach, whereby quality should be seen as a result of the context where a dataset is used, and not only of the intrinsic features of the data. This approach is based on the integration of philosophical discussions on the quality of data, information, and evidence. The next section begins by reviewing analyses of quality in different areas of philosophical research, particularly in the philosophy of information, science, and biomedicine. Then, shared results from this review are identified and integrated, with those results arguably pointing towards the need for a contextual approach. A discussion of what the approach entails and how it can be used in practice follows, looking at current debates on quality in the scientific and philosophical literature. Finally, the conclusion summarizes the discussion and proposes directions for future research.

Quality as a property of information, data, and evidence

Quality has been discussed in areas of philosophical work highly engaged with research practices and debates in the sciences. In this context, three main areas of research were identified, whose results are particularly significant for conceptualizations of quality and yet have only partially been applied to issues in data quality. These results and their integration as important contributions for more general and interdisciplinary discussions on data quality are worthy of discussion. As such, this essay proposes that quality can be discussed as a property of three closely related notions: information, data, and evidence.

Information

First, research on quality has traditionally focused on information quality, which became prominent in computer science in the 1990s. In this context, an influential line of research started to move beyond traditional interpretations of quality solely in terms of accuracy, developing a multi-dimensional and purpose-dependent view whereby a piece of information is of high quality insofar as it is fit for a certain purpose.[7] This line of research has developed into two main approaches since the 1990s: surveying opinions and definitions of academics and practices from an “empirical” point of view; and studying the different dimensions of quality and interrelations between these from a theoretical and “ontological” perspective.[8] The empirical approach has expanded conceptualizations of information quality to include not only traditional dimensions such as accuracy, but also objectivity, completeness, relevance, security, access, and timeliness; here, the goal has primarily been to categorize these dimensions, rather than to define them.[9] On the other hand, the goal of the ontological approach has been to understand how to connect different dimensions of information quality (such as those surveyed through the empirical approach[10]) and conceptualize and measure potential disconnections as errors.[11]

These discussions have been picked up and analyzed in the area of research known as "philosophy of information." According to Phyllis Illari and Luciano Floridi, computer science has not fully embraced the purpose-dependent approach to information quality in all of its implications, and theoretical understandings of information quality are still in search of a way of applying the approach to concrete contexts.[6] With these problems and goals in mind, Illari has suggested that information quality suffers from a "rock-and-a-hard-place" problem.[12] While information quality is defined as information that is fit for purpose, many still think that some aspects and dimensions of information quality should be independent of specific purposes (the rock). At the same time, there is a sense in which quality should make information fit for multiple if not all purposes; a piece of information that is fit for a specific purpose, but not for others, will not be considered of high quality (the hard place). As a way of going beyond the impasse, Illari has argued that we should classify information quality on the basis of a relational model, which links the different dimensions of quality to specific purposes and uses.[12] Therefore, Illari conceives of quality as a property of information that is highly dependent on its context, i.e., the specific uses, aims, and purposes we want to employ a piece of information for. In other words, quality cannot be independent of fit for a specific purpose and cannot consist in a single fit-for-any purpose.
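The relational idea can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the `Purpose` class, the `shortfalls` function, the dimension names, and all numeric scores and thresholds are hypothetical and are not taken from Illari's model. It shows the same piece of information, with the same scores on each dimension, being fit for one purpose and unfit for another.

```python
from dataclasses import dataclass, field


@dataclass
class Purpose:
    """A hypothetical purpose, with the minimum score (0 to 1) it requires per dimension."""
    name: str
    requirements: dict = field(default_factory=dict)


def shortfalls(scores, purpose):
    """Return the dimensions on which the scores fall below what the purpose requires.

    An empty result means the information is fit for this particular purpose.
    Dimensions missing from `scores` are treated as scoring 0.0.
    """
    gaps = {}
    for dimension, needed in purpose.requirements.items():
        actual = scores.get(dimension, 0.0)
        if actual < needed:
            gaps[dimension] = (actual, needed)
    return gaps


# Illustrative scores for one piece of information (assumed values).
scores = {"accuracy": 0.9, "completeness": 0.6, "timeliness": 0.3}

trend_analysis = Purpose("historical trend analysis", {"accuracy": 0.8, "completeness": 0.5})
monitoring = Purpose("real-time monitoring", {"accuracy": 0.8, "timeliness": 0.9})

print(shortfalls(scores, trend_analysis))  # {}
print(shortfalls(scores, monitoring))      # {'timeliness': (0.3, 0.9)}
```

In this sketch, quality is not a single number attached to the information; it is only ever reported relative to a named purpose, which is one way of reading the relational model.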

Data

Here, a similar push for the purpose-dependent and contextual approach has been identified in a second area of philosophical analyses, which have more specifically focused on the use of data in the context of scientific practice. The increasing volume and variety of data used in the sciences—with related and different levels of veracity, validity, volatility, and value—have created a number of potential benefits as well as challenges for scientific epistemology.[13] Determining and assessing quality is one of the main challenges of data-intensive science because of the diversity of sources of data and integration practices, the often short “timespan” and relevance of data, the difficulties of providing quality assessments and evaluations in a timely manner, and the overall lack of unified standards.[4]

Partly as a result of these shifts, philosophers of science have recently expanded their focus on data as an important component of scientific epistemology.[14] In this context, some analyses have focused on the tools that are used to calibrate, standardize, and assess the quality of data in the sciences. For instance, data quality assessment tools are often applied to clinical studies, in the form of scales or checklists about specific aspects of the study, with the goal of checking whether the study, e.g., makes use of specific statistical methods, sufficiently describes subject withdrawal, etc. According to Jacob Stegenga, there are two main issues affecting the use of these tools in the biomedical context: a poor level of inter-rating operability, i.e., different users of the tools achieve different instead of similar results; and a low level of inter-tool operability, i.e., different types of tools give different instead of similar results when assessing the same study.[15] Stegenga has argued that this can be conceptualized as a result of the underdetermination of the evidential significance of data: there is no uniquely correct way of estimating information quality, and different results will always be obtained in relation to the context, users, and type of study. These results can be interpreted in similar terms to the aforementioned analysis by Illari[12], as pointing to the crucial role that the context where data are analyzed and used plays in the determination of their quality. Quality is not an intrinsic property of data that only depends on the characteristics of the data themselves: quality will differ depending on contextual features, such as the tools used to assess quality, who uses them, their purposes, etc.
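Disagreement between users of the same assessment tool can be quantified. One common statistic for this is Cohen's kappa, which compares the observed agreement between two raters with the agreement expected by chance. The sketch below is an illustration only: the choice of kappa, the function name `cohens_kappa`, and the example ratings are assumptions and are not drawn from Stegenga's analysis.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who rated the same items with categorical labels.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if each rater assigned labels independently
    # at their own observed label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Two hypothetical raters applying the same checklist to four studies.
rater_a = ["high", "high", "low", "low"]
rater_b = ["high", "low", "low", "high"]
print(cohens_kappa(rater_a, rater_b))  # 0.0
```

Here the raters agree on half of the studies, which is exactly what chance alone would produce given how often each uses each label, so kappa is 0. An analogous comparison between the outputs of two different tools applied to the same studies would give a rough measure of inter-tool agreement.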

Further support for this point comes from Sabina Leonelli’s studies of data practices—particularly assessment methods—in the life sciences.[16] Leonelli has argued that existing approaches to data quality assessment mostly fail at delivering on their objectives or being actually used in standard practice, to the point that, currently, new and more recently developed technologies and techniques of data collection are used as unofficial markers for data quality. This leads to a problematic situation for the following reasons. Using technologies as markers of quality creates problematic relations with industry, whose economic interests in pushing specific and new technologies do not necessarily align with the epistemic aims of research communities. In particular, when quality standards are locked in and tied to specific technologies, researchers without access to those technologies cannot meet those standards. In this way, using technologies as proxies reduces diversity by creating systematic disadvantages towards researchers who have little access to the latest technologies, often excluding their contributions. To overcome these issues, Leonelli has argued for a different approach to quality: the quality of data is determined by the alignment and relations between data and other components of scientific research, including not only technologies but also research questions, methods, and infrastructures. This can be interpreted as another point for a purpose-driven and local approach to quality, which takes into account the contextual features of data use as much as the intrinsic characteristics of the data themselves.

These discussions align with other and close areas of philosophical research, which are focused on the history and epistemology of experimentation[17][18] and the role of measurement practices, concepts, and quantity terms.[19] In this context, measurement has been discussed as an inferential process that starts from instrument indications and results in outcomes, in the form of claims about the status of the object that is measured. In this sense, Bas van Fraassen has interpreted measurement outcomes as regions of the space of possible values identified by measurement practices, whose dependence on theory is involved at the stage of the interpretation of the outcomes as much as for their capacity of representing the objects of interest.[20] More recently, Luca Mari has argued that measurement should be discussed as a form of information gathering; on this basis, measurement and standardization practices should be seen as producers of knowledge, and their quality can be measured as the quality of the types of knowledge they produce.[21] In this direction, standardization is a type of modeling, whereby the calibration of measurement as a system of practices and conceptualizations is obtained by the specific modeling and representation of the elements involved in a specific context of those measurements.[22][23]

Evidence

The third line of philosophical research discussed here has focused on quality as a property of scientific evidence, especially in the biomedical context. This research has partly been a reaction to the rise of evidence-based medicine (EBM), an approach to medical research and practice that is based on a specific categorization and ranking of evidence. Since the 1980s, as a movement to reform medical practice and research, EBM has aimed to improve decision-making by removing the influence of subjective preferences from different stages of the process. As formulated by Sackett and colleagues, the central idea of EBM has been “the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.”[24] Practically, EBM proponents have introduced “evidence hierarchies,” which describe the assumed quality of different types of evidence and are supposed to help decision makers to project some order in the available evidence.[25] This order aligns better support for the efficacy of different interventions with better evidence types, which in the EBM context consists of evidence from randomized controlled trials or systematic reviews and meta-analyses of randomized controlled trials.

Philosophers of medicine have analyzed and criticized various tenets of EBM, including the theoretical and methodological basis of the choice of specific types of evidence as high-quality evidence[27], the exclusion and denigration of some types of evidence[26], and the ways in which hierarchies of evidence are delineated in evidence-based approaches.[27] While these analyses have not explicitly taken issue with notions of quality per se, their results are significant for this discussion on how to approach data quality. The ways in which evidence is classified and its quality is assessed in EBM seem to apply an intrinsic and universalistic approach to evidence, whereby, e.g., evidence collected through randomized controlled trials (RCTs) is the “gold standard.” This means that RCTs are normally given the highest level of quality, although this may be lowered in case of methodological problems; instead, evidence from other methods such as observational studies could be ranked as high-quality, but is automatically given a lower rank as the starting point.[28] In other words, certain methods are considered to be prima facie and epistemically superior, with a gold-like, higher value compared to alternatives.[29] The problem with this approach to the classification of evidence quality is that it is applied to most areas of biomedical research, with no consideration for specifics and different research contexts. In many areas of biomedical research, the gold standard evidence hailed by EBM often cannot be produced, but this does not necessarily mean that the evidence produced is of low quality. For example, Saana Jukola has shown that in nutrition research, RCTs cannot be conducted because of practical, ethical, and methodological aspects of this line of research.[30] In contrast to the EBM approach, in such contexts the quality of biomedical evidence is assessed against specific rather than universal hierarchies, depending on the aims and the context in which the evidence is to be used.
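The design-based ranking criticized above can be sketched in a few lines. This is a deliberately simplified illustration: the numeric ranks, the one-step penalty per methodological problem, and the names `STARTING_RANK` and `universal_rank` are assumptions made for the example and do not reproduce any actual EBM grading scheme. The possibility of upgrading non-randomized evidence is left out.

```python
# Hypothetical starting ranks by study design (4 = highest, 1 = lowest).
# The numbers are illustrative only and are not taken from any guideline.
STARTING_RANK = {"rct": 4, "observational": 2}


def universal_rank(design, methodological_problems=0):
    """Rank evidence from its design alone, lowering the rank by one per problem.

    The research context (what can feasibly or ethically be studied, and
    what the evidence is for) plays no role in the result.
    """
    return max(1, STARTING_RANK[design] - methodological_problems)


print(universal_rank("rct", methodological_problems=1))  # 3
print(universal_rank("observational"))                   # 2
```

Under these assumed numbers, a flawless observational study still ranks below an RCT with a known methodological problem, whatever the research question. That is the universalistic feature at issue: in a field such as nutrition research, where RCTs cannot be conducted, the best obtainable evidence is capped at a low rank by construction.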


References

  1. Leonelli, S. (2020). "Scientific Research and Big Data". Stanford Encyclopedia of Philosophy Archive (Summer 2020). https://plato.stanford.edu/archives/sum2020/entries/science-big-data/. 
  2. Canali, S. (2016). "Big Data, epistemology and causality: Knowledge in and knowledge out in EXPOsOMICS". Big Data & Society 3 (2). doi:10.1177/2053951716669530. 
  3. Leonelli, S. (2014). "What difference does quantity make? On the epistemology of Big Data in biology". Big Data & Society 1 (1). doi:10.1177/2053951714534395. 
  4. Cai, L.; Zhu, Y. (2015). "The Challenges of Data Quality and Data Quality Assessment in the Big Data Era". Data Science Journal 14: 2. doi:10.5334/dsj-2015-002. 
  5. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J. et al. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data 3: 160018. doi:10.1038/sdata.2016.18. PMC 4792175. PMID 26978244. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175. 
  6. Illari, P.; Floridi, L. (2014). "Chapter 2: Information Quality, Data and Philosophy". In Floridi, L., Illari, P.. The Philosophy of Information Quality. Springer International Publishing. pp. 5–23. doi:10.1007/978-3-319-07121-3. ISBN 9783319071213. 
  7. Wang, R.Y.; Reddy, M.P.; Kon, H.B. (1995). "Toward quality data: An attribute-based approach". Decision Support Systems 13 (3–4): 349–72. doi:10.1016/0167-9236(93)E0050-N. 
  8. Wang, R.Y. (1998). "A product perspective on total data quality management". Communications of the ACM 41 (2): 58–65. doi:10.1145/269012.269022. 
  9. Batini, C.; Scannapieco, M. (2006). Data Quality: Concepts, Methodologies and Techniques. Springer. ISBN 9783540331728. 
  10. Wand, Y.; Wang, R.Y. (1996). "Anchoring data quality dimensions in ontological foundations". Communications of the ACM 39 (11): 86–95. doi:10.1145/240455.240479. 
  11. Primiero, G. (2014). "Chapter 7: Algorithmic Check of Standards for Information Quality Dimensions". In Floridi, L., Illari, P.. The Philosophy of Information Quality. Springer International Publishing. pp. 107–34. doi:10.1007/978-3-319-07121-3. ISBN 9783319071213. 
  12. Illari, P. (2014). "Chapter 14: IQ: Purpose and Dimensions". In Floridi, L., Illari, P.. The Philosophy of Information Quality. Springer International Publishing. pp. 281–301. doi:10.1007/978-3-319-07121-3. ISBN 9783319071213. 
  13. Leonelli, S.; Tempini, N. (2020). Data Journeys in the Sciences. Springer. doi:10.1007/978-3-030-37177-7. ISBN 9783030371777. 
  14. Leonelli, S. (2016). Data-Centric Biology: A Philosophical Study. University of Chicago Press. ISBN 9780226416502. 
  15. Stegenga, J. (2013). "Down with the Hierarchies". Topoi 33: 313–22. doi:10.1007/s11245-013-9189-4. 
  16. Leonelli, S. (2017). "Global Data Quality Assessment and the Situated Nature of “Best” Research Practices in Biology". Data Science Journal 16: 32. doi:10.5334/dsj-2017-032. 
  17. Hacking, I. (1983). Representing and Intervening: Introductory Topics in the Philosophy of Natural Science. Cambridge University Press. doi:10.1017/CBO9780511814563. ISBN 9780511814563. 
  18. Rheinberger, H.-J. (2010). An Epistemology of the Concrete: Twentieth-Century Histories of Life. Duke University Press. ISBN 9780822345756. 
  19. Chang, H.; Cartwright, N. (2008). "Chapter 34: Measurement". In Curd, M.; Psillos, S.. The Routledge Companion to Philosophy of Science (1st ed.). Routledge. pp. 367–75. doi:10.4324/9780203000502. ISBN 9780203000502. 
  20. van Fraassen, B.C. (2008). Scientific Representation: Paradoxes of Perspective. Oxford University Press. doi:10.1093/acprof:oso/9780199278220.001.0001. ISBN 9780199278220. 
  21. Mari, L. (2003). "Epistemology of measurement". Measurement 34: 17–30. doi:10.1016/S0263-2241(03)00016-2. 
  22. Boumans, M. (2007). "Chapter 9: Invariance and Calibration". In Boumans, M.. Measurement in Economics: A Handbook. Academic Press. pp. 231–47. ISBN 9780123704894. 
  23. Tal, E. (2013). "Old and New Problems in Philosophy of Measurement". Philosophy Compass 8 (12): 1159–73. doi:10.1111/phc3.12089. 
  24. Sackett, D.L.; Rosenberg, W.M.C.; Gray, J.A.M. et al. (1996). "Evidence based medicine: What it is and what it isn't". BMJ 312: 71. doi:10.1136/bmj.312.7023.71. 
  25. Bluhm, R. (2005). "From hierarchy to network: A richer view of evidence for evidence-based medicine". Perspectives in Biology and Medicine 48 (4): 535–47. doi:10.1353/pbm.2005.0082. PMID 16227665. 
  26. Clarke, B.; Gillies, D.; Illari, P. et al. (2013). "Mechanisms and the Evidence Hierarchy". Topoi 33: 339–60. doi:10.1007/s11245-013-9220-9. 
  27. Campaner, R.; Galavotti, M.C. (2012). "Evidence and the Assessment of Causal Relations in the Health Sciences". International Studies in the Philosophy of Science 26 (1): 27–45. doi:10.1080/02698595.2012.653113. 
  28. Kerry, R.; Eriksen, T.E.; Lie, S.A.N. et al. (2012). "Causation and evidence‐based practice: An ontological review". Journal of Evaluation in Clinical Practice 18 (5): 1006–12. doi:10.1111/j.1365-2753.2012.01908.x. 
  29. Stegenga, J. (2011). "Is meta-analysis the platinum standard of evidence?". Studies in History and Philosophy of Science Part C 42 (4): 497–507. doi:10.1016/j.shpsc.2011.07.003. 
  30. Jukola, S. (2019). "On the evidentiary standards for nutrition advice". Studies in History and Philosophy of Science Part C 73: 1–9. doi:10.1016/j.shpsc.2018.05.007. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.