Difference between revisions of "Journal:Data without software are just numbers"

From LIMSWiki
Revision as of 17:51, 1 October 2020

Full article title Data without software are just numbers
Journal Data Science Journal
Author(s) Davenport, James H.; Grant, James; Jones, Catherine M.
Author affiliation(s) University of Bath, Science and Technology Facilities Council
Primary contact Email: J dot H dot Davenport at bath dot ac dot uk
Year published 2020
Volume and issue 19(1)
Article # 3
DOI 10.5334/dsj-2020-003
ISSN 1683-1470
Distribution license Creative Commons Attribution 4.0 International
Website https://datascience.codata.org/articles/10.5334/dsj-2020-003/
Download https://datascience.codata.org/articles/10.5334/dsj-2020-003/galley/929/download/ (PDF)

Abstract

Great strides have been made to encourage researchers to archive data created by research and provide the necessary systems to support their storage. Additionally, it is recognized that data are meaningless unless their provenance is preserved, through appropriate metadata. Alongside this is a pressing need to ensure the quality and archiving of the software that generates data, through simulation and control of experiment or data collection, and that which analyzes, modifies, and draws value from raw data. In order to meet the aims of reproducibility, we argue that data management alone is insufficient: it must be accompanied by good software practices, the training to facilitate it, and the support of stakeholders, including appropriate recognition for software as a research output.

Keywords: software citation, software management, reproducibility, archiving, research software engineer

Introduction

In the last decade, there has been a drive towards improved research data management in academia, moving away from the model of "supplementary material" that did not fit in publications, to the requirement that all data supporting research be made available at the time of publication. In the U.K., for example, the Research Councils have a Concordat on Open Research Data[1], and the E.U.’s Horizon 2020 program incorporates similar policies on data availability.[2] The FAIR principles[3], which state that data should be findable, accessible, interoperable, and re-usable, embody the philosophy underlying this: data should be preserved through archiving with a persistent identifier, they should be well described with suitable metadata, and this should be done in a way that is relevant to the domain. Together with the Open Access movement, this has profoundly transformed the availability of research and the data supporting it.

While this is a great stride towards transparency, it does not by itself improve the quality of research, and even what exactly transparency entails remains debated.[4] A common theme discussed in many disciplines is the need for a growing emphasis on "reproducibility."[5][6][7] This goes beyond data itself, requiring software and analysis pipelines to be published in a usable state alongside papers. In order to spread such good practices, a coordinated effort is needed towards training in professional programming methods in academia, recognizing the role of research software and the effort required to develop it, and storing the software instance itself as well as the data it creates and operates on.

In the next section of this article we discuss two cases where the use of spreadsheets highlights the need for programmatic approaches to analysis. In the subsequent section we review the research software engineer movement, which now has nascent organizations internationally. While some domains are adopting good practices and are at the forefront of developing them, the sector-wide approaches needed to support their general uptake are lacking; we discuss this issue in the penultimate section. We close by summarizing how data librarians and research software engineers need to work with researchers to continue to improve the situation.

When analysis "goes wrong"

The movement towards reproducible research is driven by the belief that reviewers and readers should be able to verify and readily validate the analysis workflows supporting publications. Rather than being viewed as questioning academic rigor, this concept should be embraced as a vital part of the research cycle. Here we discuss two examples which illustrate how oversights can cause issues, which ultimately should be avoidable.

How not to Excel ... at economics

Reinhart and Rogoff’s now notorious 2010 paper showed a headline figure of a 0.1% contraction for economies with debt above 90% of GDP.[8] A number of issues with their work were raised by Herndon, Ash, and Pollin[9], who were unable to reproduce the results, despite the raw data being published, since Reinhart and Rogoff's method was not fully described. Further, when the spreadsheet used for the calculation was analyzed, it was found that five countries (Australia, Austria, Belgium, Canada, and Denmark) had been incorrectly omitted from the analysis. Once these omissions and other methodological issues were addressed, the revised analysis showed 2.2% growth.

The mistakes received particular attention, with numerous articles published on the topic (e.g., Borwein & Bailey's 2013 article[10]), since the original paper was used to justify austerity policies aimed at cutting debt in the U.S., U.K., and E.U., as well as within the International Monetary Fund (IMF). The reliance of the proponents of these policies, with their economic and geopolitical results, on a flawed analysis should act as a stark warning that all researchers need to guard against error and embrace transparency.

How not to Excel ... with genes

When files are opened in Microsoft Excel, the default behaviour is to infer data types. While this may benefit general users, it is not always helpful. For example, two gene symbols, SEPT2 and MARCH1, are converted into dates, while certain identifiers (e.g., 2310009E13) are converted to floating point numbers. Although this has been known since 2004, a 2016 study by Ziemann, Eren, and El-Osta found that the issue continues to affect papers, as identified through supplementary data. Numbers have typically increased year-on-year, with 20% of papers affected on average, rising to over 30% in Nature. The problem persists even though it is sufficiently well known and pervasive that a service has been developed to identify affected spreadsheets.[11]
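
As an illustration of the kind of programmatic check that can catch this, the following Python sketch flags identifiers that a spreadsheet's type inference might reinterpret. The function names `excel_risk` and `flag_identifiers` and the two regular expressions are assumptions made for this example. They are a rough heuristic based on the examples given above, not Excel's actual inference rules, and not the method of the cited study or of the service mentioned.

```python
import re

# Heuristic patterns (assumptions for illustration, not Excel's real rules).
# Month-like prefix followed by one or two digits, e.g. SEPT2 or MARCH1.
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|JUL|JULY|AUG|SEP|SEPT|OCT|NOV|DEC)"
    r"-?\d{1,2}$",
    re.IGNORECASE,
)
# Digits, a single E, then digits, e.g. 2310009E13 (reads as scientific notation).
FLOAT_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def excel_risk(identifier):
    """Return "date", "float", or None for a single identifier string."""
    text = identifier.strip()
    if DATE_LIKE.match(text):
        return "date"
    if FLOAT_LIKE.match(text):
        return "float"
    return None


def flag_identifiers(identifiers):
    """Map each at-risk identifier to the type it might be coerced to."""
    flagged = {}
    for name in identifiers:
        risk = excel_risk(name)
        if risk is not None:
            flagged[name] = risk
    return flagged


if __name__ == "__main__":
    print(flag_identifiers(["SEPT2", "MARCH1", "2310009E13", "TP53"]))
```

A check of this sort could be run over a column of identifiers before and after any spreadsheet step, so that a change in the set of flagged values is noticed rather than silently published.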

Research software

While we stress that non-programmatic approaches such as the use of spreadsheets do not of themselves cause errors, they do compromise the ability to test and reproduce analysis workflows. Further, the publication of software is part of a wider program of transparency and open access.[12] However, if these relatively simple issues occur, we must find ways of identifying and avoiding all problems with data analysis, data collection, and experiment operation. Publication of software also makes deliberately obfuscated methods easier to identify and to discuss with authors at review.

Increasingly, research across disciplines depends upon software, used for experimental control or instrumentation, simulating models or analysis, and turning numbers into figures. It is vital that bespoke software is published alongside the journal article and the data it supports. While publication doesn’t ensure that code is correct, it does enable the reproducibility of analysis and allows experimental workflows to be checked and validated against correct or "expected" behavior. Making code available and employing good practice in its development should be the default, whether it be a million lines of community code or a short analysis script.
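
Even a short analysis script can be checked against expected behavior. The Python sketch below averages one value per group and refuses to proceed if any expected group is absent, the kind of silent omission described in the economics example above. The function name, the group labels, and the numbers are all hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
from statistics import mean


def mean_over_groups(values_by_group, expected_groups):
    """Average one value per group, failing loudly if a group is missing."""
    missing = sorted(set(expected_groups) - set(values_by_group))
    if missing:
        raise ValueError(f"missing groups: {missing}")
    return mean(values_by_group[group] for group in expected_groups)


def test_mean_over_groups():
    # Toy values, invented for this example only.
    toy = {"A": 1.0, "B": 2.0, "C": 3.0}
    assert mean_over_groups(toy, ["A", "B", "C"]) == 2.0
    # Dropping a group must raise rather than quietly change the answer.
    try:
        mean_over_groups({"A": 1.0, "B": 2.0}, ["A", "B", "C"])
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError for the missing group")


if __name__ == "__main__":
    test_mean_over_groups()
    print("all checks passed")
```

Keeping the check next to the analysis, and running it automatically, means that a reviewer or reader who obtains the published code can confirm the same behavior for themselves.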

The Research Software Engineer movement grew out of a working group of the Software Sustainability Institute[13] (SSI), which has since been a strong supporter of the U.K. Research Software Engineer Association (UKRSEA), now known as the Society of Research Software Engineering (RSE).[14] The aim has been to improve the sustainability, quality, and recognition of research software by advocating good software practice (see, e.g., Wilson et al.[15]) and career progression for its developers. Its work has resulted in recognition of the role by funders and fellowship schemes, as well as growing recognition of software as a vital part of e-infrastructure. Its success has spawned sister organizations internationally in Germany, the Netherlands, Scandinavia, and the U.S.

A 2014 survey by the SSI showed that 92% of researchers used research software, and that 69% would not be able to conduct their research without it.[16] Research software was defined as that used to generate, process, or analyze results for publication. Furthermore, 56% of researchers developed software, of whom 21% had never received any form of software training. It is clear that software underpins modern research and that many researchers are involved in development, even if it is not their primary activity.

Programmatic approaches to analysis and plotting allow for greater transparency, deliver efficiencies for researchers in academia, and, with formal training, improve employability in industry. Their adoption is further motivated by the requirements of funders and journals, which increasingly require, or at least encourage (see, e.g., the Association for Computing Machinery[17]), publication of software. This evolving landscape requires a rapid and connected response from researchers, data managers, and research software engineers if institutions are to improve software development practices in a sustainable way.

Establishing cultural change

In spite of the vital role research software plays, it largely remains undervalued, with time spent in training or development seen as detracting from the "real research." The lack of recognition starts with funders’ level of investment in the development and maintenance of code, and extends to institutions and investigators. This is compounded by the U.K.’s Research Assessment Exercises, and similar evaluations elsewhere, which have prioritized papers over all else. The result is inefficient development of new capability and inefficient introduction of new users, wasting researchers’ time and funders’ investment. The lack of recognition also ingrains bad habits, with the result that the longer researchers spend in academia, the lower their employability as software developers in industry becomes. Three areas in particular are key to securing the change in culture to mirror what has been achieved with research data management.

Training

In recent years, organizations such as Software Carpentry[18] have led the development of training material to improve the professional software development skills of researchers. Material is available under a Creative Commons licence and introduces programming skills and methods such as working with Unix, using version control, understanding programming languages, and practicing automation with Make.

The need for such training is recognized in the recent Engineering and Physical Sciences Research Council (EPSRC) call for new Centres for Doctoral Training (CDTs, one of the principal streams of research postgraduate funding in the U.K.)[19]:

It is therefore a certainty that many of the students being trained through the CDTs will be using computational and data techniques in their projects ... It is essential that they are given appropriate training so that they can confidently undertake such research in a manner that is correct, reproducible and reusable such as data curation and management.

To achieve this, there is a need to increase both the number of training sessions and the range of courses. Introductory courses alone are not sufficient to generate reproducible research, manage analysis workflows, and improve paper writing.[20] This requires additional in-depth training and mentoring to develop programming skills, including the appropriate use of version control for data management and the automation of testing. Indeed, CarpentryCon events are focusing efforts to develop courses to address these and other recommendations of Jiménez et al.[21]


References

  1. Higher Education Funding Council for England, Research Councils UK, Universities UK, Wellcome (28 July 2016). "Concordat on Open Research Data" (PDF). https://www.ukri.org/files/legacy/documents/concordatonopenresearchdata-pdf/. 
  2. Directorate-General for Research & Innovation (26 July 2016). "Guidelines on FAIR Data Management in Horizon 2020" (PDF). H2020 Programme. European Commission. https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf. Retrieved 12 August 2019. 
  3. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J. et al. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data 3: 160018. doi:10.1038/sdata.2016.18. PMC PMC4792175. PMID 26978244. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175. 
  4. Lyon, L.; Jeng, W.; Mattern, E. (2017). "Research Transparency: A Preliminary Study of Disciplinary Conceptualisation, Drivers, Tools and Support Services". International Journal of Digital Curation 12 (1): 46–64. doi:10.2218/ijdc.v12i1.530. 
  5. Chen, X.; Dallmeier-Tiessen, S.; Dasler, R. et al. (2019). "Open is not enough". Nature Physics 15: 113–19. doi:10.1038/s41567-018-0342-2. 
  6. Mesnard, O.; Barba, L.A. (2017). "Reproducible and Replicable Computational Fluid Dynamics: It’s Harder Than You Think". Computing in Science & Engineering 19 (4): 44–55. doi:10.1109/MCSE.2017.3151254. 
  7. Allison, D.B.; Shiffrin, R.M.; Stodden, V. (2018). "Reproducibility of research: Issues and proposed remedies". Proceedings of the National Academy of Sciences of the United States of America 115 (11): 2561–62. doi:10.1073/pnas.1802324115. PMC PMC5856570. PMID 29531033. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5856570. 
  8. Reinhart, C.M.; Rogoff, K.S. (2010). "Growth in a Time of Debt". American Economic Review 100 (2): 573–78. doi:10.1257/aer.100.2.573. 
  9. Herndon, T.; Ash, M.; Pollin, R. (2013). "Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff". Cambridge Journal of Economics 38 (2): 257–79. doi:10.1093/cje/bet075. 
  10. Borwein, J.; Bailey, D.H. (22 April 2013). "The Reinhart-Rogoff error – or how not to Excel at economics". The Conversation. https://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646. 
  11. Mallona, I.; Peinado, M.A. (2017). "Truke, a web tool to check for and handle excel misidentified gene symbols". BMC Genomics 18 (1): 242. doi:10.1186/s12864-017-3631-8. PMC PMC5359807. PMID 28327106. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5359807. 
  12. Munafò, M.R.; Nosek, B.A.; Bishop, D.V.M. et al. (2017). "A manifesto for reproducible science". Nature Human Behaviour 1: 0021. doi:10.1038/s41562-016-0021. 
  13. Software Sustainability Institute. "Software Sustainability Institute". https://www.software.ac.uk/. 
  14. Society of Research Software Engineering. "RSE Society of Research Software Engineering". http://rse.ac.uk/. 
  15. Wilson, G.; Bryan, J.; Cranston, K. et al. (2017). "Good enough practices in scientific computing". PLoS Computational Biology 13 (6): e1005510. doi:10.1371/journal.pcbi.1005510. PMC PMC5480810. PMID 28640806. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5480810. 
  16. Hettrick, S.; Antonioletti, M.; Carr, L. et al. (2014). "UK Research Software Survey 2014". Zenodo. doi:10.5281/zenodo.14809. 
  17. Association for Computing Machinery (2018). "Software and Data Artifacts in the ACM Digital Library". https://www.acm.org/publications/artifacts. 
  18. Software Carpentry. "Software Carpentry". https://www.software-carpentry.org/. 
  19. Engineering and Physical Sciences Research Council (February 2018). "EPSRC 2018 CDTs" (PDF). https://epsrc.ukri.org/files/funding/calls/2018/2018cdtsoutlinescall/. Retrieved 12 August 2019. 
  20. Mawdsley, D.; Haines, R.; Jay, C. (2017). "Reproducible Research is Software Engineering". RSE 2017 Conference. University of Manchester. http://idinteraction.cs.manchester.ac.uk/RSE2017Talk/ReproducibleResearchIsRSE.html#/. Retrieved 12 August 2019. 
  21. Jiménez, R.C.; Kuzak, M.; Alhamdoosh, M. et al. (2017). "Four simple recommendations to encourage best practices in research software". F1000Research. doi:10.12688/f1000research.11407.1. PMC PMC5490478. PMID 28751965. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5490478. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article lists references in alphabetical order; however, this version lists them in order of appearance, by design.