Journal:Promoting data sharing among Indonesian scientists: A proposal of a generic university-level research data management plan (RDMP)
Full article title | Promoting data sharing among Indonesian scientists: A proposal of a generic university-level research data management plan (RDMP) |
---|---|
Journal | Research Ideas and Outcomes |
Author(s) | Irawan, Dasapta E.; Rachmi, Cut N. |
Author affiliation(s) | Institut Teknologi Bandung, Universitas Padjadjaran |
Primary contact | Email: dasaptaerwin at outlook dot co dot id |
Year published | 2018 |
Volume and issue | 4 |
Page(s) | e28163 |
DOI | 10.3897/rio.4.e28163 |
ISSN | 2367-7163 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://riojournal.com/articles.php?id=28163 |
Download | https://riojournal.com/article/28163/download/pdf/ (PDF) |
Abstract
Every researcher needs data in their working ecosystem, but despite the resources (funding, time, and energy) spent to obtain those data, only a few researchers pay serious attention to data management. This paper describes our recommendation for a research data management plan (RDMP) at the university level. It extends our initiative, which is intended to be developed at the university or national level, and is in line with current developments in scientific practice that mandate data sharing and data re-use.
Researchers can use this article as an assessment form to describe the setting of their research and data management. Researchers can also develop a more detailed RDMP to cater to a specific project's environment. In this RDMP, we propose three levels of storage: offline working storage, offline backup storage, and online cloud backup storage located in a shared repository. We also propose two kinds of cloud repository: a dynamic repository to store live data and a static repository to keep a copy of final data.
We hope this RDMP can help solve problems of data sharing and preservation and improve researchers' awareness of data management, thereby increasing the value and impact of their research efforts.
Keywords: research data management plan, open data, data sharing, data repository, reproducible research
Introduction
Good data management supports scientific discovery[1], yet we have observed a cultural barrier to data sharing.[2] Data sharing, and the diverse perceptions of it among scientists in various fields, has been widely discussed.[3][4][5][6]
Every researcher needs data in their working ecosystem, but despite the resources (funding, time, and energy) spent to obtain those data, only a few researchers pay serious attention to data management.[7][8] A data management strategy is not just an administrative document; it also plays an important role in guiding researchers in storing, backing up, preserving, and sharing their research data in a proper and sustainable manner.
This paper describes a guideline for building a university-level research data management plan (RDMP) and how it can promote data sharing among scientists. This RDMP would be the first one to be developed at the university level in Indonesia. The project is in line with current developments in scientific practice that mandate data sharing and data re-use. The goals of this RDMP project are to build awareness of data sharing and preservation among scientists, especially academic staff, and to build a simple, practical tool to help them manage their research data. More generally, an RDMP guides researchers in managing their data, including curating, storing, sharing, and preserving it for immediate and future use.
This RDMP proposal draws largely on our experience in developing an RDMP for an international research collaboration funded by Research Councils UK (RCUK).[9]
Description
General overview
The need for a proper RDMP arose from the difficulties researchers face in finding data from other researchers or previous research and in extracting data from reports. A further problem, especially in Indonesia, is the lack of guidelines on how to manage research data appropriately, how to store them, and how to keep them available in the long run. Scientists also have trouble knowing how to re-use datasets from prior research, how to cite them in their own work, and what the limitations of such re-use are.
Given the large effort required to obtain data in terms of funding, time, and energy, data should remain usable for longer than the one or two years that we find to be the general case in the Indonesian research ecosystem (Fig. 1).[9][10][11][12] Another important point is the set of barriers to data sharing: the fear of getting scooped and the lack of knowledge concerning intellectual property rights (IPR) and data ownership. By developing this document, we aim to lower these barriers and, at the same time, offer another way to increase the value of research data beyond mainstream metrics.
How to use this article as a set of guidelines
Researchers can use this article as an assessment form to describe the setting of their research and the data management requirements of a potential funder. Researchers can also develop a more detailed RDMP to cater to a specific project's environment. They should justify the setting of their research and the funder's requirements regarding data sharing and data preservation.
Seven components in RDMP
The proposed RDMP is divided into seven components:
- Data collection
- Documentation and metadata
- Storage and backup
- Preservation
- Sharing and re-use
- Responsibilities and resources
- Ethics and legal compliance
References
Given the different nature of research, funders, and DMP standards, we refer to the following sources in developing this RDMP:
- Data sharing culture (Neylon 2017a[10], Neylon 2017b[11])
- Open data principles and reproducible research (Irawan et al. 2017[13])
- RDMP checklists or rubrics (Digital Curation Center 2014[14], Teperek et al. 2017[15], University of California Curation Center 2018[16])
- RDMP case studies from various fields of science (Neylon 2017c[17], Traynor 2017[18], Wael 2017[19], Woolfrey 2017[20])
Component 1: Data collection
What types of data will you collect, create, link to, acquire and/or record?
This RDMP covers the following types of data and documents, all of which are considered data sources:
- Raw data, which may come in the following forms:
  - any field or laboratory measurements collected during the research
  - any voice recording, and its transcript, of an interview or other data collection activity
  - any vector- or raster-based images
  - any video recording, and its text caption, of an interview or other data collection activity
  - survey form responses from participants
  - field notes or laboratory records
- Grant proposals: funders may request that researchers submit their research plan as a pre-registration document on platforms such as OSF or Curate Science
- Project-level RDMP: some funders, such as RCUK, mandate the submission of a final RDMP before the project begins
- Shared text, voice, or video recordings of communication between team members
- Reports: these may appear as a preliminary report, mid-term report, final report, or short communication
- Preprints: several funders now accept preprints as research output[21]
- Maps
What file formats will your data be collected in? Will these formats allow for data re-use, sharing and long-term access to the data?
Most researchers use Microsoft-based applications, and most open repositories accept many formats and provide native viewers for them; nevertheless, we recommend the formats below. You may refer to University of Sydney RDMP file formats or Cornell University’s preservation file formats for more information.
Spreadsheets
Spreadsheets should be saved in a text format, e.g., .csv (comma-separated values) or .txt (tab-separated values). Data creators should lay out the spreadsheet as a "database" table by:
- starting the data immediately in cell (1,1);
- avoiding merging rows or columns; and
- clearly using the correct and consistent cell format, e.g., number, string, date, time, and category.
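As an illustration only, the layout rules above could be checked automatically before a file is deposited. The following is a minimal sketch, assuming the spreadsheet has already been exported to comma-separated text with the column names in the first row; the function name `check_database_layout` and its messages are hypothetical and not part of the RDMP itself.

```python
import csv
import io


def check_database_layout(csv_text: str) -> list[str]:
    """Return a list of layout problems found in a CSV export (empty if none)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]

    # The table should start in cell (1,1), so the first cell must hold a name.
    if not header or not header[0].strip():
        problems.append("cell (1,1) is empty; the table should start there")

    # Merged columns usually show up as blank column names after export.
    blank = [i + 1 for i, name in enumerate(header) if not name.strip()]
    if blank:
        problems.append(f"blank column names at positions {blank}")

    # Every record should have as many fields as the header row.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {line_no} has {len(row)} fields, expected {len(header)}"
            )
    return problems
```

A check like this does not verify cell types (number, date, category); that step would still need a reviewer or a schema supplied by the data creator.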
Documents
We recommend a text-based (ASCII) file, e.g., .txt, Markdown, or any other text format that can be created and read using a plain text reader like Notepad.
Audio/video recordings
- Audio recordings: .wav or .mp3
- Video recordings: .mp4 or .mpg
Images and maps
- General image: .jpg, .png, .bmp, .tiff
- Raster: GeoTIFF
- Vector: .shp
Emails (project communications)
Although most researchers are now using proprietary email clients like Microsoft Outlook or Apple Mail, they still need to store selected emails in plain text as well.
What conventions and procedures will you use to structure, name and version control your files to help you and others better understand how your data are organized?
Files are uploaded to an online repository and organized into folders by phase or by work package. If the file organization becomes too complicated to fit a fixed folder structure, the files should be split into separate structures that are linked together. We recommend the following set of folders to organize the files.
root folder:
- data:
  - raw
  - processed
- analysis
- code (or script)
- tables
- figure or image
- output
- report
- presentation
- article (or manuscript)
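The recommended folders can be created in one step at the start of a project. The sketch below is an assumption-laden example: it nests "raw" and "processed" under "data" as the list indicates, places the remaining folders directly under the root, and uses shortened folder names; `create_project_skeleton` and `PROJECT_FOLDERS` are hypothetical names.

```python
from pathlib import Path

# Folder names follow the recommended list. Only "raw" and "processed" are
# nested (under "data"); keeping the other folders flat is an assumption.
PROJECT_FOLDERS = [
    "data/raw",
    "data/processed",
    "analysis",
    "code",
    "tables",
    "figure",
    "output",
    "report",
    "presentation",
    "article",
]


def create_project_skeleton(root: Path) -> list[Path]:
    """Create the recommended folders under `root` and return their paths."""
    created = []
    for name in PROJECT_FOLDERS:
        folder = root / name
        # exist_ok=True makes the function safe to run on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_project_skeleton(Path("my_project"))` would create the structure in the current directory; a field of research with its own conventions can edit the list before running it.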
Some fields of research may have other specific folder arrangements, but generally they should have the components listed above. If some team members choose to maintain a Google Drive, Dropbox, OneDrive, or other cloud service, they should create an accessible link to the drives or folders and register the links in the data repository. To accommodate limited storage, the principal investigator (PI), co-PI, and team members may also maintain an open repository, such as OSF, Figshare, Zenodo, GitHub, GitLab, or a similar service, provided that the service offers version control and access control features. All services should be linked to a central repository. The team may also maintain a dedicated project website to store the data and related research documents, to keep track of the activities, and to document the project's repository or storage structure.
Component 2: Data documentation and metadata
What documentation will be needed for the data to be read and interpreted correctly in the future?
All data will be preserved in open formats to ensure their readability in the future. Metadata should be attached to each data file or, in some instances, to a data folder. A README file should be included in the root folder, describing the folder structure and giving a general overview and some context of the data.
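A README of this kind can be drafted from the folder tree itself. The following is a rough sketch, assuming a plain-text README with three hypothetical sections (overview, context, folder structure); the section titles and the function name `build_readme` are our own illustration, not a prescribed template.

```python
from pathlib import Path


def build_readme(root: Path, overview: str, context: str) -> str:
    """Return README text with an overview, context, and the folder tree."""
    lines = [
        f"Project: {root.name}",
        "",
        "Overview",
        overview,
        "",
        "Context",
        context,
        "",
        "Folder structure",
    ]
    # List every sub-folder, indented by its depth below the root folder.
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        depth = len(folder.relative_to(root).parts) - 1
        lines.append("  " * depth + "- " + folder.name)
    return "\n".join(lines) + "\n"
```

The returned text can be written to `README.txt` in the root folder and then edited by hand to add the details that cannot be derived from the folder names.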
How will you make sure that documentation is created or captured consistently throughout your project?
All deliverables (data, reports, presentations, preprints, etc.) should be recorded, listed, and stored in the project repository. A README file may be useful to describe the context, time frame, location, structure, and status of the files. Data staff (DS) may be assigned to check the status of the documentation.
What metadata standard will be needed to describe your data?
We recommend the following minimum metadata schema for general data:
- Title of the dataset (see example)
- Abstract (to give context)
- Creator
- Contributor
- Publisher
- Funder
- Date of publication
- Resource type
- Location
- License/rights
- Data structure
- Data size
- File format
For geospatial datasets, we refer to the ISO 19115-1:2003 geospatial metadata standard, which is also used by Badan Informasi Geospasial of Indonesia (Indonesia Board of Geospatial Information). A minimum metadata schema for general dataset and general geodataset can be found here.
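To make the minimum schema for general data concrete, it can be written down as a simple record that is validated before deposit. This is a sketch under our own assumptions: every field is stored as free text, a field counts as missing when it is blank, and the class and function names (`DatasetMetadata`, `missing_fields`, `to_json`) are hypothetical. It does not implement the geospatial standard.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class DatasetMetadata:
    """One record per dataset, mirroring the minimum metadata schema."""
    title: str
    abstract: str
    creator: str
    contributor: str
    publisher: str
    funder: str
    date_of_publication: str
    resource_type: str
    location: str
    license: str
    data_structure: str
    data_size: str
    file_format: str


def missing_fields(record: DatasetMetadata) -> list[str]:
    """Return the names of schema fields that were left blank."""
    return [
        f.name for f in fields(record)
        if not str(getattr(record, f.name)).strip()
    ]


def to_json(record: DatasetMetadata) -> str:
    """Serialize the record so it can be stored next to the data file."""
    return json.dumps(asdict(record), indent=2)
```

Storing the JSON output beside each data file (or data folder) is one way to keep the metadata attached to the data it describes.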
Component 3: Storage and backup
What are the anticipated storage requirements for your project, in terms of storage space (in megabytes, gigabytes, terabytes, etc.) and the length of time you will be storing it?
We anticipate that the project will generate less than five gigabytes of data and documents. As far as possible, data will be deposited in long-term archives. A minimum of 10 years of preservation should be considered, but there are open repositories that provide longer preservation times, e.g., up to 50 years or more. Data deposition should start at the beginning of the project and be completed by the time the final report is submitted to the project funder. An embargo period (maximum of two years) may be assigned if needed. Following the end of the embargo period, the assigned data staff must keep the data publicly available for a minimum of 10 years.
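The embargo and preservation periods translate into two dates that the data staff need to track. The sketch below shows one possible calculation; it assumes the embargo starts on the date the final report is submitted and that the 10-year minimum is counted from the public release date, neither of which is fixed by this RDMP. The names `preservation_schedule` and `add_years` are hypothetical.

```python
from datetime import date

MAX_EMBARGO_YEARS = 2        # maximum embargo stated in this RDMP
MIN_PRESERVATION_YEARS = 10  # minimum preservation stated in this RDMP


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def preservation_schedule(final_report: date, embargo_years: int = 0) -> dict:
    """Return the public release date and the earliest end of preservation."""
    if not 0 <= embargo_years <= MAX_EMBARGO_YEARS:
        raise ValueError("embargo must be between 0 and 2 years")
    # Assumption: the embargo is counted from the final report date.
    public_from = add_years(final_report, embargo_years)
    # Assumption: the 10-year minimum is counted from the release date.
    keep_until = add_years(public_from, MIN_PRESERVATION_YEARS)
    return {"public_from": public_from, "keep_until": keep_until}
```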
How and where will your data be stored and backed up during your research project?
Data and documents are stored at three storage levels:
- working offline storage and at least one offline backup using a portable hard drive
- an online dynamic data repository using the university's available institutional repository and/or open repository services like the OSF (maintained by Center for Open Science), Figshare (maintained by Digital Science), or Zenodo (maintained by CERN)
- an online static data repository; an institutional repository can be used to store the final dataset and other documents
We suggest the following backup strategies:
- back up from offline working storage to portable media as soon as new data are created; a daily backup is highly recommended
- back up to cloud storage or a repository at least once a week
- team members are encouraged to use a backup application such as Apple Time Machine or FreeFileSync
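The suggested intervals can be turned into a simple reminder. The following sketch assumes the team records the time of the last backup for each target somewhere (for example, in a log file); the target names `portable_drive` and `cloud_repository` and the function `overdue_backups` are hypothetical.

```python
from datetime import datetime, timedelta

# Intervals taken from the suggested strategy: daily to portable media,
# at least weekly to cloud storage or a repository.
BACKUP_INTERVALS = {
    "portable_drive": timedelta(days=1),
    "cloud_repository": timedelta(weeks=1),
}


def overdue_backups(last_backup: dict[str, datetime], now: datetime) -> list[str]:
    """Return the backup targets whose last backup is older than allowed."""
    overdue = []
    for target, interval in BACKUP_INTERVALS.items():
        last = last_backup.get(target)
        # A target that was never backed up is always overdue.
        if last is None or now - last > interval:
            overdue.append(target)
    return overdue
```

A check like this complements, rather than replaces, a backup application; it only reports what is late.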
How will the research team and other collaborators access, modify, and contribute data throughout the project?
The research team and project participants will be granted access to the data repository and to other online services. Until the embargo period ends, access will be controlled through a unique user ID and password system. The minimum access for the above-mentioned parties will be "read-write" access, while an "administrator" role should be given to the PI and at least two other team members: one co-PI and the data staff. After the embargo period ends, the data repository will be made public.
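The access rules above amount to a small lookup. The sketch below is illustrative only: the role labels are hypothetical, and it assumes that the general public receives read-only access once the embargo has ended and no access before that, which this RDMP does not spell out.

```python
from datetime import date

# Roles that should hold the "administrator" role according to the plan.
ADMIN_ROLES = {"PI", "co-PI", "data staff"}
# Roles that receive at least "read-write" access (labels are hypothetical).
PROJECT_ROLES = {"team member", "participant"}


def access_level(role: str, today: date, embargo_end: date) -> str:
    """Return the access level for a role on a given day."""
    if role in ADMIN_ROLES:
        return "administrator"
    if role in PROJECT_ROLES:
        return "read-write"
    # Assumption: everyone else can only read, and only after the embargo.
    return "read" if today > embargo_end else "none"
```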
Component 4: Preservation
Where will you deposit your data for long-term preservation and access at the end of your research project?
Selection of material
The following final materials will be kept available in the institutional repository and the OSF dynamic repository:
- data:
  - raw data
  - final processed data
- reports:
  - preliminary report
  - mid-term report
  - final report
All intermediate and ongoing files, including data and other documents, will be made available in the OSF dynamic repository.
Preservation
Long-term preservation of publicly available data will be through appropriate repositories, including the institutional repository. More than one archive may be selected, using the LOCKSS principle or the FAIR principles for data sharing as the main criteria. In this case, we recommend the OSF dynamic repository and a static institutional repository.
Indicate how you will ensure your data is preservation-ready. Consider preservation-friendly file formats, ensuring file integrity, anonymization and de-identification, inclusion of supporting documentation.
For all data generated from research, we may ask the data creator to convert them from proprietary file formats to open formats for long-term preservation. Another option would be to have data staff (DS) assigned to work on file conversion. The data creator or DS should ensure the anonymization or de-identification of sensitive data.
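File integrity, mentioned in the question above, is commonly checked with checksums. The following is a minimal sketch using SHA-256 from the Python standard library; the choice of SHA-256, the manifest layout, and the function names are our assumptions, not a requirement of this RDMP or of any particular repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file below `root` (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return files from the manifest that are now missing or altered."""
    current = build_manifest(root)
    return sorted(
        name for name in manifest if current.get(name) != manifest[name]
    )
```

The manifest would be built when the final dataset is deposited and re-checked after each transfer or at regular intervals; it should be stored outside the folder it describes so that it is not listed as one of the files.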
Component 5: Sharing and reuse
What data will you be sharing and in what form (e.g., raw, processed, analyzed, final)?
In a general sense, we recommend sharing raw, processed, analyzed, and final datasets. However, given the nature of the project, PIs may appeal for another form of data sharing. They could complete a data assessment form in order to come up with an appropriate data sharing mechanism. PIs may have to:
- choose which types of data can be safely shared without breaching a data release agreement with other parties, and
- separate primary data obtained from another institution from the new primary data acquired by team members.
Have you considered what type of end-user license to include with your data?
We recommend using moderate licenses, e.g., a CC-BY license, MIT license, and Academic Free License as the default license for data and also for all resulting documents. However, the PI may propose another more lenient license such as the CC0 waiver or CC-BY-SA license. For sensitive data, PIs may suggest a more restrictive license.
What steps will be taken to help the research community know that your data exists?
All data and their associated repositories should be discoverable through at least one indexing service, e.g., Google Scholar. Common repositories are now accessible via BASE and ONESearch (a feature from the Indonesia National Library and Archive). So that the data can be formally cited, we also recommend the use of a persistent identifier, e.g., a DOI from CrossRef or DataCite.
Component 6: Responsibilities and resources
Identify who will be responsible for managing this project's data during and after the project and the major data management tasks for which they will be responsible.
The PI and an assigned DS are responsible for research data management. This includes converting, classifying, and managing the various research outputs identified in this RDMP throughout the research cycle and during the lifetime of the data.
How will responsibilities for managing data activities be handled in case substantive changes happen in the personnel overseeing the project's data, including a change of principal investigator?
In the case of a change of PI or DS, responsibility will be transferred to one of the co-PIs or to a DS assigned by the PI or institution.
What resources will you require to implement your data management plan? What do you estimate the overall cost for data management to be?
Aside from the data collection phase, the major costs of data management for the project are for management and storage components. The management components should be funded by the research project, while storage is the responsibility of the university, or a PI may select a free, open repository.
Component 7: Ethics and legal compliance
An intellectual property rights (IPR) officer at the university level is very much needed in this case, but researchers should also have enough basic knowledge regarding this subject.
If your research project includes sensitive data, how will you ensure that it is securely managed and accessible only to approved members of the project?
A university-level or several faculty-level data stewards (DS) should be assigned to ensure the management of sensitive data and data management in general. The access to such data may be restricted to PI, one of the co-PIs, and the DS. The DS will have a checklist form to help them assess the situation.
If applicable, what strategies will you undertake to address secondary uses of sensitive data?
Users must register to access the data or contact the university DS and fill out a sensitive data usage form. The form will then be evaluated by a university-level or faculty/school-level DS, who should also consult with the data creator or original researcher.
How will you manage legal, ethical, and intellectual property issues?
IP rights for the project are largely held by the university, or there could be joint IPR management for joint research activity. The arrangement should be clearly stated in the data agreement.
Acknowledgements
We thank the following persons for their feedback and corrections to this article: the repository team of Institut Teknologi Bandung, Willem Vervoort and Gene Melzack (from University of Sydney), Driajana, Akhmad Riqqi, and Yudi Darma (from UDARA team), and also Sarah Lindley (from University of Manchester), open science community and INArxiv preprint server users.
Hosting institution
The hosting institution is the university alone or, in the case of joint research, the institution named as host; in either case it should be clearly stated in the data sharing and ownership agreement.
Author contributions
Both authors contributed evenly to this article.
Conflicts of interest
Both authors declare no competing interests in the publication of this paper.
References
- ↑ Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J. et al. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data 3: 160018. doi:10.1038/sdata.2016.18.
- ↑ Davidson, J.; Jones, S.; Molloy, L. et al. (2014). "Emerging Good Practice in Managing Research Data and Research Information within UK Universities". Procedia Computer Science 33: 215–22. doi:10.1016/j.procs.2014.06.035.
- ↑ Tenopir, C.; Allard, S.; Douglass, K. et al. (2011). "Data sharing by scientists: Practices and perceptions". PLoS One 6 (6): e21101. doi:10.1371/journal.pone.0021101. PMC3126798. PMID 21738610. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3126798.
- ↑ Tenopir, C.; Dalton, E.D.; Allard, S. et al. (2015). "Changes in Data Sharing and Data Reuse Practices and Perceptions among Scientists Worldwide". PLoS One 10 (8): e0134826. doi:10.1371/journal.pone.0134826. PMC4550246. PMID 26308551. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4550246.
- ↑ van Panhuis, W.G.; Paul, P.; Emerson, C. et al. (2014). "A systematic review of barriers to data sharing in public health". BMC Public Health 14: 1144. doi:10.1186/1471-2458-14-1144. PMC4239377. PMID 25377061. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4239377.
- ↑ Wallis, J.C.; Rolando, E.; Borgman, C.L. (2013). "If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology". PLoS One 8 (7): e67332. doi:10.1371/journal.pone.0067332. PMC3720779. PMID 23935830. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3720779.
- ↑ Irawan, D.E. (24 April 2018). "RDM policy and data archiving at university level -- technical bits -- an example from ITB". Figshare. https://figshare.com/articles/RDM_policy_and_data_archiving_at_university_level_--_technical_bits_--_an_example_from_ITB/6179084/1.
- ↑ Irawan, D.E. (19 September 2017). "A light introduction to research data management". Figshare. https://figshare.com/articles/A_light_introduction_to_research_data_management/5418694/1.
- ↑ 9.0 9.1 Irawan, D.E.; Rachmi, C.N. (15 May 2018). "Promoting data sharing among Indonesian scientists: A proposal of generic university-level RDMP". Open Science Framework. doi:10.17605/OSF.IO/59VCN. https://osf.io/59vcn/.
- ↑ 10.0 10.1 Neylon, C. (2017). "Compliance Culture or Culture Change? The role of funders in improving data management and sharing practice amongst researchers". Research Ideas and Outcomes 3: e14673. doi:10.3897/rio.3.e14673.
- ↑ 11.0 11.1 Neylon, C. (2017). "Building a Culture of Data Sharing: Policy Design and Implementation for Research Data Management in Development Research". Research Ideas and Outcomes 3: e21773. doi:10.3897/rio.3.e21773.
- ↑ Neylon, C. (2017). "Support Your Data: A Research Data Management Guide for Researchers". Research Ideas and Outcomes 4: e26439. doi:10.3897/rio.4.e26439.
- ↑ Irawan, D.E.; Vervoort, R.W.; Melzack, G. (19 December 2017). "Open Data Workshop SSEAC Usyd - ITB". Open Science Framework. doi:10.17605/OSF.IO/S76GU. https://osf.io/s76gu/.
- ↑ Digital Curation Center (2014). "Checklist for a Data Management Plan". http://www.dcc.ac.uk/resources/data-management-plans/checklist.
- ↑ Teperek, M.; Mollitt, B.; Southall, J.; Donaldson, M. (23 January 2017). "Wellcome DMP assessment rubric v2.0". Zenodo. doi:10.5281/zenodo.257650. https://zenodo.org/record/257650.
- ↑ University of California Curation Center (2018). "DMPTool". Regents of the University of California. https://dmptool.org/.
- ↑ Neylon, C. (2017). "Data Management Plan: IDRC Data Sharing Pilot Project". Research Ideas and Outcomes 3: e14672. doi:10.3897/rio.3.e14672.
- ↑ Traynor, C. (2017). "Data Management Plan: Empowering Indigenous Peoples and Knowledge Systems Related to Climate Change and Intellectual Property Rights". Research Ideas and Outcomes 3: e15111. doi:10.3897/rio.3.e15111.
- ↑ Wael, R. (2017). "Data Management Plan: HarassMap". Research Ideas and Outcomes 3: e15133. doi:10.3897/rio.3.e15133.
- ↑ Woolfrey, L. (2017). "Data Management Plan: Opening access to economic data to prevent tobacco related diseases in Africa". Research Ideas and Outcomes 3: e14837. doi:10.3897/rio.3.e14837.
- ↑ Bourne, P.E.; Polka, J.K.; Vale, R.D.; Kiley, R. (2017). "Ten simple rules to consider regarding preprint submission". PLoS Computational Biology 13 (5): e1005473. doi:10.1371/journal.pcbi.1005473. PMC5417409. PMID 28472041. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5417409.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation, and grammar for improved readability. In some cases important information was missing from the references, and that information was added. The original article listed citations in alphabetical order, while this wiki lists them by order of appearance, by design.