Difference between revisions of "Journal:Restricted data management: The current practice and the future"

Revision as of 21:00, 29 April 2024

Full article title: Restricted data management: The current practice and the future
Journal: Journal of Privacy and Confidentiality
Author(s): Jang, Joy B.; Pienta, Amy; Levenstein, Margaret; Saul, Joe
Author affiliation(s): Inter-university Consortium for Political and Social Research (ICPSR) at University of Michigan
Primary contact: Email: oyjang at umich dot edu
Year published: 2023
Volume and issue: 13(2)
Page(s): 1–9
DOI: 10.29012/jpc.844
ISSN: 2575-8527
Distribution license: Creative Commons Attribution-NonCommercial-NoDeriv 4.0 International
Website: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/844
Download: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/844/753 (PDF)

Abstract

Many restricted data managing organizations across the world have adopted the Five Safes framework (i.e., safe data, projects, people, setting, and output) for their management of restricted and confidential data. While the Five Safes have been well integrated throughout the data life cycle, organizations observe several unintended challenges in making those data FAIR (findable, accessible, interoperable, and reusable). In the current study, we review the current practice of restricted data management and discuss challenges and future directions, focusing especially on data use agreements, disclosure risk review, and training. In the future, restricted data managing organizations may need to proactively consider reducing inequalities in access to scientific development, preventing unethical use of data, and managing various types of data.

Keywords: confidentiality, data governance, FAIR, training

Introduction

Since the introduction of the Five Safes in the mid-2010s [Desai, Ritchie, and Welpton, 2016; Ritchie, 2017], many organizations managing restricted data have adopted the framework for the management of restricted and confidential data. The Five Safes framework helps organizations set guidelines for safe data created by data providers, safe projects for public good, safe people who are authenticated data users, safe settings in which data are being used, and safe outputs from analyzing data. The Five Safes have been well integrated throughout the data life cycle, and have led to good stewardship practices to make scientific data FAIR (findable, accessible, interoperable, and reusable). The framework also helps multiple stakeholders balance data utilization with protection of subject privacy and data confidentiality. Despite successful implementation of the Five Safes, organizations encounter unintended challenges. In this paper, we review the current practice of restricted data management and discuss challenges and future directions, focusing on data use agreements, disclosure risk review, and training for data users.

While organizations implement multiple modes of data access (e.g., virtual data enclaves [VDEs], physical data enclaves [PDEs], secure encrypted file downloads), our discussion applies mostly to VDEs and PDEs. Further, our discourse is centered around quantitative data, although we do not restrict the implications to only that type of data. In other words, even though our discussion of current practices relies largely on our experience with quantitative data accessible via VDEs or PDEs, the implications of our study may extend to newly emerging data types such as research notes, video, and electroencephalography.

Data use agreements

Data use agreements (DUAs) are risk mitigation tools that clarify expectations among multiple stakeholders. [O’Hara, 2020] DUAs must be entered into before users access or use any data, and may require periodic updates. DUAs may contain all Five Safes components: safe data (description of how data have been and will be treated to protect against disclosure risks); safe people (data users’ credentials); safe projects (research proposals demonstrating the intended data use); safe setting (plans for safe data access and handling); and safe outputs (procedures or rules on output publication and release). For some organizations, DUAs are stand-alone documents containing all five components. Other organizations require quite short DUAs, accompanied by separate materials such as a detailed research proposal, approval or exemption from an Institutional Review Board (IRB), and CVs from participants in the research project. Because DUAs involve multiple stakeholders, they allow for negotiation and the pursuit of consensus among parties.

Many organizations are bound by federal, state, and local laws, regulations, or policies reflecting their capability to access direct identifiers in the datasets. DUAs specify terms and conditions for data access and use, and clarify liability issues in advance. This upfront emphasis on DUAs helps mitigate confusion regarding liability in case of data breaches or suspected security incidents. DUAs require data users’ authenticated credentials; some organizations additionally ask that the researchers’ institutions be party to the DUAs, as leverage to enforce consequences for the institution. [Levenstein, 2020] Beyond providing legal leverage, the involvement of institutional representatives in DUAs helps researchers use data safely. Research shows that many data users care more about personal penalties (loss of access and funding, opinions of colleagues) than legal ones if an incident occurs. [Green, et al., 2017] Having multiple layers of liability may safeguard against data breaches or protocol violations by users. However, involvement of the institutions in the DUAs may impose a hurdle for research teams with collaborators from multiple institutions or from different countries. DUAs for research projects of this nature may have to consider heterogeneous requirements with regard to data privacy, confidentiality, and liability issues, which may cause significant delays in the process of data use.

Below, we discuss four distinctive challenges that organizations encounter with regard to restricted data management: limited opportunities for data access by certain groups of individuals; DUAs for research projects involving multiple institutions; limitations on binding laws against DUA non-compliance; and costs to access data.

Limited opportunities for data access by certain groups

As described, institutional involvement may help enforce consequences for both the institution and individual researchers. Data users who are affiliated with so-called typical research institutions (e.g., universities, government agencies, research institutes) have an institutional representative involved in the DUA process, and work with organizations without substantial challenges. Most of the processes are seamless, unless stakeholders raise concerns. (Even with concerns, the most serious challenge may be a delay in the process.) However, a requirement of institutional involvement can impose an insurmountable hurdle for those without an institutional affiliation (e.g., freelance journalists), for students without academic advisors, and for researchers from institutions with no relevant experience. Researchers and institutions negotiate details in DUAs and pursue consensus with data managing organizations, which could be a tremendous burden for small institutions. While institutional involvement is meant to help keep safe people safer, it may have unintentionally excluded researchers without that leverage. An exemption for those who have been authorized by, and have a record of good data use at, other organizations may need to be considered, and a template agreement that may mitigate the burdens should be available. [Levenstein, et al., 2018; O’Hara, 2020] Effective user training for ethical and scientific use of data may help alleviate concerns regarding data misuse by those with limited experience.

DUAs (or other supplementary materials) require safe settings to access restricted data. A safe setting in a DUA designates a space in which no unauthorized viewing is possible, for instance, an office space with a door that lacks a window. Shared space is not accepted by some organizations as a secure setting. Again, this requirement may impose a barrier for those with limited resources, such as students who would access restricted data in a shared office or cubicle. Organizations may need to consider accommodating the needs of those who have limited resources (e.g., allowing a privacy screen for those who access data in a shared office).

DUAs for research projects involving multiple institutions

When researchers from multiple institutions collaborate in a single research project, each institution enters into the DUA. DUAs clarify expectations and responsibilities for each institution according to the research plan. The process is often complicated when institutions are located in different countries (e.g., legitimacy of credential authentication or IRB approval in different languages). O’Hara [2020] suggests considering other forms of documentation in multi-site research projects, such as a memorandum of understanding (MOU) and identification of conflicts of interest. In some cases, requiring identical DUAs with all participating institutions, although time-consuming to complete, may cause less confusion than differing DUAs across institutions. Ultimately, to streamline the process of multi-site research projects, it may be helpful for organizations to consider incentives for good data users in different projects or even in different organizations. For example, the Research Passport of the Inter-university Consortium for Political and Social Research (ICPSR) expedites access to restricted data by giving researchers credits and visibility for “safe” actions in their past experiences with restricted data. [Levenstein, et al., 2018] This type of verification of users’ cumulative “safe” actions would tremendously help the procedures of DUAs across multiple institutions.

Limitations on binding laws against DUA non-compliance

Failure to comply with a DUA may result in immediate termination of data access and further actions that depend on the severity of the failure. Organizations establish procedures to respond to data security and breach incidents; some funders require reporting within one hour, as well as procedures to minimize the damage of the data breach or confidentiality disclosure. In the United States, violation of the Health Insurance Portability and Accountability Act (HIPAA) privacy standards can result in a civil monetary penalty imposed on the individual by the Department of Health and Human Services. Organizations bound by specific laws such as HIPAA must operate within those legal boundaries. Nonetheless, most data security incidents are unintentional or inadvertent violations of the protocol. They may pose minimal risk for subjects in datasets, and are thus better handled with effective user training. Organizations may do better to treat DUAs as a tool for all stakeholders to share responsibilities for data confidentiality (e.g., a community model) rather than as a tool for policing or punishing one party (e.g., a policing model). [Green, et al., 2017]

Costs to access restricted data

Even marginal costs of accessing data can be burdensome to researchers, but such costs are also important for organizations to consider. Data access costs include staff efforts to set up the access and to create datasets for users. The costs could unintentionally exclude some groups of researchers, such as junior scholars without research funds. Organizations and funding agencies could proactively intervene by waiving the costs of data access for researchers with limited resources. Doing so would help achieve Open Science [OECD, 2015], which aims to share data with minimal barriers for all researchers from different backgrounds.

Disclosure review practices

Safe output refers to statistical products created from restricted and/or sensitive data that have been vetted and approved as non-disclosive. Organizations help researchers utilize restricted data as effectively as possible without compromising data privacy and confidentiality. Safe output by safe people must go through a vetting process for disclosure risks. Disclosure review rules and procedures are set up in earlier steps of the data access process, such as the DUAs. Some data providers prefer to set up standards for disclosure avoidance rules and procedures with organizations during the data deposit process. Data providers and organizations also often discuss dissemination modes and tiers of access to establish the disclosure avoidance rules and procedures.

Disclosure review rules and procedures vary by type of data and access mode. Alves and Ritchie [2020] articulate two approaches to managing output vetting: “rules-based” and “principle-based” approaches. The rules-based approach establishes a set of strict rules regarding disclosive information and scrutinizes research outputs created from restricted data against those rules. On the other hand, the principle-based approach allows flexible negotiation between researchers and output vetting staff. The goal of the organizations is to implement efficient and effective procedures to protect data confidentiality and minimize disclosure risks, as well as to maximize data utilization. [Griffiths, et al., 2019; Levenstein, 2019] Most organizations apply the rules-based output vetting approach, with a certain level of flexibility, to various data types.
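
To make the rules-based approach concrete, the following is a minimal sketch of one commonly used rule, a minimum cell size check on a cross-tabulation. The threshold of 10, the function name small_cells, and the sample records are all assumptions made for illustration; they are not taken from any organization's actual vetting rules.

```python
from collections import Counter

# Hypothetical threshold: real thresholds are set by each organization
# and may differ by dataset.
MIN_CELL_SIZE = 10


def small_cells(records, row_var, col_var, threshold=MIN_CELL_SIZE):
    """Return observed cells of a row_var x col_var table with count < threshold."""
    counts = Counter((r[row_var], r[col_var]) for r in records)
    return {cell: n for cell, n in counts.items() if n < threshold}


# Made-up example records, for illustration only.
records = (
    [{"region": "north", "employed": "yes"}] * 25
    + [{"region": "north", "employed": "no"}] * 12
    + [{"region": "south", "employed": "yes"}] * 30
    + [{"region": "south", "employed": "no"}] * 3
)

flagged = small_cells(records, "region", "employed")
print(flagged)  # {('south', 'no'): 3}
```

A strict rules-based reviewer would reject or suppress any flagged cell, whereas a principle-based reviewer might discuss with the researcher whether the cell is actually disclosive in context.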

We review below the current practice and future directions in four domains: common output vetting requirements at organizations; reviewers of statistical outputs; automatic disclosure review procedure; and self-vetting that relies on “safe setting” and “safe people.”

Output vetting requirements

Organizations set up a standardized procedure for output vetting, including but not limited to output format, contents, and timeline to process each request. To illustrate, Table 1 summarizes output vetting requirements and considerations currently in place at many data archives at ICPSR. Most organizations have their own requirements and considerations in the restricted data use process. While standardizing the process and requirements across organizations could help streamline the procedures, it seems implausible because funders and data providers impose different requirements.

Table 1. Output vetting requirements and considerations at ICPSR.

Item: Format
Requirements:
* Presentation-ready format required/preferred (.pdf, .docx, .xlsx).
Examples:
* Raw outputs from statistical packages (e.g., SAS log, Stata log files, Mplus log) not accepted.

Item: Contents
Requirements:
* A description of the sample, sub-sample, analytic approach, and definitions of variables used in the analyses.
* Summary statistics for variables used in the analysis.
* Checklist (to help researchers self-vet before sending output to the vetting staff).
* Supporting documents (programming files).
Examples:
* Minimum cell size threshold is clearly described in the output vetting instruction.
* Minimum cell size threshold differs by type of dataset and linkage capability.

Item: Timeline
Requirements:
* Depends on the output, but most vetting is completed within 10 business days.
Examples:
* Missing requirements or insufficient supporting documents or materials would significantly extend the timeline.
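
As an illustration of how a researcher might self-check a request against the format and contents items in Table 1 before submitting it, here is a minimal sketch. The accepted file extensions come from Table 1; the item keys in REQUIRED_ITEMS and the function check_submission are hypothetical names introduced only for this example and do not describe any actual ICPSR tool.

```python
from pathlib import Path

# Presentation-ready formats listed in Table 1.
ACCEPTED_FORMATS = {".pdf", ".docx", ".xlsx"}

# Hypothetical keys standing in for the "Contents" requirements in Table 1.
REQUIRED_ITEMS = (
    "sample_description",
    "summary_statistics",
    "checklist",
    "supporting_documents",
)


def check_submission(filename, provided_items):
    """Return a list of problems found in an output vetting request."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ACCEPTED_FORMATS:
        problems.append(f"format {suffix or '(none)'} is not presentation-ready")
    for item in REQUIRED_ITEMS:
        if item not in provided_items:
            problems.append(f"missing {item}")
    return problems


# A raw log file with two required items missing.
issues = check_submission("results.log", {"sample_description", "checklist"})
for issue in issues:
    print(issue)
```

An empty list means the request passes these basic checks; it says nothing about disclosure risk itself, which still requires review by vetting staff.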


References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases important information was missing from the references, and that information was added.