Journal:Restricted data management: The current practice and the future
| Full article title | Restricted data management: The current practice and the future |
|---|---|
| Journal | Journal of Privacy and Confidentiality |
| Author(s) | Jang, Joy B.; Pienta, Amy; Levenstein, Margaret; Saul, Joe |
| Author affiliation(s) | Inter-university Consortium for Political and Social Research (ICPSR) at University of Michigan |
| Primary contact | Email: oyjang at umich dot edu |
| Year published | 2023 |
| Volume and issue | 13(2) |
| Page(s) | 1–9 |
| DOI | 10.29012/jpc.844 |
| ISSN | 2575-8527 |
| Distribution license | Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International |
| Website | https://journalprivacyconfidentiality.org/index.php/jpc/article/view/844 |
| Download | https://journalprivacyconfidentiality.org/index.php/jpc/article/view/844/753 (PDF) |
Abstract
Many restricted data managing organizations across the world have adopted the Five Safes framework (i.e., safe data, projects, people, settings, and outputs) for their management of restricted and confidential data. While the Five Safes have been well integrated throughout the data life cycle, organizations face several unintended challenges in making those data FAIR (findable, accessible, interoperable, and reusable). In the current study, we review current practices in restricted data management and discuss challenges and future directions, focusing especially on data use agreements, disclosure risk review, and training. Going forward, restricted data managing organizations may need to proactively consider reducing inequalities in access to scientific development, preventing unethical use of data, and managing various types of data.
Keywords: confidentiality, data governance, FAIR, training
Introduction
Since the introduction of the Five Safes in the mid-2010s [Desai, Ritchie, and Welpton, 2016; Ritchie, 2017], many organizations managing restricted data have adopted the framework for the management of restricted and confidential data. The Five Safes framework helps organizations set guidelines for safe data created by data providers, safe projects for public good, safe people who are authenticated data users, safe settings in which data are being used, and safe outputs from analyzing data. The Five Safes have been well integrated throughout the data life cycle and have led to good stewardship practices that make scientific data FAIR (findable, accessible, interoperable, and reusable). The framework also helps multiple stakeholders balance data utilization with the protection of subject privacy and data confidentiality. Despite the successful implementation of the Five Safes, organizations encounter unintended challenges. In this paper, we review the current practice of restricted data management and discuss challenges and future directions, focusing on data use agreements, disclosure risk review, and training for data users.
While organizations implement multiple modes of data access (e.g., virtual data enclaves [VDEs], physical data enclaves [PDEs], secure encrypted file downloads), our discussion applies mostly to VDEs and PDEs. Further, our discussion is centered on quantitative data, although we do not restrict the implications to only that type of data. In other words, even though our discussion of current practices relies largely on our experience with quantitative data accessible via VDE or PDE, the implications of our study may extend to newly emerging data types such as research notes, video, and electroencephalography.
Data use agreements
Data use agreements (DUAs) are risk mitigation tools that clarify expectations among multiple stakeholders. [O’Hara, 2020] DUAs must be entered into before users access or use data, and may require periodic updates. DUAs may contain all Five Safes components: safe data (a description of how data have been and will be treated to protect against disclosure risks); safe people (data users’ credentials); safe projects (research proposals demonstrating the intended data use); safe setting (plans for safe data access and handling); and safe outputs (procedures or rules for output publication and release). For some organizations, DUAs are stand-alone documents containing all five components. Other organizations require quite short DUAs accompanied by separate materials, such as a detailed research proposal, approval or exemption from an Institutional Review Board (IRB), and CVs from participants in the research project. Because multiple stakeholders are involved in DUAs, the agreements allow for negotiation and the pursuit of consensus among parties.
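For illustration only, the sketch below shows one way a DUA's Five Safes components might be captured as a structured record for tracking purposes. The `DataUseAgreement` class and its field names are hypothetical assumptions, not any organization's actual schema.

```python
# A minimal, hypothetical sketch of a DUA's Five Safes components
# represented as a structured record. The DataUseAgreement class and
# its field names are illustrative, not any organization's schema.

from dataclasses import dataclass, field

@dataclass
class DataUseAgreement:
    safe_data: str       # how data have been/will be treated against disclosure risk
    safe_people: list    # authenticated data users' credentials
    safe_projects: str   # research proposal demonstrating intended use
    safe_setting: str    # plan for secure data access and handling
    safe_outputs: str    # procedures for output publication and release
    supplements: list = field(default_factory=list)  # e.g., IRB approval, CVs

dua = DataUseAgreement(
    safe_data="Direct identifiers removed; cells suppressed per provider rules",
    safe_people=["PI (university faculty)", "graduate research assistant"],
    safe_projects="Approved proposal demonstrating intended data use",
    safe_setting="Access via virtual data enclave (VDE) only",
    safe_outputs="All outputs vetted for disclosure risk before release",
    supplements=["IRB approval or exemption", "participant CVs"],
)
print(dua.safe_setting)
```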
Many organizations are bound by federal, state, and local laws, regulations, or policies that govern their ability to access direct identifiers in datasets. DUAs specify terms and conditions for data access and use, and clarify liability issues in advance. This upfront emphasis helps mitigate confusion about liability in the event of data breaches or suspected security incidents. DUAs require data users’ authenticated credentials; some organizations additionally require the involvement of the researchers’ institutions in DUAs as leverage to enforce consequences for the institution. [Levenstein, 2020] Beyond legal leverage, involving institutional representatives in DUAs helps promote safe use of data by researchers. Research shows that when an incident happens, many data users care more about personal penalties (loss of access and funding, the opinions of colleagues) than about legal ones. [Green, et al., 2017] Having multiple layers of liability may safeguard against data breaches or protocol violations by users. However, involving institutions in DUAs may impose a hurdle for research teams with collaborators from multiple institutions or different countries. DUAs for projects of this nature may have to reconcile heterogeneous requirements regarding data privacy, confidentiality, and liability, which can cause significant delays in the process of data use.
Below, we discuss four distinct challenges that organizations encounter in restricted data management: limited opportunities for data access by certain groups; DUAs for research projects involving multiple institutions; limitations of binding laws against DUA non-compliance; and costs to access restricted data.
Limited opportunities for data access by certain groups
As described, institutional involvement may help enforce consequences for both the institution and individual researchers. Data users affiliated with typical research institutions (e.g., universities, government agencies, research institutes) have an institutional representative involved in the DUA process and work with organizations without substantial challenges. Most of the process is seamless unless stakeholders raise concerns. (Even then, the most serious consequence may be a delay in the process.) However, a requirement of institutional involvement can impose an insurmountable hurdle for those without an institutional affiliation, such as freelance journalists, or for students without academic advisors or from institutions with no experience in this process. Researchers and institutions negotiate details in DUAs and pursue consensus with data managing organizations, which can be a tremendous burden for small institutions. While institutional involvement is meant to keep safe people safer, it may have unintentionally excluded researchers without that leverage. Organizations may need to consider an exemption for those who have been authorized and have been good users at other organizations, and a template agreement that mitigates these burdens should be available. [Levenstein, et al., 2018; O’Hara, 2020] Effective user training on ethical and scientific use of data may help alleviate concerns about data misuse by those with limited experience.
DUAs (or other supplemental materials) require safe settings for access to restricted data. A safe setting in a DUA designates a space in which no unauthorized viewing is possible, for instance, an office space with a door that lacks a window. Some organizations do not accept shared space as a secure setting. Again, this requirement may impose a barrier for those with limited resources, such as students who would access restricted data in a shared office or cubicle. Organizations may need to consider embracing those with limited resources by accommodating their needs (e.g., allowing a privacy screen for those who access data in a shared office).
DUAs for research projects involving multiple institutions
When researchers from multiple institutions collaborate on a single research project, each institution enters into the DUA. DUAs clarify expectations and responsibilities for each institution according to the research plan. The process is often complicated when institutions are located in different countries (e.g., establishing the legitimacy of credential authentication or IRB approval in different languages). O’Hara [2020] suggests considering other forms of documentation in multi-site research projects, such as a memorandum of understanding (MOU) and the identification of conflicts of interest. In some cases, requiring identical DUAs with all participating institutions, although time-consuming to complete, may reduce confusion compared to differing DUAs across institutions. Ultimately, to streamline multi-site research projects, it may be helpful for organizations to consider incentives for good data users across projects or even across organizations. For example, the Research Passport of the Inter-university Consortium for Political and Social Research (ICPSR) expedites access to restricted data by giving researchers credit and visibility for “safe” actions in their past experiences with restricted data. [Levenstein, et al., 2018] This type of verification of users’ cumulative “safe” actions would greatly streamline DUA procedures across multiple institutions.
Limitations on binding laws against DUA non-compliance
Failure to comply with a DUA may result in immediate termination of data access and further actions depending on the severity of the failure. Organizations establish procedures to respond to data security and breach incidents; some funders require reporting within one hour, along with procedures to minimize the damage of a data breach or confidentiality disclosure. In the United States, violation of the Health Insurance Portability and Accountability Act (HIPAA) privacy standards can result in a civil monetary penalty imposed on the individual by the Department of Health and Human Services. Organizations bound by specific laws such as HIPAA must operate within those high-level legal boundaries. Nonetheless, most data security incidents are unintentional or inadvertent violations of protocol. They may pose minimal risk to subjects in datasets and are thus better handled with effective user training. Organizations may be better served by treating DUAs as a tool for all stakeholders to share responsibility for data confidentiality (e.g., a community model) rather than as a tool for policing or punishing one party (e.g., a policing model). [Green, et al., 2017]
Costs to access restricted data
Even marginal costs of accessing data can be burdensome to researchers, and such costs are also important for organizations to consider. Data access costs include staff efforts to set up access and to create datasets for users. These costs could unintentionally exclude some groups of researchers, such as junior scholars without research funds. Organizations and funding agencies could proactively intervene by waiving data access costs for researchers with limited resources. Doing so would help achieve Open Science [OECD, 2015], which aims to share data with minimal barriers for all researchers from different backgrounds.
Disclosure review practices
Safe output refers to statistical products created from restricted and/or sensitive data that have been vetted and approved as non-disclosive. Organizations help researchers utilize restricted data as effectively as possible without compromising data privacy and confidentiality. Safe output by safe people must go through a vetting process for disclosure risks. Disclosure review rules and procedures are set up in earlier steps of the data access process, such as in the DUAs. Some data providers prefer to establish standards for disclosure avoidance rules and procedures with organizations during the data depositing process. Data providers and organizations also often discuss dissemination modes and tiers of access when establishing disclosure avoidance rules and procedures.
Disclosure review rules and procedures vary by data type and access mode. Alves and Ritchie [2020] articulate two approaches to managing output vetting: “rules-based” and “principle-based” approaches. The rules-based approach establishes a strict set of rules regarding disclosive information and scrutinizes research outputs created from restricted data against those rules. The principle-based approach, on the other hand, allows flexible negotiation between researchers and output vetting staff. The goal of organizations is to implement efficient and effective procedures that protect data confidentiality and minimize disclosure risks while maximizing data utilization. [Griffiths, et al., 2019; Levenstein, 2019] Most organizations apply the rules-based output vetting approach, with a certain level of flexibility, to various data types.
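As a concrete illustration of the rules-based approach, the sketch below checks a frequency table against a minimum cell-count rule of the kind mentioned later in this paper. The threshold of 10 and all function names are illustrative assumptions; actual thresholds and rules vary by organization, data provider, and dataset.

```python
# A minimal sketch of a rules-based output check, assuming research
# outputs arrive as frequency tables (lists of cell counts). The
# threshold of 10 is an illustrative default, not an actual policy.

MIN_CELL_COUNT = 10  # hypothetical minimum cell size rule

def vet_frequency_table(cell_counts):
    """Flag any nonzero cell that falls below the minimum cell count."""
    violations = [
        (index, count)
        for index, count in enumerate(cell_counts)
        if 0 < count < MIN_CELL_COUNT
    ]
    return {"approved": not violations, "flagged_cells": violations}

if __name__ == "__main__":
    # A 2x2 cross-tabulation flattened into four cells; the count of 3
    # would be flagged as potentially disclosive under the rule above.
    print(vet_frequency_table([120, 45, 3, 78]))
    # -> {'approved': False, 'flagged_cells': [(2, 3)]}
```

A principle-based review, by contrast, would treat such a flag as the start of a negotiation between the researcher and vetting staff rather than as an automatic rejection.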
We review below the current practice and future directions in four domains: common output vetting requirements at organizations; reviewers of statistical outputs; automated disclosure review procedures; and self-vetting that relies on “safe setting” and “safe people.”
Output vetting requirements
Organizations set up standardized procedures for output vetting, including but not limited to output format, contents, and the timeline to process each request. To illustrate, Table 1 summarizes output vetting requirements and considerations currently in place at many data archives at ICPSR. Most organizations have their own requirements and considerations in the restricted data use process. While standardizing the process and requirements could help streamline procedures, it seems implausible due to differing requirements by funders and data providers.
Table 1. Output vetting requirements and considerations at data archives at ICPSR
Reviewers of statistical outputs
It is preferred that organizations have output-vetting reviewers with backgrounds in statistics or the relevant subject areas, but this is not a requirement. More important are: 1) the independence of the reviewers; 2) the four eyes principle; and 3) a manageable workload without excessive pressure. [Griffiths, et al., 2019]
Most organizations have designated individuals responsible for output vetting. For example, ICPSR maintains at least five experts at all times, with two or three back-ups, who vet outputs created from the VDE or PDE. These experts are mostly ICPSR staff members who are not affiliated with any of the users’ research projects (i.e., maintaining independence). To bolster confidence in decisions about whether to release output, organizations maintain a group of reviewers (the four eyes principle; managing workload). Some organizations operate a committee that discusses the risks to data confidentiality and privacy from research outputs. The committee usually consists of a group of experts who oversee data confidentiality and evaluate disclosure risks from the use of restricted data. For example, the ICPSR Disclosure Review Board (DRB) fills a leadership and scholarly role in the disclosure avoidance community and serves as a decision-making body within ICPSR with regard to disclosure risks and exceptions to existing policies. The ICPSR DRB consists of a Chair (the ICPSR Privacy and Security Officer), a Vice-Chair, and 10 experts from within and outside the organization. Individual ICPSR reviewers can query the DRB about disclosure risks in outputs and defer the approval decision to the DRB. Further, the DRB reviews the ICPSR disclosure rules in light of new regulations and changes to the wider data environment, assesses new disclosure reduction methods and technologies for possible adoption, and develops rules around them. The ICPSR DRB convenes every month.
Having a group of experts (e.g., a committee) who can provide a second set of eyes on disclosure risks is beneficial for confidentiality and privacy protection, but it can create frustration for data users on a tight timeline. It is important for organizations to keep the procedure for committee involvement flexible, e.g., by making an ad hoc subcommittee available for immediate consultation on specific requests.
Automated disclosure review
Organizations try to standardize the disclosure review process despite disparate requirements by data type, funding agency, and data depositor that hamper progress. High-level standardization of the disclosure review process helps streamline the vetting process and may shorten the vetting timeline. In terms of vetting guidelines, standardization would be easy for the rules-based approach (setting common strict rules across datasets and organizations), but it could diminish data utilization if some outputs were unnecessarily deemed risky. Standardizing output vetting under the principle-based approach may be easier to implement: apply a rule of thumb to vet each output and release it if the risks are negligible. [Griffiths, et al., 2019] One caveat regarding standardization of the principle-based approach is that organizations may need highly qualified expert reviewers to assess the disclosure risks of statistical outputs.
Most organizations support a pool of experts to perform disclosure risk reviews, which is often time- and resource-consuming. Instead, organizations may consider an automated disclosure review system, since output checking for disclosure risks is not necessarily a statistical matter but an operational one. [Alves and Ritchie, 2020] In fact, some organizations have already implemented machine-driven output checking for relatively simple matters such as minimum cell thresholds, although others still rely on humans for output checking. Stocchi and Bujnowska [2021] summarized the automated Stata programming developed by Ritchie et al. [2021], suggesting that automated checking may work more effectively in a joint effort with expert personnel. Ritchie et al. [2021] also pointed out that automated tools may over-protect data by treating every possible case as an actual risk (which might compromise the utilization of restricted data). The tool may also over- or under-protect against disclosure risks due to its inability to determine the context of data use. [Ritchie, et al., 2021] A combination of the automated review process with expert check-ups might be most effective. Further, safe output created by safe users may help an automated disclosure review system work best. Organizations may invest in user training for good output preparation and checking behaviors, which ultimately saves reviewers’ efforts and other resources.
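Building on the point that automated tools cannot judge the context of data use, the sketch below illustrates one way such a hybrid might be arranged: outputs that pass the automated check are released, while flagged outputs are routed to a human review queue rather than rejected outright. The `triage_output` function and queue structure are hypothetical, intended only to make the division of labor concrete.

```python
# A minimal sketch of hybrid review triage, assuming each output request
# carries the result of an automated check (e.g., from the earlier
# vet_frequency_table sketch). Flagged requests are deferred to expert
# reviewers instead of being auto-rejected, since automated tools
# cannot judge context. All names here are illustrative assumptions.

def triage_output(request_id, auto_result, human_queue):
    """Release clean outputs automatically; defer flagged ones to experts."""
    if auto_result["approved"]:
        return (request_id, "released")
    human_queue.append((request_id, auto_result["flagged_cells"]))
    return (request_id, "pending expert review")

if __name__ == "__main__":
    queue = []
    print(triage_output("req-001", {"approved": True, "flagged_cells": []}, queue))
    print(triage_output("req-002", {"approved": False, "flagged_cells": [(2, 3)]}, queue))
    print(queue)  # -> [('req-002', [(2, 3)])]
```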
Self-vetting that relies on "safe setting" and "safe people"
Outputs created within a VDE or PDE must go through a vetting process before retrieval, either by experts or by an automated vetting system. In contrast, organizations must rely on output self-vetting by data users who access data via a secure download method. Organizations do not scrutinize each output created from a secure download but still strive to ensure a “safe setting” and “safe people” by providing training and guidelines. Many organizations also conduct audits of data management and use in safe settings by safe people. However, given the greater disclosure risks of secure encrypted data download dissemination, additional efforts to ensure safe data may be required.
Training
References
Notes
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases, important information was missing from the references, and that information was added.