Developing and Testing Software for Linking Patient Data from Multiple Sources
Date
2022
Publisher
Patient-Centered Outcomes Research Institute (PCORI)
Abstract
Background: Comparative effectiveness research (CER) and patient-centered outcomes research (PCOR) routinely use secondary data (eg, insurance claims, health records). Leveraging secondary data requires effective and accurate record linkage (RL), that is, matching the same individuals in different data sets. The absence of common, error-free, unique identifiers across data sources challenges RL and forces the use of identifying information (ie, names) to ensure proper linkage. This, in turn, raises privacy concerns. While automated methods are useful, high-quality RL requires human interaction (eg, parameter settings, building training data sets, validating results). Consequently, managing errors from imperfect and complex real-world data requires human access to identifiable data.
Objectives: Broadly, our objective was to investigate privacy-enhancing RL tools that can facilitate accurate matching with a hybrid human-computer system that strictly controls information disclosure. Specifically, we aimed to (1) design effective information visualizations for RL, (2) determine optimal levels of information disclosure for RL, and (3) develop consensus with patients and stakeholders on what they need to know about how RL is conducted. Using these findings, the main outcomes were to design (1) prototype open-source software, (2) a template privacy statement, (3) an IRB application template, and (4) a template data use agreement to share information about the software with appropriate stakeholders.
Methods: This research used methods from 2 fields. First, we used a human-computer interaction agile software development approach to develop the prototype software called MInimum Necessary Disclosure For Interactive Record Linkage (MiNDFIRL), including controlled user studies (N > 100), expert surveys, and case studies. Second, we used nominal group technique (NGT) focus groups and Delphi studies, which are commonly used in participatory action research, to engage stakeholders in the research. These methods were used to understand perceived benefits, risks, and practical concerns with the new privacy-enhancing approach that MiNDFIRL employs. Patients and experts in ethical, legal, and social implications (ELSI), including IRB experts, were engaged to develop the 3 companion documents for MiNDFIRL. We then conducted an online survey (N > 400) to obtain public opinion of the developed privacy statement.
Results: For iterative software design and development, the project included multiple formative evaluations through (i) 2 controlled experiments with volunteer nonexpert participants and (ii) an expert review. The first experiment (study A.1: N = 104) evaluated human decision-making in RL with the visual data-masking technique. A second experiment (study A.2.1: N = 122) focused on the on-demand interactive interface design for incrementally disclosing partial information. Collectively, the results demonstrate the ability to greatly limit the amount of identifying information available to human decision makers (only 7.85% compared with 100% with all data disclosed) without negatively affecting decision quality or completion time. We also conducted an expert review with 6 experts (study A.2.2). Their feedback supports the notion that a level of access to identifying information that is intermediate between “all or nothing” can provide better accuracy than that with no access but more protection than with full access. As a summative evaluation, 2 case studies were conducted to evaluate our approach in more realistic and complex operational scenarios at (i) the University of Texas Health Science Center at Houston (UTH; study B.1) and (ii) the University of Alabama at Birmingham Health System (UAB; study B.2). The studies consisted of RL with electronic health records (N = 10 000 total pairs, with 303 manually reviewed pairs) and patient-generated data (N = 1055 total pairs, with 187 manually reviewed pairs), respectively. Both the UTH and UAB results demonstrate that the default disclosure budget for identifying information in MiNDFIRL, set at 30% based on results from the formative studies, was sufficient for most human decision-making in RL. Our engagement research to develop template companion documents for the MiNDFIRL software included 4 studies: an NGT session with 11 ELSI experts (study D.1), an NGT session with 27 patients (study C.1), a Delphi study with 13 ELSI experts (study D.2), and a Delphi study with 33 patients (study C.2). Generally, we identified consensus across all studies. The potential to reduce risk to the minimum necessary was a main perceived benefit of our approach, while concerns remained across all studies about the organizational administrative controls that are still needed (eg, software configuration and secure system setup). In a nationally representative sample (study C.3: N = 470), more than 80% were satisfied with the privacy statement that was developed in a web-based, interactive, frequently asked questions format.
Conclusions: Our controlled experiments demonstrate that properly designed software can enhance privacy while supporting legitimate access for human decision-making. The results also suggest limits to how much data can be hidden before negatively influencing the quality of decisions. We also found that public privacy statements, written to reflect patients’ voices and interests, can increase transparency and improve patient trust. Based on these findings, we designed, implemented, and released the open-source MiNDFIRL prototype software along with 3 companion documents describing the use of the software for high-quality RL to support CER/PCOR.
Limitations: The current prototype software code, MiNDFIRL, needs to be fully developed for use across CER/PCOR. Additionally, the project scope did not include investigating automated algorithms required for a comprehensive hybrid human-computer system. Although we observed thematic saturation from the respondents, our qualitative studies (ie, NGT and Delphi) might not broadly reflect the full range of divergent opinions of all groups. Nonetheless, our large-scale, nationally representative sample did not find any differential preferences across socioeconomic status, providing support for the cocreated frequently asked questions language.
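The disclosure budget mentioned in the Results can be illustrated with a small sketch. This is not MiNDFIRL's implementation; it only shows the general idea of masking identifying values by default and unmasking them on demand until a budget (here the 30% default reported above) would be exceeded. The class name `DisclosureSession`, field-level reveals, counting disclosure as a share of characters, and the sample records are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

MASK = "*"


@dataclass
class DisclosureSession:
    """Hypothetical tracker for on-demand disclosure during manual review."""

    pairs: list                     # list of (record_a, record_b) dicts with the same keys
    budget: float = 0.30            # maximum share of characters that may be shown
    revealed: set = field(default_factory=set)  # (pair_index, field_name) keys

    def _chars(self, idx, name):
        # Number of characters in one field across both records of a pair.
        a, b = self.pairs[idx]
        return len(a[name]) + len(b[name])

    def total_chars(self):
        return sum(self._chars(i, n) for i, (a, _) in enumerate(self.pairs) for n in a)

    def disclosed_share(self):
        shown = sum(self._chars(i, n) for i, n in self.revealed)
        return shown / self.total_chars()

    def reveal(self, idx, name):
        """Unmask one field of one pair if the budget allows it."""
        if (idx, name) in self.revealed:
            return True
        cost = self._chars(idx, name) / self.total_chars()
        if self.disclosed_share() + cost > self.budget:
            return False            # request denied: budget would be exceeded
        self.revealed.add((idx, name))
        return True

    def view(self, idx):
        """Return the pair as a reviewer would see it: masked unless revealed."""
        a, b = self.pairs[idx]

        def show(rec):
            return {n: v if (idx, n) in self.revealed else MASK * len(v)
                    for n, v in rec.items()}

        return show(a), show(b)


# Made-up records, used only to exercise the sketch.
pairs = [
    ({"first": "Jon", "last": "Smith", "dob": "1980-01-02"},
     {"first": "John", "last": "Smith", "dob": "1980-01-02"}),
    ({"first": "Ann", "last": "Lee", "dob": "1975-05-05"},
     {"first": "Anne", "last": "Leigh", "dob": "1975-05-05"}),
]
session = DisclosureSession(pairs)
first_ok = session.reveal(0, "first")   # small field, fits within the budget
dob_ok = session.reveal(1, "dob")       # large field, would exceed the budget
print(session.view(0))
print(f"disclosed: {session.disclosed_share():.1%}")
```

Masking with same-length asterisks still leaks value lengths, so a real tool would need a more careful masking scheme; the sketch keeps it simple to make the budget accounting easy to follow.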
Citation
Kum H-C, Ragan E, Ferdinand A, Schmit C. (2022). Developing and Testing Software for Linking Patient Data from Multiple Sources. Patient-Centered Outcomes Research Institute (PCORI). https://doi.org/10.25302/04.2022.ME.160234486