Protecting Identifiers in Human Subjects Data

Introduction to concepts and basic techniques for disclosure analysis and protection of personal and health identifiers in research data for public or restricted access, following applicable JHU data governance policies. See Overview section for details.

Software for disclosure protection

The JHU Data Services website has a list of software that aids in identifying and removing PII/PHI. Here is a sample:

ARX Data Anonymization Tool

“A comprehensive software for risk- and utility-based privacy-preserving microdata publishing” developed at the Technical University of Munich, Germany. An excellent GUI-based tool offering a set of advanced statistical techniques, which still require some expertise to apply properly.

DICOMCleaner
A free, open-source tool with a user interface for importing, “cleaning,” and saving sets of DICOM instances (files).

NLM-Scrubber
“A freely available, HIPAA compliant, clinical text de-identification tool designed and developed at the National Library of Medicine.” It works on records converted to ASCII text and runs from a command-line/terminal interface in Linux or Windows.

Advanced techniques overview

When to consider advanced de-identification techniques

Consider more advanced de-identification techniques when:

  • preparing a public access dataset
  • de-identification removes too many key variables of interest
  • preparing complex quantitative datasets, especially for large samples with multiple variables

Advanced de-identification is not always a "do-it-yourself" activity. If your project requires advanced de-identification, you may need a professional statistician, and datasets may need to be reviewed and approved for disclosure risk before public release. Investigators and researchers should ensure that they follow the data governance policies and procedures that apply to their data. Medical and health research subject to oversight by the JHM Data Trust Council must follow additional requirements for de-identification and disclosure review. Please refer to the JHM Data Trust Council section of this guide.

Here is an overview of a few of the common advanced techniques:

Collapsing categories: Combine categories with low frequencies (few participants per category) into broader ranges by creating a broader coding scheme.
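As a rough illustration of collapsing, the following minimal Python sketch merges rare categories into a single label and recodes exact ages into broader ranges. The function names, the minimum count of 5, the "Other" label, and the sample data are all assumptions made for this example, not values recommended by this guide or by any policy.

```python
from collections import Counter


def collapse_rare_categories(values, min_count=5, other_label="Other"):
    """Replace any category seen fewer than min_count times with other_label.

    min_count and other_label are illustrative defaults, not recommended values.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def recode_age_to_range(age, width=10):
    """Map an exact age to a broader range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical data: two occupations are rare enough to stand out.
occupations = ["nurse"] * 6 + ["teacher"] * 5 + ["astronaut"] + ["falconer"] * 2
collapsed = collapse_rare_categories(occupations, min_count=5)

print(Counter(collapsed))
print([recode_age_to_range(a) for a in (23, 37, 41)])
```

In this sample the merged "Other" category still holds only three records, so the frequencies should be checked again after collapsing; a merged category can itself remain too small.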

Top and bottom coding: Replace extreme values at the top and bottom of a variable's range with a cutoff value, so that outliers are harder to identify.
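Top and bottom coding can be sketched in a few lines of Python. The example below clamps each value to a lower and an upper cutoff. The function name, the cutoffs, and the income figures are hypothetical; in practice the cutoffs would be chosen from the distribution of the data and any applicable disclosure review requirements.

```python
def top_bottom_code(values, lower, upper):
    """Clamp each value to the interval [lower, upper].

    Values below lower are reported as lower (bottom coding);
    values above upper are reported as upper (top coding).
    """
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    return [min(max(v, lower), upper) for v in values]


# Hypothetical annual incomes with one extreme value at each end.
incomes = [12_000, 35_000, 48_000, 52_000, 61_000, 75_000, 980_000]
coded = top_bottom_code(incomes, lower=20_000, upper=150_000)
print(coded)
```

Only the two extreme values change in this sample; values inside the cutoffs are left as they are.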

Microaggregation: Group 3-5 similar records for a problematic (non-range) variable, and replace each original value with the mean of those records.
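The sketch below shows one simple way microaggregation might look for a single numeric variable: sort the values, form groups of at least k neighbouring records, and replace each value with its group mean. This fixed-size, one-variable approach is a simplification chosen for illustration; the function name, k=3, and the sample values are assumptions, and dedicated tools use more refined grouping methods.

```python
def microaggregate(values, k=3):
    """Replace each value with the mean of its group of at least k similar values.

    Similarity here simply means being adjacent after sorting. Any leftover
    records (fewer than k) are folded into the last group.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the records, ordered by value, so results map back to rows.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    n = len(order)
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # not enough records left for another full group
            end = n
        group = order[start:end]
        group_mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = group_mean
        start = end
    return result


# Hypothetical ages for nine participants.
ages = [53, 18, 47, 20, 91, 19, 50, 85, 88]
aggregated = microaggregate(ages, k=3)
print(aggregated)
```

Each record keeps its position in the dataset, and the overall mean of the variable is unchanged, but no published value belongs to fewer than k records.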

Copyright © 2019, by Johns Hopkins Data Services. No reproduction without permission.

Record swapping: For categorical variables, draw a random sample of at-risk records and swap values between closely matched participants. This adds "noise" that increases the anonymity of the overall dataset.
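A minimal Python sketch of record swapping follows. It assumes the at-risk records have already been identified, and it pairs each sampled record with the unused record closest to it on one numeric matching variable. The function name, field names, sampling fraction, and data are all hypothetical, and real swapping procedures usually match on several variables.

```python
import random


def swap_records(records, at_risk, swap_var, match_var, fraction=0.5, seed=0):
    """Swap swap_var between sampled at-risk records and their closest partners.

    records: list of dicts; at_risk: list of indices into records.
    The closest partner is the unused record with the nearest match_var value.
    Returns a new list; the input records are not modified.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # work on copies
    n_sample = max(1, round(len(at_risk) * fraction)) if at_risk else 0
    used = set()
    for i in rng.sample(at_risk, n_sample):
        if i in used:
            continue
        candidates = [j for j in range(len(records)) if j != i and j not in used]
        if not candidates:
            break
        partner = min(
            candidates,
            key=lambda j: abs(records[j][match_var] - records[i][match_var]),
        )
        # Exchange the categorical values between the two records.
        swapped[i][swap_var] = records[partner][swap_var]
        swapped[partner][swap_var] = records[i][swap_var]
        used.update((i, partner))
    return swapped


# Hypothetical data: the only participant in county "C" is treated as at risk.
people = [
    {"id": 1, "age": 34, "county": "A"},
    {"id": 2, "age": 35, "county": "B"},
    {"id": 3, "age": 61, "county": "C"},
    {"id": 4, "age": 60, "county": "A"},
    {"id": 5, "age": 47, "county": "B"},
]
result = swap_records(people, at_risk=[2], swap_var="county",
                      match_var="age", fraction=1.0)
print(result)
```

Swapping leaves the overall frequency of each category unchanged, which is why it preserves summary counts while weakening the link between an individual record and its original value.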

Sources
“Collapsing Data across Observations | SPSS Learning Modules.” n.d. Accessed March 5, 2019. https://stats.idre.ucla.edu/spss/modules/collapsing-data-across-observations/.
Domingo-Ferrer, Josep. 2009a. “Data Rank/Swapping.” In Encyclopedia of Database Systems, edited by Ling Liu and M. Tamer Özsu, 620–21. Boston, MA: Springer US. https://doi.org/10.1007/978-0-387-39940-9_1497.
Domingo-Ferrer, Josep. 2009b. “Microaggregation.” In Encyclopedia of Database Systems, edited by Ling Liu and M. Tamer Özsu, 1736–37. Boston, MA: Springer US. https://doi.org/10.1007/978-0-387-39940-9_1496.
“Top-Coded.” 2018. In Wikipedia. https://en.wikipedia.org/w/index.php?title=Top-coded&oldid=870775236.