Protecting Human Subject Identifiers

Introduction to concepts and basic techniques for disclosure analysis and protection of personal and health identifiers in research data for public or restricted access, following applicable JHU data governance policies. See Overview section for details.

Advanced techniques overview

When to consider advanced de-identification techniques

Consider more advanced de-identification techniques when:

  • preparing a public access dataset
  • de-identification removes too many key variables of interest
  • preparing complex quantitative datasets, especially for large samples with multiple variables

Advanced de-identification is not always a "do-it-yourself" activity. If your project requires it, you may need a professional statistician, and datasets may need to be reviewed and approved for disclosure risk before public release. Investigators and researchers should follow the data governance policies and procedures that apply to their data. Medical and health research subject to oversight by the JHM Data Trust Council must follow additional requirements for de-identification and disclosure review. Please refer to the JHM Data Trust Council section of this guide.

Here is an overview of a few of the common advanced techniques:

Collapsing categories: Combine categories with low frequencies (those that cover only a small proportion of records) into broader ranges by using a coarser coding scheme.

“Collapsing Data across Observations | SPSS Learning Modules.” n.d. Accessed March 5, 2019. https://stats.idre.ucla.edu/spss/modules/collapsing-data-across-observations/.
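As a rough illustration of collapsing, the sketch below recodes rare categories into a single broader category and maps exact ages to ten-year bands. The function names, the minimum count of 5, the "Other" label, and the sample data are all assumptions made for this example; the appropriate threshold depends on the governance policies that apply to your data.

```python
from collections import Counter


def collapse_rare(values, min_count=5, other_label="Other"):
    """Recode categories seen fewer than min_count times into one broader category.

    min_count=5 is an illustrative threshold, not a policy requirement.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def to_age_band(age, width=10):
    """Map an exact age to a broader range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical sample data: two common occupations and two rare ones.
occupations = ["nurse"] * 6 + ["teacher"] * 5 + ["astronaut"] + ["falconer"] * 2
collapsed = collapse_rare(occupations, min_count=5)
age_bands = [to_age_band(a) for a in (34, 39, 40, 67)]

print(Counter(collapsed))
print(age_bands)
```

After collapsing, it is worth re-checking the frequency table, since a combined category can itself still be small.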

Date shifting: If complete numeric dates are needed for calculation purposes, shift them rather than truncating them to a year. Use date shifting of plus or minus 182 days, which is equivalent to truncating to a year. (Shifts of less than a year may still be considered direct identifiers.) Dates may be shifted by a fixed amount, but randomized amounts are generally more secure. For participants with multiple events, a single random shift value may be assigned to each participant, so that the intervals between that participant's events are preserved.

Date shifting: Hripcsak G, Mirhaji P, Low AF, Malin BA. Preserving temporal relations in clinical data while maintaining privacy. J Am Med Inform Assoc. 2016;23(6):1040-1045. doi:10.1093/jamia/ocw001
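The per-participant approach to date shifting might be sketched as follows. This is a minimal example under stated assumptions: the function names, participant IDs, and dates are hypothetical, zero offsets are excluded so that every date actually moves, and the seeded generator is used only to make the demonstration repeatable.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 182  # plus or minus 182 days, as described in the text


def assign_shifts(participant_ids, rng):
    """Draw one random, non-zero day offset per participant."""
    offsets = [d for d in range(-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS + 1) if d != 0]
    return {pid: rng.choice(offsets) for pid in sorted(set(participant_ids))}


def shift_events(events, shifts):
    """Apply each participant's own offset to all of that participant's dates."""
    return [(pid, d + timedelta(days=shifts[pid])) for pid, d in events]


# Assumption: a seeded generator keeps this demo repeatable. For real data,
# a non-reproducible source such as random.SystemRandom() is a safer choice.
rng = random.Random(2019)

# Hypothetical events: participant P01 has two visits, P02 has one.
events = [
    ("P01", date(2018, 3, 14)),
    ("P01", date(2018, 4, 2)),
    ("P02", date(2018, 3, 14)),
]
shifts = assign_shifts([pid for pid, _ in events], rng)
shifted = shift_events(events, shifts)

for (pid, original), (_, new) in zip(events, shifted):
    print(pid, original, "->", new)
```

Because both of P01's dates move by the same offset, the 19 days between the two visits are unchanged. The table of offsets can be used to re-identify dates, so it should be stored securely and kept out of the released dataset.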

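Top and bottom coding: Replace extreme values (outliers) at the top and bottom of a variable's range with a threshold value or a broader category, for example by recoding every income above a chosen cutoff as that cutoff.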
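A minimal sketch of top and bottom coding, assuming numeric values and thresholds chosen by the researcher. The function name, the income figures, and both cutoffs below are hypothetical.

```python
def top_bottom_code(values, lower, upper):
    """Replace values below `lower` with `lower` and values above `upper` with `upper`."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical annual incomes with one very low and one very high outlier.
incomes = [12_000, 38_000, 41_500, 56_000, 73_000, 1_250_000]
coded = top_bottom_code(incomes, lower=20_000, upper=150_000)

print(coded)
```

In a released dataset, the codebook would normally state that the end values mean "20,000 or less" and "150,000 or more" rather than exact amounts.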

Microaggregation: Group 3-5 similar records for a problematic (non-range) variable, and replace each original value with the mean of those records.

Copyright © 2019, by Johns Hopkins Data Services. No reproduction without permission.

Domingo-Ferrer, Josep. 2009b. “Microaggregation.” In Encyclopedia of Database Systems, edited by Ling Liu and M. Tamer Özsu, 1736–37. Boston, MA: Springer US. https://doi.org/10.1007/978-0-387-39940-9_1496.
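For a single numeric variable, microaggregation can be sketched as below. This simplified version treats "similar" as "adjacent after sorting on that one variable", which is an assumption of the example; published microaggregation methods often group on several variables at once. The function name and sample values are hypothetical.

```python
from statistics import mean


def microaggregate(values, k=3):
    """Replace each value with the mean of its group of at least k similar values."""
    if len(values) < k:
        raise ValueError("need at least k values to form one group")
    # Indices of the records, ordered by value, so neighbours are similar.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    start = 0
    while start < len(order):
        end = start + k
        # If fewer than k records would be left over, fold them into this group
        # so that no group ends up smaller than k.
        if len(order) - end < k:
            end = len(order)
        group = order[start:end]
        group_mean = mean(values[i] for i in group)
        for i in group:
            result[i] = group_mean
        start = end
    return result


# Hypothetical values for one variable across seven records.
weights = [52, 18, 49, 20, 95, 19, 50]
aggregated = microaggregate(weights, k=3)

print(aggregated)
```

Each record keeps its position in the dataset, and the overall total of the variable is preserved because every group is replaced by its own mean.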

Record swapping: For categorical variables, take a random sample of at-risk records and swap values among closely paired participants. This adds "noise" that increases the anonymity of the overall dataset.

Domingo-Ferrer, Josep. 2009a. “Data Rank/Swapping.” In Encyclopedia of Database Systems, edited by Ling Liu and M. Tamer Özsu, 620–21. Boston, MA: Springer US. https://doi.org/10.1007/978-0-387-39940-9_1497.
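One way to picture record swapping is the sketch below. It assumes that "closely paired" means two records that agree exactly on a set of matching variables, and that the researcher supplies a rule for which records are at risk; both are assumptions of this example, as are the function name, field names, and sample records.

```python
import random
from collections import defaultdict


def swap_records(records, swap_var, match_vars, at_risk, rate, rng):
    """Swap swap_var between pairs of sampled at-risk records that agree on match_vars."""
    out = [dict(r) for r in records]  # copy, so the input is left untouched
    candidates = [i for i, r in enumerate(records) if at_risk(r)]
    sampled = rng.sample(candidates, k=round(rate * len(candidates)))

    # Group the sampled records by their values on the matching variables.
    groups = defaultdict(list)
    for i in sampled:
        groups[tuple(records[i][v] for v in match_vars)].append(i)

    # Within each group, pair records at random and exchange the swap variable.
    # A record left without a partner keeps its original value.
    for members in groups.values():
        rng.shuffle(members)
        for a, b in zip(members[0::2], members[1::2]):
            out[a][swap_var], out[b][swap_var] = out[b][swap_var], out[a][swap_var]
    return out


# Hypothetical records; every record is treated as at risk in this demo.
records = [
    {"id": 1, "age_band": "30-39", "sex": "F", "county": "A"},
    {"id": 2, "age_band": "30-39", "sex": "F", "county": "B"},
    {"id": 3, "age_band": "30-39", "sex": "F", "county": "C"},
    {"id": 4, "age_band": "60-69", "sex": "M", "county": "A"},
    {"id": 5, "age_band": "60-69", "sex": "M", "county": "D"},
]
swapped = swap_records(
    records,
    swap_var="county",
    match_vars=["age_band", "sex"],
    at_risk=lambda r: True,
    rate=1.0,
    rng=random.Random(7),  # seeded only to make the demo repeatable
)

for row in swapped:
    print(row)
```

Swapping leaves the overall counts for the swapped variable unchanged, while making it uncertain whether any individual record still carries its original value.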