Protecting Identifiers in Human Subjects Data
Introduction to concepts and basic techniques for disclosure analysis and protection of personal and health identifiers in research data for public or restricted access, following applicable JHU data governance policies. See Overview section for details.
Advanced techniques overview
When to consider advanced de-identification techniques
Consider more advanced de-identification techniques when:
- preparing a public access dataset
- de-identification removes too many key variables of interest
- preparing complex quantitative datasets, especially for large samples with multiple variables
Here is an overview of a few of the common advanced techniques:
Collapsing categories: Merge categories with low frequencies (few participants per category) into broader ones, or replace exact values with broader ranges, by creating a coarser coding scheme.
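As a rough illustration of collapsing, the sketch below merges rare categories into a single label and recodes exact ages into ranges. The function names, the minimum count of 5, the "Other" label, and the sample data are all assumptions made for this example, not values prescribed by any policy.

```python
from collections import Counter


def collapse_rare_categories(values, min_count=5, other_label="Other"):
    """Replace any category seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def recode_age_to_range(age, width=10):
    """Map an exact age to a range label of the given width, e.g. 37 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical data: two common occupations and two rare ones.
occupations = ["nurse"] * 6 + ["teacher"] * 5 + ["astronaut"] + ["judge"] * 2
collapsed = collapse_rare_categories(occupations, min_count=5)

print(Counter(collapsed))
print(recode_age_to_range(37))
```

Note that the merged category can itself end up small (here it holds only three records), so frequencies should be rechecked after collapsing.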
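Top and bottom coding: Replace values above an upper threshold or below a lower threshold with the threshold itself, so that extreme outliers no longer stand out (for example, recode every income above a chosen cap to the cap).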
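A minimal sketch of top and bottom coding, assuming numeric values and thresholds chosen by the analyst. The function name `top_bottom_code`, the income figures, and the two cut-offs are hypothetical.

```python
def top_bottom_code(values, lower, upper):
    """Clamp every value into the closed interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    return [min(max(v, lower), upper) for v in values]


# Hypothetical incomes with one low and two high outliers.
incomes = [12_000, 35_000, 48_000, 51_000, 250_000, 1_200_000]
coded = top_bottom_code(incomes, lower=20_000, upper=150_000)

print(coded)  # [20000, 35000, 48000, 51000, 150000, 150000]
```

In practice the thresholds are often set from the data, for instance at a chosen percentile, and the codebook should state that the variable has been top and bottom coded.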
Microaggregation: For a problematic (non-range) variable, group 3-5 similar records and replace each original value with the mean of its group.
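The following is a simplified, single-variable sketch of microaggregation: records are sorted by the variable, split into consecutive groups of at least k, and each value is replaced with its group mean. The function name, the choice of k = 3, and the sample weights are assumptions; published microaggregation methods use more refined grouping, especially for several variables at once.

```python
from statistics import mean


def microaggregate(values, k=3):
    """Replace each value with the mean of its group of at least k nearest values."""
    n = len(values)
    if n < k:
        raise ValueError("need at least k values to form one group")
    # Indices of the records, ordered by value, so neighbours are similar.
    order = sorted(range(n), key=lambda i: values[i])
    result = [None] * n
    start = 0
    while start < n:
        end = start + k
        # If fewer than k records would remain, fold them into this group.
        if n - end < k:
            end = n
        group = order[start:end]
        group_mean = mean(values[i] for i in group)
        for i in group:
            result[i] = group_mean
        start = end
    return result


# Hypothetical body weights in kg; 139 is an identifying outlier.
weights = [60, 95, 57, 102, 63, 100, 139]
masked = microaggregate(weights, k=3)

print(masked)
```

Because every value is replaced by a group mean, the overall mean of the variable is preserved while no released value belongs to fewer than k records.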
Record swapping: For categorical variables, draw a random sample of at-risk records and swap values between closely matched participants. The swaps add "noise" that increases the anonymity of the overall dataset.
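Record swapping can be sketched as below, under several simplifying assumptions: records are dictionaries, the at-risk records are already identified by index, and "closely paired" means the nearest value of one numeric matching field. The function name, the field names `age` and `region`, and the sample records are hypothetical.

```python
import random


def swap_records(records, swap_field, match_field, at_risk, fraction=1.0, seed=0):
    """Swap swap_field between sampled at-risk records and their closest match."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    swapped = [dict(r) for r in records]  # copy; the input is left untouched
    n_pick = round(len(at_risk) * fraction)
    chosen = rng.sample(sorted(at_risk), n_pick)
    used = set()  # each record takes part in at most one swap
    for i in chosen:
        if i in used:
            continue
        candidates = [j for j in range(len(swapped)) if j != i and j not in used]
        if not candidates:
            break
        # Closest partner on the matching field (an assumed similarity rule).
        partner = min(
            candidates,
            key=lambda j: abs(records[j][match_field] - records[i][match_field]),
        )
        swapped[i][swap_field], swapped[partner][swap_field] = (
            swapped[partner][swap_field],
            swapped[i][swap_field],
        )
        used.update((i, partner))
    return swapped


# Hypothetical records; indices 0 and 3 are assumed to be at risk.
people = [
    {"age": 34, "region": "North"},
    {"age": 35, "region": "South"},
    {"age": 52, "region": "East"},
    {"age": 71, "region": "West"},
    {"age": 69, "region": "South"},
]
result = swap_records(people, "region", "age", at_risk={0, 3})

for row in result:
    print(row)
```

Swapping leaves the overall frequency of each category unchanged, which is why it is described as adding noise rather than removing information, although relationships between the swapped variable and other variables can be weakened.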
“Collapsing Data across Observations | SPSS Learning Modules.” n.d. Accessed March 5, 2019. https://stats.idre.ucla.edu/spss/modules/collapsing-data-across-observations/.
Domingo-Ferrer, Josep. 2009a. “Data Rank/Swapping.” In Encyclopedia of Database Systems, edited by Ling Liu and M. Tamer Özsu, 620–21. Boston, MA: Springer US. https://doi.org/10.1007/978-0-387-39940-9_1497.
Domingo-Ferrer, Josep. 2009b. “Microaggregation.” In Encyclopedia of Database Systems, edited by Ling Liu and M. Tamer Özsu, 1736–37. Boston, MA: Springer US. https://doi.org/10.1007/978-0-387-39940-9_1496.
“Top-Coded.” 2018. In Wikipedia. https://en.wikipedia.org/w/index.php?title=Top-coded&oldid=870775236.