Protecting Human Subject Identifiers

Introduction to concepts and basic techniques for disclosure analysis and protection of personal and health identifiers in research data for public or restricted access, following applicable JHU data governance policies. See Overview section for details.

Johns Hopkins Data Services has compiled a list of software tools and applications that can be used to de-identify research data for public sharing. This page is provided for informational purposes only and does not constitute an endorsement of any particular tool. In general, no software provides automatic de-identification. Every tool requires care in setup, an understanding of what it does, and an assessment of the results, both for the changes made and for the remaining utility of the dataset for its purpose. Data Services does not support these tools and has not necessarily tried them, but we would appreciate feedback if you have used them, as well as recommendations of tools not listed here. Investigators and researchers should ensure that they follow the data governance policies and procedures that apply to their data. Refer to your IRB and the Johns Hopkins Privacy Office for further information.

Tabular and Structured Data De-identification Tools

In general, de-identification of tabular data for statistical research or for verification and reuse of shared data requires some expertise by the user to understand the statistical transformations to the data and assess the remaining utility of the data for its purpose.
 
sdcMicro 
sdcApp  
  • Description: Graphical user interface of sdcMicro allowing you to apply disclosure limitation techniques to microdata even if you are not an expert in the R programming language.  

The sdcTable package in R 
  • Description: R package for applying methods for statistical disclosure control in tabular data such as primary and secondary cell suppression.  

  • Download it: https://github.com/sdcTools/sdcTable 

  • Additional Information 
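
As a rough illustration of what primary cell suppression means, the Python sketch below applies a minimum-frequency rule to a small frequency table. It is not sdcTable's API or algorithm. The threshold of 5, the choice to leave zero cells visible, the function name, and the sample counts are all assumptions made for illustration. Secondary suppression, which keeps suppressed cells from being recovered from row and column totals, is not shown.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[str, str]  # (row category, column category)


def primary_suppress(
    counts: Dict[Cell, int], threshold: int = 5
) -> Dict[Cell, Optional[int]]:
    """Replace nonzero counts below `threshold` with None (suppressed).

    Assumption: zero cells are left visible; some disclosure rules
    treat zeros differently.
    """
    return {
        cell: (None if 0 < n < threshold else n)
        for cell, n in counts.items()
    }


# Hypothetical frequency table: age group by diagnosis code.
table = {
    ("18-39", "A"): 42,
    ("18-39", "B"): 3,   # small cell, will be suppressed
    ("40-64", "A"): 17,
    ("40-64", "B"): 0,
}
suppressed = primary_suppress(table)
```

In this sketch only the cell with a count of 3 is replaced; a real workflow would also need to decide which additional cells to suppress so that the hidden value cannot be derived from published totals.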

ARX Data Anonymization Tool 
  • Description: Comprehensive open-source software for anonymizing sensitive personal data. It supports a wide variety of (1) privacy and risk models, (2) methods for transforming data, and (3) methods for analyzing the usefulness of output data. It uses a graphical user interface to remove and transform personal data in tabular data. Available on Windows, Mac, and Linux.

  • Download it: https://arx.deidentifier.org/downloads/ 

  • Additional info: 

Amnesia 
  • Description: Developed in the EU to meet GDPR guidelines for data anonymization, this tool uses de-identification methods such as masking, pseudonymization, and k-anonymity so that you can share your results.

  • Download it: https://amnesia.openaire.eu/download.html 

  • Additional info: 
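
For readers unfamiliar with k-anonymity, the Python sketch below checks whether a small table satisfies it: every combination of quasi-identifier values must be shared by at least k records. This is only a check, not Amnesia's or ARX's implementation, and it does not perform the generalization or suppression those tools use to reach a target k. The column names and sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smallest_group_size(
    rows: List[Dict[str, str]], quasi_identifiers: Sequence[str]
) -> int:
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(groups.values(), default=0)


def is_k_anonymous(
    rows: List[Dict[str, str]], quasi_identifiers: Sequence[str], k: int
) -> bool:
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group_size(rows, quasi_identifiers) >= k


# Hypothetical records; age_band and zip3 are treated as quasi-identifiers.
records = [
    {"age_band": "30-39", "zip3": "212", "diagnosis": "asthma"},
    {"age_band": "30-39", "zip3": "212", "diagnosis": "diabetes"},
    {"age_band": "40-49", "zip3": "212", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "212", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "212", "diagnosis": "asthma"},
]
qi = ["age_band", "zip3"]
```

Here the smallest group has two records, so the table is 2-anonymous but not 3-anonymous with respect to the chosen quasi-identifiers.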

 
mu-Argus 5.1 
tau-Argus 4.1 
 
Privacy Analytics Eclipse
  • Software description: “Eclipse is a modular enterprise software platform that transforms vast stores of sensitive data, safely and at scale, in an automated process. It empowers you to anonymize structured data of any size and sensitivity to the highest possible standard.” This is proprietary commercial software, generally intended for larger-scale enterprise installations on medical record platforms.

  • Intended purpose: Working with structured medical records

Cornell Anonymization Toolkit (CAT)
  • Software description: “designed for interactively anonymizing published dataset to limit identification disclosure of records under various attacker models”  (Has not been updated in the last 10 years.)

  • Intended purpose: Medical records – tabular data

The University of Texas at Dallas Anonymization Toolbox
  • Software description: a researcher-compiled implementation (from UT Dallas Data Security and Privacy Lab) of various anonymization methods into a toolbox for public use by researchers.  (May not have been updated since 2012.)

  • Intended purpose: Unstructured text files

De-identification tools within applications & coding tools

sdcMicro 
sdcApp  
  • Description: Graphical user interface of sdcMicro allowing you to apply disclosure limitation techniques to microdata even if you are not an expert in the R programming language.  

The sdcTable package in R 

REDCap provides advanced de-identification options that can be applied when exporting data, such as removing known identifier fields; removing unvalidated text fields, notes fields, or date fields; shifting dates; and hashing the record names. These options provide greater security and data protection when a user exports sensitive data from REDCap.

  • Protecting Sensitive Participant Information (PHI) in REDCap from Lifespan Biostatistics, Epidemiology, Research Design, and Informatics (BERDI):  An excellent illustrated overview of REDCap de-identification features
  • Date shifting: As part of REDCap's de-identification options, date fields can be shifted to hide the actual dates. Choosing this option shifts all dates in a record by the same amount, preserving the intervals between dates. Date shifting affects only the exported data; the dates saved in the database are left unchanged. Dates can be shifted by up to 364 days. The amount of shift is computed by an algorithm based on each Subject ID, so it is specific to each record while durations among that record's events remain internally consistent. Shifting dates at least plus or minus 180 days meets HIPAA Safe Harbor criteria for de-identifying dates.
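
The idea of per-record date shifting can be sketched in Python as follows. This is not REDCap's actual algorithm: deriving the offset from a keyed hash (HMAC-SHA256) of the record ID, shifting dates backward, and the function names, key, and sample dates are all assumptions made for illustration. The point it shows is that every date in a record moves by the same number of days, so intervals within the record are preserved.

```python
import hashlib
import hmac
from datetime import date, timedelta

MAX_SHIFT_DAYS = 364  # upper bound on the shift, as described above


def shift_days_for(record_id: str, secret_key: bytes) -> int:
    """Derive a stable offset in [0, MAX_SHIFT_DAYS] from the record ID."""
    digest = hmac.new(
        secret_key, record_id.encode("utf-8"), hashlib.sha256
    ).digest()
    return int.from_bytes(digest[:8], "big") % (MAX_SHIFT_DAYS + 1)


def shift_date(d: date, record_id: str, secret_key: bytes) -> date:
    """Shift a date backward by the record's offset (assumed direction)."""
    return d - timedelta(days=shift_days_for(record_id, secret_key))


# Hypothetical key and dates; a real key must be kept secret.
key = b"example-key-not-for-real-use"
enrolled = date(2021, 3, 1)
follow_up = date(2021, 6, 1)

shifted_enrolled = shift_date(enrolled, "subject-001", key)
shifted_follow_up = shift_date(follow_up, "subject-001", key)
```

Because both dates belong to the same record, the gap between the shifted dates equals the gap between the original dates. Using a secret key means someone who knows only the record ID cannot recompute the offset.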

Digital Image De-identification Tools

JHM AI & Data Trust Guidelines for scanned image de-identification

  • Overview and tools list developed by JHU SOM eRadiology

DICOMCleaner 

  • Software description: “DicomCleaner™ is a free, open-source tool with a user interface for importing, process of removing and/or replacing information in the DICOM header, and saving sets of DICOM instances (files)” 

  • Intended purpose: Medical Images in DICOM (Digital Imaging and Communications in Medicine) format 

  • Mac and Windows versions: http://www.osirix-ukusergroup.org/dicom-cleaner  

DicomAnonymizer 

  • A Python package to anonymize DICOM files according to DICOM standards 

  • Need to know how to use Python to use this package 

The De-identification Toolbox (formerly DeID)  

  • A data sharing tool for neuroimaging studies 

  • A Java program to remove identifying information in neuroimaging datasets 

  • Related article: Song, X., Wang, J., Wang, A., Meng, Q., Prescott, C., Tsu, L., & Eckert, M. A. (2015). DeID – a data sharing tool for neuroimaging studies. Frontiers in Neuroscience, 9. https://doi.org/10.3389/fnins.2015.00325  

deepdefacer  

  • Automatic Removal of Facial Features via Deep Learning 

  • An MRI anonymization tool written in Python 

  • Can quickly deface 3D MRI images of any resolution and size 

  • Related article: Khazane, A., Hoachuck, J., Gorgolewski, K. J., Poldrack, R. A. (2022) DeepDefacer: Automatic Removal of Facial Features via U-Net Image Segmentation. [preprint] arXiv, https://doi.org/10.48550/arXiv.2205.15536  

MRI_deface:   

  • Automated Defacing Tools 

  • Also see MiDeFace, a newer tool for defacing 

  • Related article: Bischoff-Grethe, A., Ozyurt, I. B., Busa, E., Quinn, B. T., Fennema-Notestine, C., Clark, C. P., Morris, S., Bondi, M. W., Jernigan, T. L., Dale, A. M., Brown, G. G., & Fischl, B. (2007). A technique for the deidentification of structural brain MR images. Human Brain Mapping, 28(9), 892–903. https://doi.org/10.1002/hbm.20312  

Pydeface:  

  • A tool to remove facial structure from MRI images 

  • A Python package 

  • Related article: Omer Faruk Gulban, Dylan Nielson, john lee, Russ Poldrack, Chris Gorgolewski, Vanessasaurus, & Chris Markiewicz. (2022). poldracklab/pydeface: PyDeface v2.0.2 (v2.0.2). Zenodo. https://doi.org/10.5281/zenodo.6856482   

Quickshear:  

  • Uses a skull-stripped version of an anatomical image as a reference to deface the unaltered anatomical image 

  • A Python package 

  • Related article: Schimke, Nakeisha, and John Hale. "Quickshear defacing for neuroimages." Proceedings of the 2nd USENIX conference on Health security and privacy. USENIX Association, 2011. 

BIDSonym:  

  • Related article: Herholz, P., Ludwig, R. M., & Poline, J. (2021, January 24). BIDSonym - a BIDSapp for the pseudo-anonymization of neuroimaging datasets. https://doi.org/10.31234/osf.io/3aknq  

Qualitative and Unstructured Text Data De-identification Tools

NLM-Scrubber
  • Software description: “A freely available, HIPAA compliant, clinical text de-identification tool designed and developed at the National Library of Medicine.” It works on records converted to ASCII text and runs from a command-line/terminal interface in Linux or Windows.

  • Intended purpose: Uses natural language processing to automatically redact direct identifiers typically found in medical records, including addresses below state level, names, dates, and alphanumeric identifiers such as patient account numbers. It attempts to follow HIPAA rules for the level of specificity to retain or remove (e.g., removing the city but retaining the state).

deid software package
  • Software description: “includes code and dictionaries for automated location and removal of protected health information (PHI) in free text from medical records”

  • Intended purpose: For free text in medical records
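
To give a sense of what automated location and removal of identifiers in free text involves, the Python sketch below replaces a few pattern-matched identifiers with labels. It is far simpler than the tools listed here and is not the deid package's code: the three regular expressions, the labels, and the sample note are assumptions for illustration, and the sketch does nothing about names, addresses, or other identifiers that require dictionaries or language processing.

```python
import re

# Hypothetical patterns; real tools use much larger rule sets and dictionaries.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "ID": re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE),
}


def redact(text: str) -> str:
    """Replace each pattern match with a bracketed label such as [DATE]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Seen on 03/14/2021, MRN: 123456. Call 410-555-0100 to follow up."
redacted = redact(note)
# redacted == "Seen on [DATE], [ID]. Call [PHONE] to follow up."
```

Output from any such approach still needs manual review, since identifiers written in unexpected formats will pass through unchanged.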

UCSF Philter
Privacy Analytics Unstructured Text Anonymization
  • Software description: “Turn personal information in unstructured text—from medical reports to customer feedback—into format-preserved, compliant data. Our offering helps you innovate while protecting privacy.” This is a commercial product intended usually for enterprise purchases for large scale text record systems, specializing in medical records.

  • Overview PDF

  • Intended purpose: Unstructured medical records