Protecting Human Subject Identifiers
Software for De-identification
Johns Hopkins Data Services has compiled this list of software tools and applications that can help de-identify research data for public sharing. The list is provided for informational purposes only and does not constitute an endorsement of any particular tool. No software de-identifies data automatically: every tool requires care in setup, an understanding of what it does, and a review of the output for both the changes made and the remaining utility of the dataset for its purpose. Data Services does not support these tools and has not necessarily tested them, but we welcome feedback from anyone who has used them, as well as recommendations of tools not listed here. Investigators and researchers should follow the data governance policies and procedures that apply to their data. Refer to your IRB and the Johns Hopkins Privacy Office for further information.
Tabular and Structured Data De-identification Tools
sdcMicro
- Description: R package for the generation of anonymized microdata, i.e., for the creation of public- and scientific-use files. In addition, various risk estimation methods are included.
- Download it: http://sdctools.github.io/sdcMicro/index.html
- CRAN: https://cran.r-project.org/web/packages/sdcMicro/sdcMicro.pdf
sdcApp
- Description: Graphical user interface for sdcMicro that lets you apply disclosure limitation techniques to microdata even if you are not an expert in the R programming language.
- Find it: https://shiny.posit.co/r/gallery/life-sciences/sdcapp-microdata/
- Manual: https://sdctools.github.io/sdcMicro/articles/sdcMicro.html
The sdcTable package in R
- Description: R package for applying methods for statistical disclosure control in tabular data such as primary and secondary cell suppression.
- Download it: https://github.com/sdcTools/sdcTable
- Additional information:
- CRAN: https://cran.r-project.org/web/packages/sdcTable/index.html
- Manual: https://sdctools.github.io/sdcTable/articles/sdcTable.html
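As a rough illustration of what primary cell suppression means, the Python sketch below builds a small frequency table and blanks any cell whose count falls below a threshold. It does not use sdcTable; the function names, the sample records, and the threshold of 3 are assumptions made for the example.

```python
from collections import Counter

def frequency_table(records, row_key, col_key):
    """Count records for each (row, column) combination."""
    return Counter((r[row_key], r[col_key]) for r in records)

def suppress_small_cells(table, threshold=3):
    """Primary suppression: hide non-zero counts below the threshold."""
    return {
        cell: (None if 0 < count < threshold else count)
        for cell, count in table.items()
    }

# Hypothetical records; a real table would come from your dataset.
records = [
    {"county": "A", "diagnosis": "flu"},
    {"county": "A", "diagnosis": "flu"},
    {"county": "A", "diagnosis": "flu"},
    {"county": "A", "diagnosis": "rare"},
    {"county": "B", "diagnosis": "flu"},
    {"county": "B", "diagnosis": "flu"},
    {"county": "B", "diagnosis": "flu"},
    {"county": "B", "diagnosis": "flu"},
]

table = frequency_table(records, "county", "diagnosis")
protected = suppress_small_cells(table, threshold=3)
for cell, count in sorted(protected.items()):
    print(cell, "suppressed" if count is None else count)
```

Primary suppression alone is usually not enough: if row and column totals are published, a suppressed cell can often be recovered by subtraction. Secondary suppression, which hides additional cells to prevent this, is not shown in the sketch.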
ARX Data Anonymization Tool
- Description: A comprehensive open-source tool for anonymizing sensitive personal data. It supports a wide variety of (1) privacy and risk models, (2) methods for transforming data, and (3) methods for analyzing the usefulness of output data. It uses a graphical user interface to remove and transform personal data in tabular data. Available on Windows, Mac, and Linux.
- Download it: https://arx.deidentifier.org/downloads/
- Additional info:
- User manual: https://arx.deidentifier.org/anonymization-tool/
- Publications about it: https://arx.deidentifier.org/publications/
- Video titled “Anonymizing health data with ARX to promote Open Science on COVID-19”: https://www.youtube.com/watch?v=E5t2Cv5FbmA
Amnesia
- Description: Developed in the EU to meet GDPR guidelines for data anonymization, this tool applies de-identification methods such as masking, pseudonymization, and k-anonymity so that you can share your results.
- Download it: https://amnesia.openaire.eu/download.html
- Additional info:
- Documentation: https://amnesia.openaire.eu/about-documentation.html
- Tutorials: https://amnesia.openaire.eu/tutorials.html
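To make the k-anonymity idea concrete, here is a minimal Python sketch that measures k for a small table before and after generalizing two quasi-identifiers. It is not how Amnesia or ARX are implemented; the column names, the sample rows, and the generalization rules (10-year age bands, 3-digit ZIP prefixes) are assumptions chosen for illustration.

```python
from collections import Counter

def k_of(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share
    the same combination of quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

def generalize(row, age_width=10, zip_keep=3):
    """Coarsen age into a band and mask the trailing ZIP digits."""
    low = (row["age"] // age_width) * age_width
    masked_zip = row["zip"][:zip_keep] + "*" * (len(row["zip"]) - zip_keep)
    return {**row, "age": f"{low}-{low + age_width - 1}", "zip": masked_zip}

# Hypothetical rows; "dx" is the sensitive attribute, not a quasi-identifier.
rows = [
    {"age": 34, "zip": "21218", "dx": "asthma"},
    {"age": 37, "zip": "21211", "dx": "flu"},
    {"age": 52, "zip": "21230", "dx": "flu"},
    {"age": 58, "zip": "21231", "dx": "diabetes"},
]
QI = ["age", "zip"]

generalized = [generalize(r) for r in rows]
print("k before:", k_of(rows, QI))
print("k after: ", k_of(generalized, QI))
```

A dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The tools listed here search for generalizations that reach a target k while losing as little detail as possible, which this sketch does not attempt.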
mu-Argus 5.1
- Description: Tool for researchers to apply disclosure limitation techniques to create safe microdata files. Runs on Windows.
- Download it: https://github.com/sdcTools/muargus/releases
- Manual: https://research.cbs.nl/casc/Software/MUmanual5.1.3.pdf
tau-Argus 4.1
- Description: Software designed to protect statistical tables. Runs on Windows.
- Download it: https://github.com/sdcTools/tauargus/releases
- Getting started: https://github.com/sdcTools/manuals/blob/master/tau-argus/Getting%20started%20with%20TauArgus.pdf
Privacy Analytics Eclipse
- Software description: “Eclipse is a modular enterprise software platform that transforms vast stores of sensitive data, safely and at scale, in an automated process. It empowers you to anonymize structured data of any size and sensitivity to the highest possible standard.” This is proprietary commercial software, generally intended for large-scale enterprise installations on medical record platforms.
- Intended purpose: Working with structured medical records
Cornell Anonymization Toolkit (CAT)
- Software description: “designed for interactively anonymizing published dataset to limit identification disclosure of records under various attacker models” (Has not been updated in the last 10 years.)
- Intended purpose: Medical records – tabular data
The University of Texas at Dallas Anonymization Toolbox
- Software description: A researcher-compiled implementation (from the UT Dallas Data Security and Privacy Lab) of various anonymization methods, packaged as a toolbox for public use by researchers. (May not have been updated since 2012.)
- Intended purpose: Unstructured text files
De-identification tools within applications & coding tools
Research Electronic Data Capture (REDCap)
REDCap is a mature, secure web application for building and managing online surveys and databases. Reach out to redcap@jhu.edu for more information or sign up for one of their Zoom-in clinic sessions if you need help with REDCap. REDCap Training Central is for people who are interested in using REDCap and want to learn more about it.
REDCap provides advanced de-identification options for data exports, such as removing known identifier fields; removing unvalidated text fields, notes fields, or date fields; shifting dates; and hashing record names. These options provide greater security and data protection when a user exports sensitive data from REDCap.
- Protecting Sensitive Participant Information (PHI) in REDCap from Lifespan Biostatistics, Epidemiology, Research Design, and Informatics (BERDI): An excellent illustrated overview of REDCap de-identification features
- Date shifting: As part of the de-identification options in REDCap, date fields may be shifted to hide the actual dates. Choosing this option shifts all dates in a record by the same amount, preserving the intervals between them. Date shifting leaves the database record intact and does not affect the dates saved in the database. Data exports contain the shifted dates. Dates can be shifted by up to 364 days. The algorithm assigns a shift value based on each Subject ID, so the amount of shift is specific to each record while durations among events remain internally consistent. Shifting dates at least plus or minus 180 days meets HIPAA Safe Harbor criteria for de-identifying dates.
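The general idea behind per-record date shifting and record-name hashing can be sketched in a few lines of Python. This is not REDCap's actual algorithm, which is not described here; the keyed hash (HMAC-SHA256), the secret value, the record ID, and all function names are assumptions made for the example.

```python
import hashlib
import hmac
from datetime import date, timedelta

MAX_SHIFT_DAYS = 364  # upper bound taken from the description above

def shift_days(record_id: str, secret: bytes) -> int:
    """Derive a stable per-record offset in the range 1..364 days."""
    digest = hmac.new(secret, record_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % MAX_SHIFT_DAYS + 1

def shift_dates(record_id: str, dates, secret: bytes):
    """Move every date in one record back by that record's offset."""
    offset = timedelta(days=shift_days(record_id, secret))
    return [d - offset for d in dates]

def hashed_record_name(record_id: str, secret: bytes) -> str:
    """Replace the record name with a keyed hash (truncated for readability)."""
    return hmac.new(secret, record_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

secret = b"example-project-secret"  # hypothetical; keep a real key private
visits = [date(2023, 1, 10), date(2023, 3, 1)]

shifted = shift_dates("subject-001", visits, secret)
print(hashed_record_name("subject-001", secret), shifted)
print("interval preserved:", shifted[1] - shifted[0] == visits[1] - visits[0])
```

Because the offset depends only on the record ID and the secret, every export shifts a given record by the same amount, and the number of days between any two events in that record is unchanged.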
Digital Image De-identification Tools
JHM AI & Data Trust Guidelines for scanned image de-identification
- Overview and tools list developed by JHU SOM eRadiology
DICOMCleaner
- Software description: DicomCleaner™ is a free, open-source tool with a user interface for importing sets of DICOM instances (files), removing and/or replacing information in the DICOM header, and saving the results.
- Intended purpose: Medical images in DICOM (Digital Imaging and Communications in Medicine) format
- Mac and Windows versions: http://www.osirix-ukusergroup.org/dicom-cleaner
DicomAnonymizer
- A Python package to anonymize DICOM files according to DICOM standards
- Requires familiarity with Python
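Header cleaning of the kind these DICOM tools perform amounts to removing some attributes and replacing others. The Python sketch below models a header as a plain dictionary keyed by DICOM attribute keywords to show the logic only. It does not read real DICOM files and is not the DicomCleaner or DicomAnonymizer API; the attribute lists are illustrative and far from complete, and the replacement values are hypothetical.

```python
# Attributes to drop entirely (illustrative subset, not a complete profile).
REMOVE = {"PatientBirthDate", "PatientAddress", "ReferringPhysicianName", "InstitutionName"}

# Attributes to overwrite with study-specific placeholder values (hypothetical).
REPLACE = {"PatientName": "ANONYMOUS", "PatientID": "SUBJ-0001"}

def clean_header(header, remove=REMOVE, replace=REPLACE):
    """Return a copy of the header with identifying attributes removed or replaced."""
    cleaned = {}
    for keyword, value in header.items():
        if keyword in remove:
            continue  # drop the attribute
        cleaned[keyword] = replace.get(keyword, value)
    return cleaned

# Hypothetical header values for demonstration.
header = {
    "PatientName": "Doe^Jane",
    "PatientID": "483920",
    "PatientBirthDate": "19700101",
    "InstitutionName": "Example Hospital",
    "Modality": "MR",
    "StudyDescription": "Brain MRI",
}

for keyword, value in clean_header(header).items():
    print(keyword, "=", value)
```

A sketch like this covers header attributes only. Identifying text burned into the pixel data and facial features in head scans need separate handling, which is what the defacing tools listed in this section address.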
The De-identification Toolbox (formerly DeID)
- A data sharing tool for neuroimaging studies
- A Java program that removes identifying information from neuroimaging datasets
- Related article: Song, X., Wang, J., Wang, A., Meng, Q., Prescott, C., Tsu, L., & Eckert, M. A. (2015). DeID – a data sharing tool for neuroimaging studies. Frontiers in Neuroscience, 9. https://doi.org/10.3389/fnins.2015.00325
deepdefacer
- Automatic removal of facial features via deep learning
- An MRI anonymization tool written in Python
- Can quickly deface 3D MRI images of any resolution and size
- Related article: Khazane, A., Hoachuck, J., Gorgolewski, K. J., Poldrack, R. A. (2022) DeepDefacer: Automatic Removal of Facial Features via U-Net Image Segmentation. [preprint] arXiv, https://doi.org/10.48550/arXiv.2205.15536
MRI_deface:
- Automated defacing tools
- Also see MiDeFace, a newer tool for defacing
- Related article: Bischoff-Grethe, A., Ozyurt, I. B., Busa, E., Quinn, B. T., Fennema-Notestine, C., Clark, C. P., Morris, S., Bondi, M. W., Jernigan, T. L., Dale, A. M., Brown, G. G., & Fischl, B. (2007). A technique for the deidentification of structural brain MR images. Human Brain Mapping, 28(9), 892–903. https://doi.org/10.1002/hbm.20312
Pydeface:
- A tool to remove facial structure from MRI images
- A Python package
- Related article: Omer Faruk Gulban, Dylan Nielson, john lee, Russ Poldrack, Chris Gorgolewski, Vanessasaurus, & Chris Markiewicz. (2022). poldracklab/pydeface: PyDeface v2.0.2 (v2.0.2). Zenodo. https://doi.org/10.5281/zenodo.6856482
Quickshear:
- Uses a skull-stripped version of an anatomical image as a reference to deface the unaltered anatomical image
- A Python package
- Related article: Schimke, Nakeisha, and John Hale. "Quickshear defacing for neuroimages." Proceedings of the 2nd USENIX conference on Health security and privacy. USENIX Association, 2011.
BIDSonym:
- A BIDS App for the de-identification of neuroimaging data
- Gathers all T1w images from a BIDS dataset and applies one of several popular de-identification algorithms: MRI_deface, Pydeface, Quickshear, or mridefacer
- Follows the BIDS Apps standards
- Related article: Herholz, P., Ludwig, R. M., & Poline, J. (2021, January 24). BIDSonym - a BIDSapp for the pseudo-anonymization of neuroimaging datasets. https://doi.org/10.31234/osf.io/3aknq
Qualitative and Unstructured Text Data De-identification Tools
NLM-Scrubber
- Software description: “A freely available, HIPAA compliant, clinical text de-identification tool designed and developed at the National Library of Medicine.” It works on records converted to ASCII text and runs from a command-line/terminal interface on Linux or Windows.
- Intended purpose: Uses natural language processing to automatically redact direct identifiers typically found in medical records, including addresses below state level, names, dates, and alphanumeric identifiers such as patient account numbers. It attempts to follow HIPAA rules for the level of specificity to retain or remove (for example, removing the city but retaining the state).
deid software package
- Software description: “includes code and dictionaries for automated location and removal of protected health information (PHI) in free text from medical records”
- Intended purpose: For free text in medical records
UCSF Philter
- A command-line clinical text de-identification tool that removes protected health information (PHI) from any plain text file. It primarily identifies direct identifiers and can be given “black list” terms to flag with annotations.
- GitHub - BCHSI/philter-ucsf: Open source clinical text de-identification
- Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes (article) – UCSF developed tool
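The simplest form of text de-identification combines pattern matching with a list of blocked terms, as in the Python sketch below. It is only a toy: it is not the method used by Philter, NLM-Scrubber, or the deid package, which rely on much richer rules, dictionaries, and language processing. The patterns, labels, sample note, and function name are assumptions made for the example.

```python
import re

# Illustrative patterns only; real clinical text needs far broader coverage.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

def redact(text, blocked_terms=()):
    """Replace pattern matches and blocked terms with bracketed labels."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for term in blocked_terms:
        text = re.sub(rf"\b{re.escape(term)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return text

# Fictional note for demonstration.
note = "Jane Doe (MRN: 483920) seen on 03/14/2022; call 410-555-0123."
clean = redact(note, blocked_terms=["Jane Doe"])
print(clean)  # [NAME] ([MRN]) seen on [DATE]; call [PHONE].
```

Rules like these miss identifiers they were not written for, such as misspelled names or dates written out in words, so redacted output still needs manual review.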
Privacy Analytics Unstructured Text Anonymization
- Software description: “Turn personal information in unstructured text—from medical reports to customer feedback—into format-preserved, compliant data. Our offering helps you innovate while protecting privacy.” This is a commercial product, usually intended for enterprise purchase for large-scale text record systems, that specializes in medical records.
- Overview PDF
- Intended purpose: Unstructured medical records