Protecting Identifiers in Human Subjects Data

Introduction to concepts and basic techniques for disclosure analysis and protection of personal and health identifiers in research data for public or restricted access, following applicable JHU data governance policies. See Overview section for details.

Protecting identifiers during data collection

Before collecting human subjects data in the field or clinic:

  • If recruiting participants, keep their contact info secure and separate from materials you bring to the field.
  • Prior to data collection, prepare an anonymization scheme and/or secure key code list.
  • Document identifiers and their substitutions in a secure list or codebook. Note which variables could contain indirect identifiers to check when de-identifying the data.
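
The key code list described above could be prepared along these lines. This is a minimal sketch, not a prescribed procedure: the function name `build_key_list`, the `P-` prefix, and the sample names are all hypothetical, and the resulting list would still need to be stored encrypted and apart from the study data.

```python
import secrets

def build_key_list(participants, prefix="P-", n_bytes=4):
    """Map each participant name to a unique random code.

    Random codes (rather than sequential numbers or initials) avoid
    leaking enrollment order or any part of the name.
    Assumes the names in `participants` are distinct.
    """
    key = {}
    used = set()
    for name in participants:
        code = prefix + secrets.token_hex(n_bytes).upper()
        while code in used:  # regenerate on the rare collision
            code = prefix + secrets.token_hex(n_bytes).upper()
        used.add(code)
        key[name] = code
    return key

# Hypothetical participant names, for illustration only
key_list = build_key_list(["Ana Ruiz", "Ben Cho", "Dee Park"])
for name, code in key_list.items():
    print(code, "<-", name)
```

Only the codes would travel to the field; the name-to-code mapping stays in the secure list.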

During collection:

  • Protect mobile devices with passwords and encryption.
  • Use pseudonyms or codes instead of bringing participant information to research sites.
  • Collect only the identifiers necessary for the research.

In the project office: 

  • Turn on full disk encryption (BitLocker for Windows, FileVault for Mac). This is an institutional requirement. [See IT@JH's Device Encryption policy] Once a user is logged in, however, sensitive files may still be exposed to hackers or to human error. Programs like VeraCrypt can encrypt identifier files and folders when they are not in use, and especially before they are transmitted to cloud backup or colleagues.
  • Keep subject ID codes and master keys encrypted and stored separately from the data.
  • Use the codebook or data dictionary to flag at-risk variables to alter before sharing, e.g., by converting exact values to ranges or broader categories.
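
Converting exact values to ranges can be illustrated with ages. The sketch below is one possible approach under stated assumptions: the function name `age_to_range`, the 10-year band width, and the top-code of 90 are illustrative choices, not requirements taken from this guide.

```python
def age_to_range(age, width=10, top_code=90):
    """Replace an exact age with a band such as '30-39'.

    Ages at or above `top_code` are grouped into one open-ended
    category so that the oldest participants do not stand out.
    """
    if age >= top_code:
        return f"{top_code}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

ages = [23, 37, 41, 90, 95]  # made-up values for illustration
print([age_to_range(a) for a in ages])
# → ['20-29', '30-39', '40-49', '90+', '90+']
```

Wider bands lower disclosure risk but also reduce analytic detail, so the width is a judgment call for each dataset.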

During analysis

Finding and removing identifiers during data analysis

It is often most efficient to locate and change identifiers encountered while analyzing the data, rather than hunting for them at a project’s end.

Create a working copy of data for altering identifiers. Use data analysis software to mark instances of identifiers to find and change later:

Tabular data

  • For tabular data, clone each column that contains identifiers and alter the clone, keeping it next to the original values; the altered column is the one to use in a shared version.
  • Look for outliers: participants with uncommon combinations of variable values that can act as indirect identifiers.
  • Some software tools help locate identifiers, such as by identifying addresses or names. Others can assist in manually marking identifiers to change, such as Qualitative Data Analysis software that can directly code untranscribed audio/video files. [List of software tools to assist in disclosure protection]
  • The goal is to create a working version of the data for analysis in which most variables are either already de-identified or marked for removal from shared versions.
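
The column cloning and outlier check for tabular data can be sketched as follows. This is an illustration under assumptions, not a complete disclosure analysis: the records, the column names (`age_shared`, `flag_rare`), the choice of variables to combine, and the threshold `K = 2` are all hypothetical.

```python
from collections import Counter

# Made-up working copy of a small table
rows = [
    {"id": "P-01", "age": 34, "occupation": "teacher", "county": "A"},
    {"id": "P-02", "age": 38, "occupation": "teacher", "county": "A"},
    {"id": "P-03", "age": 31, "occupation": "teacher", "county": "A"},
    {"id": "P-04", "age": 36, "occupation": "pilot", "county": "A"},
    {"id": "P-05", "age": 62, "occupation": "teacher", "county": "B"},
]

# Clone the age column: the original stays, the altered copy sits beside it
for row in rows:
    low = (row["age"] // 10) * 10
    row["age_shared"] = f"{low}-{low + 9}"

# Count how many participants share each combination of variables
combo_vars = ("age_shared", "occupation", "county")
counts = Counter(tuple(r[v] for v in combo_vars) for r in rows)

K = 2  # hypothetical threshold: combinations held by fewer than K people are flagged
for row in rows:
    row["flag_rare"] = counts[tuple(row[v] for v in combo_vars)] < K

flagged = [r["id"] for r in rows if r["flag_rare"]]
print(flagged)
# → ['P-04', 'P-05']
```

Flagged rows are candidates for further generalization or suppression before a shared version is produced.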