Protecting Identifiers in Human Subjects Data

An introduction to concepts and basic techniques for disclosure analysis and for protecting personal and health identifiers in research data intended for public or restricted access, following applicable JHU data governance policies. See the Overview section for details.

Definitions of disclosure risk

What is disclosure risk?

Here are some central terms for these guidelines:

Personal Identifiers: Private information that subjects expect not to be made public and that is linked to, or associated with, a unique individual.

Common acronyms for personal identifiers, PII & PHI:

PII: Personally Identifiable Information

  1. "Any information maintained by an agency…used to distinguish or trace an individual’s identity…
  2. Any other information that is linked or linkable to an individual."  (NIST SP 800-122)

PHI: Protected Health Information

  1. Created or received by a health care provider
  2. Relating to physical or mental health of an individual or provision of care (past, present, or future) and (i) that identifies or (ii) could be used to identify the individual. (HIPAA's Privacy Rule)

Types of Identifiers

Two types of identifiers may be collected during research, and both need protection from being revealed if the data are shared (either on purpose or accidentally!).

Direct identifiers are information that directly links records to subjects, and to people or institutions associated with them.

HIPAA lists 18 typical direct identifiers for PHI as part of the standards for patient protection used by the U.S. Department of Health and Human Services. See the tab for a summary list.

Indirect identifiers, also called inferential identifiers or quasi-identifiers, can be more challenging to locate and protect.

These are characteristics that may not be unique in the whole population, but are unique in your particular sample, and can be correlated with other information to act as direct identifiers and re-identify one or more participants in a study. See the tab for examples.

HIPAA's 18 typical direct identifiers for PHI, from the U.S. Department of Health and Human Services standards for patient protection:

1. Names
2. Geographic divisions smaller than a state (e.g., census tract)
3. Dates (except year) related to the individual; ages over 89
4-5. Phone and fax numbers
6. Email addresses
7. Social Security numbers
8-10. Medical record, health plan, and account numbers
11-13. Certificate, license, and vehicle/device numbers
14-15. URLs and IP addresses
16. Biometric identifiers (e.g., fingerprints, voice recordings)
17. Full-face photos and comparable images
18. Any other unique identifying number, characteristic, or code

See HIPAA's full list here

Indirect or Inferential Identifiers: information that could be used to re-identify one or more participants in a study when combined with external information, such as dates, location, demographic information (race, ethnicity), or socioeconomic variables (occupation, salary).

Examples of indirect, inferential, or quasi-identifiers:

  • Variables not themselves unique that could be correlated with other information to become direct identifiers: dates, location, demographic info (race, ethnicity), or socioeconomic variables (occupation, salary)
  • Variables or info combined with external information (e.g., government databases, social media profiles): knowing family size, region, and dates of birth yields matches in a national health registry
  • Outlier subjects: characteristics common in the population but unique in the sample, e.g., only one pregnant female veteran in the sample
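
The outlier risk in the last example can be checked mechanically: count how often each combination of quasi-identifiers occurs in the sample, and flag any combination that occurs only once. A minimal Python sketch, using hypothetical variables and sample values:

```python
from collections import Counter

# Hypothetical sample records, reduced to three quasi-identifiers:
# (sex, veteran_status, pregnancy_status)
records = [
    ("F", "veteran",  "pregnant"),
    ("F", "veteran",  "not pregnant"),
    ("M", "veteran",  "not pregnant"),
    ("F", "civilian", "pregnant"),
    ("F", "civilian", "pregnant"),
]

# Count how many records share each quasi-identifier combination.
counts = Counter(records)

# Any combination held by exactly one record is an outlier (k = 1):
# that participant could be singled out by linking to external data.
outliers = [combo for combo, k in counts.items() if k == 1]
print(len(outliers))  # prints 3
```

Real studies would run this over every plausible combination of quasi-identifier columns, since risk grows as more variables are combined.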


Disclosure is the term typically used to refer to inappropriate attribution of information to an individual or organization without their approval. (Any such approval usually takes the form of the terms within the consent forms that participants sign.)

There are three levels of disclosure risk to look out for:

  • Identity disclosure: subject can be directly identified. A more benign form of this is when someone can discover that a patient was in a study.
  • Attribute disclosure: reveals sensitive information about subject, e.g. HIV status
  • Inferential disclosure: released data makes it easier to determine a characteristic of a subject, through links to external information (such as Facebook profiles or other internet sources) or to outlier variables: unusual combinations of identifiers that narrow the data down to certain participants.

Does your study have disclosure risk?

Evaluate for disclosure risk (with examples):

  • Geographically specific: within a city or county
  • Small samples: organization-specific clinical patients
  • Purposive design: longitudinal follow-up, snowball sampling
  • Matching external files: city records database
  • Sensitive content: health or lifestyle risk factors
  • Vulnerable subjects: under the age of majority (usually below age 16-18)
  • Detailed variables (typically 5 or more): demographic, occupational, or biomedical variables


What is a de-identified dataset?

De-Identified for public access vs. Limited for restricted access

We often hear investigators say that they have de-identified a dataset, but have they? There are crucial differences between data that has been fully de-identified for public access and data that is only partially protected. A central factor is whether indirect/inferential identifiers remain in the dataset.

Anonymization can be used as a broader term encompassing two types of tasks that reduce disclosure risk for identifiers: masking and de-identifying.

  • Masking: techniques that alter direct identifiers so that the original is no longer usable for analysis. This includes deleting items like Social Security numbers, but also replacing identifiers with pseudonyms or codes (often randomized), whether or not the association with the original value is retained as a separate list.
  • De-identification: techniques that apply minimal distortion of data so that they retain utility for analysis, while adequately protecting privacy. Methods include generalizing data elements (e.g. replacing age with range values) to add anonymity to direct identifiers, or more advanced statistical techniques to adequately reduce risk of re-identification, such as suppression of outlier values, grouped averaging (micro-aggregation) or record swapping. (See the introduction to advanced techniques section).  (As a practical matter, the term "de-identification" is often broadened to include masking.)
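
As a rough illustration of these two tasks, here is a minimal Python sketch of masking a name with a random code and generalizing an age into a range. The function names and record layout are hypothetical, not part of any JHU tool:

```python
import secrets

# Masking: replace a direct identifier with a random pseudonym code.
# The lookup table linking codes back to names is stored separately
# (or destroyed entirely, if re-linking will never be needed).
def mask_name(name, lookup):
    code = "P" + secrets.token_hex(4)  # e.g., "P3fa91c02"
    lookup[code] = name
    return code

# Generalization: replace an exact age with a 10-year range, adding
# anonymity while keeping the variable usable for analysis.
def generalize_age(age, width=10):
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

lookup = {}
record = {"name": "Jane Doe", "age": 42}
masked = {
    "name": mask_name(record["name"], lookup),
    "age": generalize_age(record["age"]),  # "40-49"
}
```

Whether to keep or destroy the `lookup` table is itself a governance decision: keeping it allows authorized re-linking, while destroying it makes the masking irreversible.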

HIPAA's Privacy Rule offers a widely accepted standard for which datasets earn the label "de-identified." Investigators should be prepared to assess the extent to which their data should meet these criteria for whether and how these data will be accessed by collaborators, other researchers, or the public.

HIPAA defines three contexts for preparing PHI and PII data for access:

  • Limited Data Sets (LDS): Remove or anonymize 16 direct identifiers, including "facial" identifiers. Certain dates, birthdates, and locations down to the ZIP Code level may remain. Indirect identifiers may also remain if not easily removed. §164.514(e)
  • "Safe Harbor" anonymization level: Remove all 18 direct identifiers, truncate ZIP Codes to 3 digits, and reduce dates to year only. Alter indirect/quasi-identifiers enough to limit "actual knowledge" of data that could, alone or in combination with other information, re-identify a data subject (e.g., an "outlier" variable with one 98-year-old patient). §164.514(b)
  • "Expert determination" Statistically De-identified Datasets: Removing or masking all direct and indirect identifiers. Typically, statistical techniques are applied to make remaining risk "very small." There is, however, no official external definition of what risk is acceptable, which can vary by case. The "experts" referred to are those knowledgeable and experienced in assessing and mitigating disclosure risk, and, ideally, those trained in the statistical techniques who can adequately assist in preparing datasets. Consult with JHU compliance officials (IRB, JHM Data Trust Council for SOM projects) when considering releasing Statistically de-identified data. JHU Data Services can also help with plans for sharing data §164.514(b)(1). (See also the sections on IRB and JHM Data Trust Council)
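
The Safe Harbor transformations described above (ZIP truncation, year-only dates, top-coding ages over 89) can be sketched in a few lines of Python. The restricted three-digit ZIP set shown is an illustrative subset, not the authoritative list:

```python
from datetime import date

# Three-digit ZIP areas with populations <= 20,000 must be recoded
# to "000" under Safe Harbor; this set is an illustrative subset
# only, not the full authoritative list.
RESTRICTED_ZIP3 = {"036", "059", "063", "102", "203", "556", "692"}

def safe_harbor_zip(zip_code):
    """Truncate a ZIP Code to its first three digits."""
    zip3 = zip_code[:3]
    return "000" if zip3 in RESTRICTED_ZIP3 else zip3

def safe_harbor_date(d):
    """Keep only the year of a date related to the individual."""
    return d.year

def safe_harbor_age(age):
    """Aggregate ages over 89 into a single category."""
    return "90+" if age > 89 else age

print(safe_harbor_zip("21218"))            # prints 212
print(safe_harbor_date(date(1998, 7, 4)))  # prints 1998
print(safe_harbor_age(98))                 # prints 90+
```

Note that these mechanical steps cover only the enumerated identifiers; Safe Harbor additionally requires that the covered entity have no actual knowledge that the remaining data could re-identify a subject.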

Two options for accessing and sharing data, based on the method of de-identification applied:

  • Restricted access: Since risk may remain from indirect identifiers, restricted access should be maintained for Limited Data Sets and, in most instances, for data at Safe Harbor levels. This means sharing data only with approved collaborators or investigators under a Data Use Agreement that specifies their responsibility for protecting the data. Certain data repositories and databases may be approved for restricted-access datasets. (See SOM IRB: Definition of Limited Data Set)
  • Public access: Statistically de-identified datasets are generally approved for public release and deposit in open-access data repositories. Such approval generally requires screening for remaining disclosure risk by trained experts. Releasing data derived from most Johns Hopkins Medicine sources, including Epic, requires review and approval by the JHM Data Trust Council.

Choices for de-identification levels

What level of de-identification should I apply?

The choice often depends on the plans for accessing and sharing data. For nearly all human subjects research, removing or obscuring at least some identifiers is appropriate. Removing identifiers not required for analysis, or replacing them with pseudonyms, codes, and categories, is a minimal best practice. The following chart summarizes your choices:

Bottom line: Not necessarily a do-it-yourself activity

The takeaway message for JHU Researchers:

De-identification is more than removing names. HIPAA calls statistical de-identification "Expert determination" because it can take more expertise than a quick do-it-yourself approach if the datasets are large, complex, or have many variables that potentially link to external information. Making such data publicly available may require preparation and review by statisticians trained in risk reduction. As mentioned, medical research subject to oversight by the JHM Data Trust Council has additional requirements regarding approval of disclosure protection and de-identification of data shared externally.

In most cases, aim to produce limited datasets at or approaching the "Safe Harbor" level, with secure restricted access only for those who will share responsibility if identities are disclosed.

That said, it is possible, even for health data, that datasets or representative portions of datasets can be successfully de-identified and approved for unrestricted release to other researchers for reference or further analysis. The other sections of these guides introduce some of these techniques, as well as guidance for planning and implementing anonymization throughout the research process.