Protecting Identifiers in Human Subjects Data

Introduction to concepts and basic techniques for disclosure analysis and protection of personal and health identifiers in research data for public or restricted access, following applicable JHU data governance policies. See Overview section for details.

5 steps for removing identifiers from datasets

Here are five categories of tasks for preparing datasets for sharing, either among collaborators or for restricted or public access, depending on the extent of de-identification required. These steps can also be used to review datasets for disclosure risk. This section introduces the steps. The techniques and the complexity of implementing them can vary widely among datasets and variables, from simple find-and-replace to advanced statistical methods. Simple transformations may be sufficient to improve protection of datasets for internal use and collaboration. Preparing fully de-identified public-use datasets may require a professional statistician for more complex transformations.

Please note: Applying these steps does not necessarily certify your dataset as "de-identified" or approved for restricted or public release. Investigators should ensure that they follow applicable JHU data governance policies, with guidance from IRB and the Johns Hopkins Privacy Office. Medical research subject to oversight by the JHM Data Trust Council has additional requirements regarding approval of disclosure protection prior to sharing de-identified or restricted datasets. Please refer to the JHM Data Trust Council section of this guide and the Data Trust FAQ for details.

1. Review and remove direct identifiers

Replace essential numerical values with truncated or range values, or apply more advanced anonymizing techniques. Replace essential text variables with codes or broader categories that are useful for analysis or reference.
Names | Addresses | Phone, cell
Email addresses | Government ID no. | Linked ID numbers
URLs, IP addresses | Treatment provider locations | Photos/biometric IDs
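
As a rough illustration of this step, the sketch below drops direct-identifier fields from a list of records and replaces a free-text variable with a broader category. The field names, the sample records, and the category mapping are all hypothetical and would need to match the variables in your own dataset.

```python
# Hypothetical field names; adjust to the variables in your own dataset.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email", "gov_id", "ip_address"}

# Hypothetical mapping from specific text values to broader categories.
OCCUPATION_CATEGORIES = {
    "cardiac surgeon": "health care",
    "school nurse": "health care",
    "high school teacher": "education",
}


def remove_direct_identifiers(record):
    """Return a copy of the record without direct-identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def recode_occupation(record):
    """Replace a specific occupation with a broader category ("other" if unmapped)."""
    cleaned = dict(record)
    if "occupation" in cleaned:
        key = cleaned["occupation"].strip().lower()
        cleaned["occupation"] = OCCUPATION_CATEGORIES.get(key, "other")
    return cleaned


# Made-up sample records for illustration only.
records = [
    {"name": "Jane Doe", "email": "jane@example.org",
     "occupation": "Cardiac Surgeon", "score": 12},
    {"name": "John Roe", "phone": "555-0100",
     "occupation": "Beekeeper", "score": 9},
]

# The original records are left untouched; only the copies are shared.
shared = [recode_occupation(remove_direct_identifiers(r)) for r in records]
print(shared)
```

Working on copies rather than editing records in place keeps the original data intact, which makes it easier to review what was removed before anything is shared.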
   

2. Remove and re-code specific dates

Focus especially on dates that can be linked to public records. Make changes that maintain analytic utility, and remove dates that are not needed for analysis.
Changing specific dates:
Date identifier | Change to
Specific day | Broaden to week or month
Date of birth | Age or age range
Date of interview/treatment | Remove if not needed, or shift by a fixed value
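
The date changes listed above could be sketched in Python roughly as follows. This is a minimal example using only the standard library; the function names, the 10-year age bands, the sample dates, and the 45-day offset are assumptions chosen for illustration, not values prescribed by this guide.

```python
from datetime import date, timedelta


def age_in_years(birth_date, reference_date):
    """Whole years elapsed between birth_date and reference_date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month, birth_date.day)
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)


def age_range(age, width=10):
    """Bucket an age into a range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def to_month(d):
    """Broaden a specific day to year and month only."""
    return d.strftime("%Y-%m")


def shift_date(d, offset_days):
    """Shift a date by a fixed number of days; intervals between dates are preserved."""
    return d + timedelta(days=offset_days)


# Hypothetical fixed shift. The real value should not be shared with the data.
OFFSET_DAYS = -45

# Made-up sample dates for illustration only.
dob = date(1984, 7, 19)
interview = date(2018, 3, 2)

age = age_in_years(dob, interview)
print(age_range(age))                      # 30-39
print(to_month(interview))                 # 2018-03
print(shift_date(interview, OFFSET_DAYS))  # 2018-01-16
```

Applying the same offset to every date for a participant keeps the time between events intact, which is often what the analysis needs.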

3. Remove and re-code geographic variables

Location is often the source of risk: the more specific the geography, the more care is required. Retain only the level of specificity required for analysis. If geographic specificity is kept, then more care will be needed to disguise other variables, especially for "outlier" cases that have low participant counts for particular combinations of variables.
Geographic variable to remove or recode | Remedy
Street address | State or broad population region
Census tract | Consolidated tract regions
ZIP Code or ZCTA | Truncate to first 3 digits
County | Consolidate or categorize (Eastern/Western counties, Urban/Rural counties)
Area populations < 100,000 | Consolidate regions for higher populations or number of participants within a given region
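
Two of the remedies above, truncating ZIP codes and consolidating regions with few participants, might look like this in Python. This is a sketch under stated assumptions: the minimum participant count is a hypothetical parameter, not a threshold set by this guide, and the sample ZIP codes are made up.

```python
from collections import Counter

# Hypothetical threshold; choose a value based on your own disclosure review.
MIN_PARTICIPANTS = 5


def truncate_zip(zip_code):
    """Keep only the first 3 digits of a ZIP code.

    Handles ZIP+4 values and ZIP codes stored as numbers that lost a leading zero.
    """
    five_digit = str(zip_code).strip().split("-")[0].zfill(5)
    return five_digit[:3]


def consolidate_small_regions(regions, min_count=MIN_PARTICIPANTS, label="other"):
    """Replace regions that have fewer than min_count participants with one label."""
    counts = Counter(regions)
    return [r if counts[r] >= min_count else label for r in regions]


# Made-up sample values for illustration only.
zips = ["21218", "21218-2608", "21205", "21201", "20850", 2138]

zip3 = [truncate_zip(z) for z in zips]
regions = consolidate_small_regions(zip3, min_count=2)
print(zip3)
print(regions)
```

The combined "other" group can itself end up small, so its count should be checked against the same threshold before sharing.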

4. Remove / recode variables that pose risk of link to external datasets

Indirect variables that could link to external datasets can be the most challenging, both to locate and to make safe. Pay particular attention to variables that could link to external data sources that are potentially public or accessible, such as certain government registries. Social media, such as Facebook profiles, may also facilitate re-identification given specific regional locations or demographics of participants. Sometimes values can be masked or randomized rather than removed; for example, an exact amount can be changed to a range value.
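
Changing an exact amount to a range value can be sketched as below. The variable (income), the cut points, and the labels are hypothetical and only illustrate the idea; suitable ranges depend on the distribution of values in your own data.

```python
import bisect

# Hypothetical cut points for an income variable, in dollars.
CUT_POINTS = [25_000, 50_000, 100_000]
LABELS = ["under 25,000", "25,000-49,999", "50,000-99,999", "100,000 or more"]


def to_range(amount):
    """Map an exact amount to the label of the range that contains it."""
    return LABELS[bisect.bisect_right(CUT_POINTS, amount)]


# Made-up sample values for illustration only.
incomes = [18_250, 25_000, 73_400, 250_000]

income_ranges = [to_range(x) for x in incomes]
print(income_ranges)
```

The open-ended top range groups unusually high values together, which helps disguise outlier cases that might otherwise be matched to an external source.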

Copyright © 2018, by Johns Hopkins Data Services

5. Re-sort and renumber records and IDs from external source data and IDs created for your study

Case numbers or ID codes that came from source data used for a study, such as a set of government medical records, could allow shared records to be re-linked to the source data. The same goes for case IDs created by the research study to track participants or other variables. When sharing datasets, create new randomized IDs for subjects or other case variables. This can be done with calculations in databases, or manually in spreadsheets.
For example, to randomize subject IDs in Excel:
1. Insert a column, enter =RAND() in the first row, and copy it down to the other cells to generate random numbers.
2. Sort the rows by the random number column.
3. Create new sequentially numbered IDs in a new column for sharing.

Random | SubjID | SubjIDclean
1 | INT2.SUBJ.F213 | INT2.S.001
2 | INT2.SUBJ.F121 | INT2.S.002
3 | INT2.SUBJ.M815 | INT2.S.003
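
The same shuffle-then-renumber approach can be sketched in Python for datasets that are not kept in a spreadsheet. The function name, the "score" field, and the default prefix are assumptions for illustration; the ID formats mirror the Excel example above.

```python
import random


def randomize_ids(records, id_field="SubjID", prefix="INT2.S.", rng=None):
    """Shuffle records and assign new sequential IDs.

    Returns (shared_records, crosswalk). The crosswalk maps old IDs to new IDs.
    It should be stored separately from the shared data, or destroyed.
    """
    if rng is None:
        rng = random.SystemRandom()  # not seeded, so the order cannot be replayed
    shuffled = list(records)
    rng.shuffle(shuffled)

    shared, crosswalk = [], {}
    for n, record in enumerate(shuffled, start=1):
        new_id = f"{prefix}{n:03d}"
        crosswalk[record[id_field]] = new_id
        # Drop the original ID so it never appears in the shared copy.
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["SubjIDclean"] = new_id
        shared.append(cleaned)
    return shared, crosswalk


# Made-up sample records for illustration only.
records = [
    {"SubjID": "INT2.SUBJ.F213", "score": 12},
    {"SubjID": "INT2.SUBJ.F121", "score": 9},
    {"SubjID": "INT2.SUBJ.M815", "score": 15},
]

shared, crosswalk = randomize_ids(records)
print(shared)
```

Shuffling before renumbering matters because sequential IDs assigned in the original row order would still reveal the order of the source records.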