The Sheridan Libraries

Protecting Human Subject Identifiers

Introduction to concepts and basic techniques for disclosure analysis and protection of personal and health identifiers in research data for public or restricted access, following applicable JHU data governance policies. See Overview section for details.

5 steps for removing identifiers from datasets

Here are five categories of tasks for preparing datasets for sharing, whether among collaborators or for restricted or public access, depending on the extent of de-identification required. These steps can also be used to review datasets for disclosure risk. This section introduces the steps. The techniques and the complexity of implementing them can vary widely among datasets and variables, from simple find-and-replace to advanced statistical methods. Simple transformations may be sufficient to improve protection of datasets for internal use and collaboration. Preparing fully de-identified public-use datasets may require a professional statistician for more complex transformations.

Please note: Applying these steps does not necessarily certify your dataset as "de-identified" or approved for restricted or public release. Investigators should ensure that they follow applicable JHU data governance policies, with guidance from IRB and the Johns Hopkins Privacy Office. Medical research subject to oversight by the JHM Data Trust Council has additional requirements regarding approval of disclosure protection prior to sharing de-identified or restricted datasets. Please refer to the JHM Data Trust Council section of this guide and the Data Trust FAQ for details.

1. Review and remove direct identifiers

Remove direct identifiers that are not essential to the analysis. Replace essential numerical values with truncated values or ranges, or apply more advanced anonymizing techniques. Replace essential text variables with codes or broader categories that are of use for analysis or reference.

Names	Addresses	Phone, Cell
Email addresses	Government ID no.	Linked ID numbers
URLs, IP addresses	Treatment provider locations	Photos/biometric IDs
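
As a rough illustration of the removal part of this step, the sketch below drops identifier columns from a list of records. The column names and the `drop_direct_identifiers` helper are hypothetical, chosen only for this example; a real dataset would need its own list, and identifiers embedded in free-text fields would not be caught this way.

```python
# Hypothetical column names; adjust to match the dataset's own headers.
DIRECT_IDENTIFIER_COLUMNS = {
    "name", "address", "phone", "email", "government_id", "ip_address",
}


def drop_direct_identifiers(records, columns=DIRECT_IDENTIFIER_COLUMNS):
    """Return copies of the records without the listed identifier columns."""
    blocked = {c.lower() for c in columns}
    return [
        {key: value for key, value in row.items() if key.lower() not in blocked}
        for row in records
    ]


records = [
    {"name": "Jane Doe", "email": "jane@example.org", "age_range": "40-49", "score": 12},
]
cleaned = drop_direct_identifiers(records)
print(cleaned)
```

The function returns new dictionaries rather than editing in place, so the original records stay available for internal use.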

2. Remove and re-code specific dates

Focus especially on dates that can be linked to public records. Make changes that maintain analytic utility, and remove dates that are not needed for analysis.

Date Identifier	Change to
Specific day	Year with date shifting for PHI Safe Harbor standards
Date of birth	Age or age range
Date of interview/treatment	Remove if not needed, or shift by a fixed value
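
The recodings in the table above could be sketched as follows. This is a minimal illustration, not a certified method: the function names, the 10-year bin width, and the 30-day offset are all assumptions made for the example.

```python
from datetime import date, timedelta


def age_on(birth_date, reference_date):
    """Age in completed years on the reference date."""
    years = reference_date.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def age_range(age, width=10):
    """Recode an exact age as a range label such as '40-49' (width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def shift_date(original, offset_days):
    """Shift a date by a fixed number of days (the offset value is an assumption)."""
    return original + timedelta(days=offset_days)


interview = date(2024, 3, 1)
age = age_on(date(1980, 6, 15), interview)
print(age, age_range(age))          # 43 40-49
print(shift_date(interview, -30))   # 2024-01-31
```

Using the same offset for all of one participant's dates keeps the intervals between their events intact, which is usually what the analysis depends on.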

3. Remove and re-code geographic variables

Location is often the source of risk: the more specific the geography, the more care required. Retain only the level of specificity required for analysis. If geographic specificity is kept, then more care will be needed to disguise other variables, especially for "outlier" cases that have low participant counts for particular combinations of variables.

Geographic variables to remove or recode	Remedy:
Street Address	State or broad population region
Census Tract	Consolidated tract regions
Zip Code or ZCTA	Truncate to first 3 digits
County	Consolidate or categorize (Eastern/Western counties, Urban/Rural counties)
Area populations < 100,000	Consolidate regions for higher populations or number of participants within a given region
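
Two of the remedies above, truncating ZIP codes and checking for regions with few participants, could be sketched like this. The helper names and the `min_count` threshold are assumptions for illustration; the appropriate threshold depends on the applicable policy.

```python
from collections import Counter


def truncate_zip(zip_code):
    """Return the first 3 digits of a 5-digit ZIP or ZIP+4 code, or None if unparseable."""
    base = str(zip_code).strip().split("-")[0]
    if not base.isdigit() or len(base) > 5:
        return None
    # Restore leading zeros lost when ZIP codes were stored as numbers.
    return base.zfill(5)[:3]


def small_regions(regions, min_count=5):
    """List regions with fewer than min_count participants (threshold is an assumption)."""
    counts = Counter(regions)
    return sorted(region for region, n in counts.items() if n < min_count)


zips = ["21218-1234", "21201", 2138, "21230"]
zip3 = [truncate_zip(z) for z in zips]
print(zip3)                              # ['212', '212', '021', '212']
print(small_regions(zip3, min_count=2))  # ['021']
```

Regions flagged by `small_regions` are candidates for consolidation into a broader area.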

4. Remove / recode variables that pose risk of link to external datasets

Indirect variables that could link to external datasets can be the most challenging, both to locate and to mitigate. Pay particular attention to variables that could link to external data sources that are potentially public or accessible, such as certain government registries. Social media, such as Facebook profiles, may also facilitate re-identification given specific regional locations or demographics of participants. Sometimes values can be masked or randomized rather than removed, for example by changing an exact amount to a range value.
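
Changing an exact amount to a range value could look like the sketch below. The `to_range` helper and the bin width of 10,000 are assumptions for this example, and the sketch assumes non-negative whole amounts.

```python
def to_range(amount, width=10000):
    """Recode a non-negative amount as a range label such as '40000-49999'."""
    if amount < 0:
        raise ValueError("this sketch assumes non-negative amounts")
    low = (int(amount) // width) * width
    return f"{low}-{low + width - 1}"


incomes = [43250, 9800, 120000]
print([to_range(value) for value in incomes])
# ['40000-49999', '0-9999', '120000-129999']
```

Wider bins give more protection at the cost of analytic detail, so the width is a trade-off to settle per variable.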

5. Re-sort and renumber records and IDs from external source data and IDs created for your study

Case numbers or ID codes that came from source data used for a study, such as a set of government medical records, could allow shared records to be re-linked to that source data. The same goes for case IDs created by the research study to track participants or other variables. When sharing datasets, create new randomized IDs for subjects or other case variables. This can be done with calculations in databases, or manually in spreadsheets.

For example, to randomize subject IDs in Excel:

1. Insert a column, enter =RAND() in the first row, and copy it down to the other cells to generate random numbers.
2. Sort the rows by the random number column.
3. Create new sequentially numbered IDs in a new column for sharing.

Random	SubjID	SubjIDclean
1	INT2.SUBJ.F213	INT2.S.001
2	INT2.SUBJ.F121	INT2.S.002
3	INT2.SUBJ.M815	INT2.S.003
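
The same shuffle-then-renumber approach could be scripted, as in the sketch below. The `assign_random_ids` helper, the `INT2.S.` prefix, and the zero-padding width are assumptions modeled on the spreadsheet example, and the sketch assumes the original IDs are unique.

```python
import random


def assign_random_ids(original_ids, prefix="INT2.S.", rng=None):
    """Map each original ID to a new sequential ID assigned in random order."""
    # SystemRandom draws from the operating system's randomness source.
    rng = rng or random.SystemRandom()
    shuffled = list(original_ids)
    rng.shuffle(shuffled)
    width = max(3, len(str(len(shuffled))))
    return {
        old: f"{prefix}{number:0{width}d}"
        for number, old in enumerate(shuffled, start=1)
    }


mapping = assign_random_ids(["INT2.SUBJ.F213", "INT2.SUBJ.F121", "INT2.SUBJ.M815"])
for old_id, new_id in mapping.items():
    print(old_id, "->", new_id)
```

The mapping itself can re-link records to the original IDs, so it would need the same protection as the original data and should not be shared with the dataset.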

Last Updated: May 28, 2025 4:20 PM
URL: https://guides.library.jhu.edu/protecting_identifiers