Protecting Human Subject Identifiers
5 steps for removing identifiers from datasets
Here are five categories of tasks for preparing datasets for sharing, whether among collaborators or for restricted or public access, depending on the extent of de-identification required. These steps can also be used to review datasets to identify disclosure risk. This section introduces the steps. The techniques and the complexity of implementing them can vary widely among datasets and variables, from simple find-and-replace to advanced statistical methods. Simple transformations may be sufficient to improve the protection of datasets for internal use and collaboration. Preparing fully de-identified public-use datasets may require a professional statistician for more complex transformations.
1. Review and remove direct identifiers
2. Remove and re-code specific dates
3. Remove and re-code geographic variables
4. Remove or re-code variables that pose a risk of linkage to external datasets
5. Re-sort and renumber records and IDs from external source data and IDs created for your study
For example, to randomize subject IDs in Excel:

1. Insert a column, enter =RAND() in the first row, and copy it down to the other cells to generate random numbers.
2. Sort the rows by the random number column.
3. Create new sequentially numbered IDs in a new column for sharing.

| Random | SubjID | SubjIDclean |
|---|---|---|
| 1 | INT2.SUBJ.F213 | INT2.S.001 |
| 2 | INT2.SUBJ.F121 | INT2.S.002 |
| 3 | INT2.SUBJ.M815 | INT2.S.003 |
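
The same shuffle-and-renumber idea from the Excel example can be sketched in Python for larger files. This is a minimal illustration, not a prescribed method: the function name `randomize_subject_ids` and its `prefix`, `width`, and `seed` parameters are hypothetical, and the ID format simply mirrors the sample table above.

```python
import random


def randomize_subject_ids(subject_ids, prefix="INT2.S.", width=3, seed=None):
    """Map each original ID to a new sequential ID assigned in random order.

    All names and defaults here are illustrative assumptions. Leave `seed`
    as None for real use; a fixed seed is only useful for testing.
    """
    rng = random.Random(seed)
    # Drop duplicate IDs (keeping first occurrence) so each subject gets one new ID.
    unique_ids = list(dict.fromkeys(subject_ids))
    # Equivalent to sorting the rows by a column of random numbers.
    rng.shuffle(unique_ids)
    # Assign new sequential IDs in the shuffled order.
    return {
        old_id: f"{prefix}{number:0{width}d}"
        for number, old_id in enumerate(unique_ids, start=1)
    }


if __name__ == "__main__":
    original_ids = ["INT2.SUBJ.F213", "INT2.SUBJ.F121", "INT2.SUBJ.M815"]
    id_map = randomize_subject_ids(original_ids)
    for old_id, new_id in id_map.items():
        print(old_id, "->", new_id)
```

The returned dictionary is a crosswalk between the original and new IDs. It can re-identify records, so it would need to be stored separately from the shared dataset, or destroyed if no re-linking is required.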