Data Management and Sharing
Following good data management practices early in your research, such as documenting your data as you go, makes it easier to publish and share your research. You will need to determine which of your data, code, and documentation to share, and prepare them for use outside your immediate research team.
Ethics and Compliance: You need to be aware of restrictions and rules for data sharing.
De-identifying human participant data: See the section for a list of trainings and articles on removing direct and indirect identifiers from research data.
Selecting data for deposit: Share enough data, code, and documentation to make your research reproducible.
Selecting Data for Sharing and Preserving
In general, you should share enough data, code, and documentation so that others can reproduce your work. Your funder likely has its own definition of what is considered data and may provide guidance on what to share (see funder requirements).
Guidance on Selecting Data for Sharing
- How to Appraise and Select Research Data for Curation: From the Digital Curation Centre (DCC), criteria you can use for assessing what is impactful to share.
- Selecting data for publication: From the CESSDA "Data Management Expert Guide", questions to ask yourself about your data when determining what to share.
Organizing and Documenting Datasets for Repositories
Selecting and Organizing Files for a Dataset
The following are step-by-step guidelines for organizing and assembling data and/or software collections from a completed set of research work for deposit in the JH Research Data Repository, but they apply to most data repositories. The steps include physically organizing digital files (project data or other products of the research, e.g., software or codebooks) and documenting datasets and their components. The two goals of these guidelines are to produce datasets that 1) are understandable to others accessing the data online (so they do not need to contact you with basic questions about the content); and 2) you can access and understand in the future as a preserved, useful record of the research project, the data supporting a publication, or the analysis code.
The conceptual structure of an archived data collection is as follows:
It may be useful as a first step to look at the organization of datasets in the Johns Hopkins Research Data Repository (archive.data.jhu.edu), on the Dataverse platform. Datasets may be deposited individually, but multiple datasets can be collected under a project. A Project page for an individual, lab, or research project can give high-level information common to all collections contained therein. Each Dataset receives a unique DOI that can be used for citation. Each dataset contains a set of files, and a set of descriptive “catalog” fields with the data set title, authors, description, related publications, and other contextual information. Keep in mind that a dataset in the repository is basically a set of files that can be referenced using a single citation.
After determining what belongs in a single citable dataset, think about how to organize each downloadable file or component. A component can be a single file like a spreadsheet, but is often an entire folder of files and subdirectories that can be organized in a range of ways. These are typically downloaded as a .zip file. Since each component is downloaded separately, consider which sets of files viewers of the dataset would want to download as a group.
The following examples list downloadable components in brackets:
- A dataset for a publication may have: [Figure 1 Data], [Figure 2 Data]
- A dataset for an experiment may include: [analysis spreadsheet w/instrument output files], [software code & documentation], [lab notes]
- A dataset for a piece of research software may include: [Source code + dependencies & documentation], [executables], [sample data]
When gathering files for a collection, consider how files and file folders relate to each other hierarchically (i.e., what should be downloaded together) and sequentially (i.e., by methodological steps in the workflow of the project). You may decide not to include all your research files (e.g., raw data) in an archived dataset to share online, but it may be useful to gather them initially to keep a more complete version of a dataset for yourself.
Also consider what supplemental materials should go with a dataset, especially documents or other files that provide context for others accessing that dataset. These could include sample inputs/outputs (in the case of software), source code documentation, survey codebooks, lab notes, or workflow diagrams of data production.
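As a rough illustration of grouping files into downloadable components, the following minimal Python sketch creates one .zip archive per top-level folder of a local dataset directory. The function name, folder layout, and output location are assumptions made for this example, not part of any repository's tooling, and the sketch assumes the output folder sits outside the dataset folder.

```python
import zipfile
from pathlib import Path


def zip_components(dataset_dir: Path, output_dir: Path) -> list[Path]:
    """Create one .zip archive per top-level component folder in dataset_dir.

    Assumes output_dir is located outside dataset_dir (hypothetical layout).
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for component in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        archive_path = output_dir / f"{component.name}.zip"
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for file in sorted(component.rglob("*")):
                if file.is_file():
                    # Store paths relative to the dataset folder so the
                    # component name stays as the top-level folder in the zip.
                    zf.write(file, file.relative_to(dataset_dir).as_posix())
        archives.append(archive_path)
    return archives
```

For example, a dataset folder containing "figure_1_data" and "source_code" subfolders would yield figure_1_data.zip and source_code.zip, each of which could then be uploaded as a separate component.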
Documentation to include with a repository dataset deposit
We recommend including additional documentation for the entire dataset, and ideally for each individual component that can be downloaded. For example, a folder of material can include, at minimum, a text document titled “Readme” or “Content Notes” that describes the folder’s contents to whatever detail will orient other researchers. The researcher may not return to the Repository dataset’s description after downloading, so this content overview file will remain with the downloaded component for future reference. Consider adding additional usage instruction documentation for datasets containing software, instrument source data, or other materials requiring special procedures for reuse or interpretation.
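One way to start such a content overview file is to generate a skeleton that lists every file in the folder and then fill in the descriptions by hand. The sketch below is a minimal example under assumed conventions: the file name README.txt, the layout of the text, and both function names are hypothetical choices, not a required format.

```python
from pathlib import Path


def build_readme(component_dir: Path, title: str, description: str) -> str:
    """Return plain-text README content listing every file in component_dir."""
    lines = [
        title,
        "=" * len(title),
        "",
        description,
        "",
        "Contents:",
    ]
    for file in sorted(component_dir.rglob("*")):
        # Skip the README itself so rerunning does not list it.
        if file.is_file() and file.name != "README.txt":
            rel = file.relative_to(component_dir).as_posix()
            size = file.stat().st_size
            lines.append(f"- {rel} ({size} bytes): TODO describe this file")
    return "\n".join(lines) + "\n"


def write_readme(component_dir: Path, title: str, description: str) -> Path:
    """Write the generated skeleton to README.txt inside component_dir."""
    readme_path = component_dir / "README.txt"
    readme_path.write_text(
        build_readme(component_dir, title, description), encoding="utf-8"
    )
    return readme_path
```

The "TODO" placeholders are meant to be replaced with a short description of each file before deposit, since the README travels with the downloaded component.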
We also recommend adding metadata to selected files within your datasets, especially where additional information will aid another researcher in making use of those files.
A clear example is shared spreadsheets. If you provide a table of figures with acronym headers, is there a code key to explain the header acronyms, variable units and classifications, and worksheet names? These can be added to the spreadsheet in a separate worksheet. Where helpful metadata can be added will largely depend on the formats and types of data you intend to deposit. Many file formats allow the addition of author, project references, and usage rights within “Properties” metadata. Many instrument-generated data formats also support metadata, sometimes in a format standardized for particular fields of study, allowing better search and integration with compatible datasets and repositories.
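For tabular data exported as CSV, a code key can be kept as a small companion file. The following sketch, using only Python's standard csv module, writes one row per column header and reports which headers still lack a description. The column names of the key ("variable", "description", "units", "allowed_values") and the function name are assumptions for illustration, and the data file is assumed to have a header row.

```python
import csv
from pathlib import Path

# Hypothetical layout for the code key; adapt the fields to your own data.
KEY_FIELDS = ["variable", "description", "units", "allowed_values"]


def write_code_key(
    data_csv: Path, key_csv: Path, known: dict[str, dict[str, str]]
) -> list[str]:
    """Write a code key with one row per column header in data_csv.

    Returns the headers that have no description yet.
    """
    with data_csv.open(newline="", encoding="utf-8") as f:
        headers = next(csv.reader(f))  # assumes the first row holds headers

    missing = []
    with key_csv.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=KEY_FIELDS)
        writer.writeheader()
        for header in headers:
            entry = known.get(header, {})
            if not entry.get("description"):
                missing.append(header)
            row = {"variable": header}
            for field in KEY_FIELDS[1:]:
                row[field] = entry.get(field, "")
            writer.writerow(row)
    return missing
```

Calling the function with a partially filled dictionary of known variables returns the headers that still need an explanation, which can serve as a checklist before deposit.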
Your analysis software may also be able to add metadata to your research files. Consider adding author/project identifiers, dates, or other relevant information that your research files can carry with them as they are shared and reused. More suggestions can be found in our "document data" section.
Ethics and Compliance
Check with your divisional IRB office if you are unsure what you can share, and review applicable government policies and guidance on protecting protected health information (PHI).
Divisional JHU IRBs
- Johns Hopkins Medicine Institutional Review Board
- Homewood IRB: for Krieger School of Arts and Sciences, Whiting School of Engineering, School of Education, Carey Business School, Nitze School of Advanced International Studies, and Peabody Institute.
- School of Public Health IRB
Policies on Human Participants and Data Sharing
HIPAA for Professionals by U.S. Health and Human Services: The U.S. Department of Health and Human Services provides information on patient privacy, de-identification methods, security, and related topics for people who work with data containing Protected Health Information (PHI) or Personally Identifiable Information (PII). If you plan to work with PHI/PII data, their Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule defines 18 personal identifiers and acceptable methods to de-identify patient information.
FERPA policies resource site from the Department of Education on sharing and accessing student data for educational research.
CITI Training on Human Subjects Research: Core training in human subjects research that covers the historical development of human subject protections, ethical issues, and current regulatory and guidance information.
HHS Policy for the Protection of Human Subjects ('Common Rule'): This policy provides a robust set of protections for human subjects in research.
JHM Data Trust: The Johns Hopkins Medicine Data Trust Council has been established to provide JHM researchers with the technical infrastructure, standards, policies and procedures, and organization needed to bring together patient and member-related data from across the health system to support our mission. The goals of the Data Trust are to ensure the security and privacy of our patients’ data, consolidate teams to address organizational priorities and reduce redundancy, and increase the value of data through better integration and analytics.
- Accessing data from JH Medicine clinical enterprise systems such as EPIC or PMAP requires approval from the Data Trust Research Subcouncil, following their procedures for requesting data.
- Sharing JH Medicine Data (patient- or member-related data stored in clinical enterprise systems) with researchers outside of JHM (including other JHU divisions such as School of Public Health, and Krieger Arts & Sciences), or data repositories such as Johns Hopkins Research Data Repository, may require Data Trust review, including IRB-approved protocols for data sharing and data de-identification if needed. More information.
De-identifying Human Subjects Data
With researchers increasingly encouraged or required to share their data, preparing to share datasets with confidential identifiers of people and organizations is particularly challenging.
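To make the idea of removing direct identifiers and coarsening indirect ones concrete, here is a minimal Python sketch that processes one record at a time. All column names are hypothetical, and the two generalizations shown (grouping ages above 89 into a single category and keeping only the year of a date) are just examples of common techniques. A sketch like this is not sufficient on its own to meet HIPAA or IRB requirements; other indirect identifiers may remain, and a full disclosure review is still needed.

```python
# Hypothetical column names; adjust to the columns in your own dataset.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "medical_record_number"}


def deidentify_record(record: dict[str, str]) -> dict[str, str]:
    """Drop direct identifiers and coarsen age and date fields in one record."""
    cleaned = {}
    for column, value in record.items():
        if column in DIRECT_IDENTIFIERS:
            continue  # remove direct identifiers entirely
        if column == "age":
            # Group ages above 89 into a single "90+" category.
            cleaned[column] = "90+" if int(value) > 89 else value
        elif column == "visit_date":
            # Keep only the year from an ISO date such as 2021-03-15.
            cleaned["visit_year"] = value[:4]
        else:
            cleaned[column] = value
    return cleaned
```

Applied to every row of a table, this produces a copy without the listed identifier columns; the original identified data should still be stored securely and separately.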
JHU Data Services Resources
Protecting Human Subject Identifiers Guide: A very comprehensive guide that will introduce you to concepts and basic techniques for disclosure analysis and protection of personal and health identifiers in research data for public or restricted access, following applicable JHU data governance policies.
Webinars: Go to our calendar to find the next live webinar about common privacy disclosure risks from personal and health identifiers in data and techniques for de-identifying data for external collaborators and public databases. We also discuss preparing consent forms that facilitate data sharing, and keeping identifier data secure during and after projects.
Interactive, online training: JHU Data Services has developed an online training to be taken at your convenience. It provides an overview of the types of identifiers, and how to determine if your data have disclosure risk. You will also learn about available JHU resources to help you with de-identifying data.
Applications to Assist in De-identification of Human Subjects Research Data: A list of de-identification software tools and applications that researchers can use in de-identifying their research data for more public sharing.
NIH: Protecting Privacy When Sharing Human Research Participant Data: This supplemental information was created to assist researchers in addressing privacy considerations when sharing human research participant data. It provides a set of principles, best practices, and points to consider for creating a robust framework for protecting the privacy of research participants when sharing data.
NIST de-identification tools: The National Institute of Standards and Technology has compiled a list of de-identification tools, with a description of each tool.
Cancer Image Archive: https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview
The FAIR Guiding Principles for scientific data management and stewardship, published in 2016, outlined methods for broadening access to shared data, focusing particularly on better discovery and open access through data repositories, and better reuse through documentation and machine-readable metadata standards. FAIR Principles fit within the wider promotion of Open Science and reproducible research. Data sharing policies by funders often cite these principles as a goal for making publicly funded data more widely available.
- FAIR Principles: overview provided by the GO FAIR Initiative
- FAIRsharing.org: provides resources and database collections supporting FAIR principles for various stakeholders including:
- FAIR Sharing Standards: A registry of terminology artefacts, models/formats, reporting guidelines, and identifier schemas.
- FAIR Data Repositories & Knowledgebases: A registry of knowledgebases and repositories of data and other digital assets
- FAIRsharing.org Data Policies database: A registry of data preservation, management and sharing policies from international funding agencies, regulators, journals, and other organisations.
- CARE Principles for Indigenous Data Governance: discussing special considerations for sharing data from indigenous populations
- FASEB Science Policy and Advocacy: Federation of American Societies for Experimental Biology's collection of policy statements and best practices regarding data management and sharing, including the DataWorks! initiative promoting data sharing and exemplary data management plans.