Data Management and Sharing

This guide gathers overviews and resources for data management and sharing. It follows the research data workflow, from preparing data management and sharing plans for grant proposals, through conducting research, to sharing research data.

Following good data management practices early in your research, such as documenting your data as you go, makes it easier to publish and share your research. You will need to determine which of your data, code, and documentation to share, and prepare them for use outside your immediate research team.

Ethics and Compliance: You need to be aware of restrictions and rules for data sharing.

De-identifying human participants data: See the De-identifying Human Subjects Data section for a list of trainings and articles on removing direct and indirect identifiers from research data.

Selecting data for deposit: Share enough data, code, and documentation to make your research reproducible.

Document Data: See the "document data" page for information and resources you can use to organize and document your research for sharing.

Selecting Data for Sharing and Preserving

In general, you should share enough data, code, and documentation so that others can reproduce your work. Your funder likely has its own definition of what is considered data and may also provide guidance on what to share (see funder requirements).

Guidance on Selecting data for sharing

Organizing and Documenting Datasets for Repositories

Selecting and Organizing Files for a Dataset

The following are step-by-step guidelines for organizing and assembling data and/or software collections from a completed set of research work for deposit in the JH Research Data Repository; they also apply to most other data repositories. The steps include physically organizing digital files (project data or other products of the research, e.g., software or codebooks) and documenting datasets and their components. The two goals of these guidelines are to produce datasets that 1) others accessing the data online can understand (rather than contacting you with basic questions about the content); and 2) you can access and understand in the future as a preserved, useful record of the research project, the data supporting a publication, or the analysis code.

The conceptual structure of an archived data collection is as follows:

Organization of Projects and Datasets

As a first step, it may be useful to look at how datasets are organized in the Johns Hopkins Research Data Repository (archive.data.jhu.edu), which runs on the Dataverse platform. Datasets may be deposited individually, but multiple datasets can be collected under a project. A Project page for an individual, lab, or research project can give high-level information common to all collections contained therein. Each Dataset receives a unique DOI that can be used for citation. Each dataset contains a set of files and a set of descriptive “catalog” fields with the dataset title, authors, description, related publications, and other contextual information. Keep in mind that a dataset in the repository is basically a set of files that can be referenced using a single citation.

After determining what belongs in a single citable dataset, think about how to organize each downloadable file or component. A component can be a single file, like a spreadsheet, but is often an entire folder of files and subdirectories that can be organized in a range of ways. Folders are typically downloaded as a .zip file. Since each component is downloaded separately, consider which sets of files viewers of the dataset would want to download as a group.

The following examples list downloadable components in brackets:

  • A dataset for a publication may have: [Figure 1 Data], [Figure 2 Data]
  • A dataset for an experiment may include: [analysis spreadsheet w/instrument output files], [software code & documentation], [lab notes]
  • A dataset for a piece of research software may include: [Source code + dependencies & documentation], [executables], [sample data]
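As a rough illustration of grouping files into downloadable components, the sketch below zips each top-level subfolder of a local dataset directory into its own archive. It assumes one subfolder per component (for example, a hypothetical figure_1_data folder); the function names and folder layout are illustrative assumptions, not a repository requirement.

```python
import zipfile
from pathlib import Path


def zip_component(folder: Path, out_dir: Path) -> Path:
    """Zip one component folder so it can be uploaded as a single file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{folder.name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Keep the component folder name as the top level of the archive.
                zf.write(path, path.relative_to(folder.parent).as_posix())
    return archive


def zip_all_components(dataset_dir: Path, out_dir: Path) -> list[Path]:
    """Create one archive per top-level subfolder of dataset_dir.

    out_dir should sit outside dataset_dir so archives are not re-zipped
    on a later run.
    """
    folders = [p for p in sorted(dataset_dir.iterdir()) if p.is_dir()]
    return [zip_component(p, out_dir) for p in folders]
```

Calling zip_all_components(Path("my_dataset"), Path("upload")) would then produce one .zip per component folder, each unpacking into a folder with the component's name.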

When gathering files for a collection, consider how files and file folders relate to each other hierarchically (i.e., what should be downloaded together) and sequentially (i.e., by methodological steps in the workflow of the project). You may decide not to include all your research files (e.g., raw data) in an archived dataset to share online, but it may be useful to gather them initially to keep a more complete version of the dataset for yourself.

Also consider what supplemental materials should go with a dataset, especially documents or other files that provide context for others accessing that dataset. These could include sample inputs/outputs (in the case of software), source code documentation, survey codebooks, lab notes, or workflow diagrams of data production.

Documentation to include with a repository dataset deposit

We recommend including additional documentation for the entire dataset, and ideally for each individual component that can be downloaded. For example, a folder of material can include, at minimum, a text document titled “Readme” or “Content Notes” that describes the folder’s contents in enough detail to orient other researchers. A researcher may not return to the Repository dataset’s description after downloading, so this content overview file stays with the downloaded component for future reference. Consider adding usage instructions for datasets containing software, instrument source data, or other materials requiring special procedures for reuse or interpretation.
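One way to start such a file is to generate a skeleton and then fill in the descriptions by hand. The sketch below writes a plain-text README.txt that lists every file in a component folder; the file name, layout, and the write_readme function are assumptions for illustration, not a required format.

```python
from datetime import date
from pathlib import Path


def write_readme(folder: Path, title: str, description: str) -> Path:
    """Write a README.txt that lists every file in a component folder."""
    lines = [
        title,
        "=" * len(title),
        "",
        description,
        "",
        f"Prepared: {date.today().isoformat()}",
        "",
        "Contents:",
    ]
    for path in sorted(folder.rglob("*")):
        # Skip the README itself so reruns do not list it.
        if path.is_file() and path.name != "README.txt":
            size_kb = path.stat().st_size / 1024
            rel = path.relative_to(folder).as_posix()
            # The bracketed placeholder is meant to be replaced by hand.
            lines.append(f"  {rel}  ({size_kb:.1f} KB)  [describe this file]")
    readme = folder / "README.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

The generated list is only a starting point: the value of the file comes from the descriptions you add for each entry.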

File-level Metadata

We also recommend adding metadata to selected files within your datasets, especially where additional information will aid another researcher in making use of those files.

Shared spreadsheets are a clear example. If you provide a table of figures with acronym headers, is there a code key that explains the header acronyms, variable units and classification, and worksheet names? These can be added to the spreadsheet in a separate worksheet. Where helpful metadata can be added will largely depend on the formats and types of data you intend to deposit. Many file formats allow the addition of author, project references, and usage rights within “Properties” metadata. Many instrument-generated data formats also support metadata, sometimes in a format standardized for particular fields of study, allowing better search and integration with compatible datasets and repositories.
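For data stored as plain CSV rather than a multi-sheet spreadsheet, a code key can be kept as a separate file next to the data. The sketch below is one possible approach under that assumption: it checks that every column header has an entry in the key and saves the key as its own CSV. The headers, definitions, and function names are hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical column headers and definitions, for illustration only.
CODE_KEY = {
    "PID": {"meaning": "Participant identifier (study-assigned)", "units": "n/a"},
    "SBP": {"meaning": "Systolic blood pressure", "units": "mmHg"},
    "HT": {"meaning": "Height", "units": "cm"},
}


def undocumented_headers(data_csv: Path, code_key: dict) -> list[str]:
    """Return headers in the data file that have no entry in the code key."""
    with data_csv.open(newline="", encoding="utf-8") as f:
        headers = next(csv.reader(f))
    return [h for h in headers if h not in code_key]


def write_code_key(code_key: dict, out_csv: Path) -> None:
    """Save the code key as a CSV file to deposit alongside the data."""
    with out_csv.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["header", "meaning", "units"])
        for header, info in code_key.items():
            writer.writerow([header, info["meaning"], info["units"]])
```

Running undocumented_headers before deposit gives a quick list of columns that still need an explanation in the key.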

Your analysis software may also be able to add metadata to your research files. Consider adding author/project identifiers, dates, or other relevant information that your research files can carry with them as they are shared and reused. More suggestions can be found in our "document data" section.

Ethics and Compliance

Check with your divisional IRB office if you are unsure what you can share, and review applicable government policies and guidance on safeguarding protected health information (PHI).

Divisional JHU IRBs
Policies on Human Participants and Data Sharing

De-identifying Human Subjects Data

With researchers increasingly encouraged or required to share their data, preparing to share datasets with confidential identifiers of people and organizations is particularly challenging.

JHU Data Services Resources

Protecting Human Subject Identifiers Guide: A comprehensive guide that introduces concepts and basic techniques for disclosure analysis and protection of personal and health identifiers in research data for public or restricted access, following applicable JHU data governance policies.

Webinars: Go to our calendar to find the next live webinar about common privacy disclosure risks from personal and health identifiers in data and techniques for de-identifying data for external collaborators and public databases. We also discuss preparing consent forms that facilitate data sharing, and keeping identifier data secure during and after projects.

Interactive, online training: JHU Data Services has developed an online training to be taken at your convenience. It provides an overview of the types of identifiers, and how to determine if your data have disclosure risk. You will also learn about available JHU resources to help you with de-identifying data. 

Applications to Assist in De-identification of Human Subjects Research Data: A list of de-identification software tools and applications that researchers can use to de-identify their research data for broader public sharing.
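To make the distinction between direct and indirect identifiers concrete, the sketch below drops a few direct identifiers from a record and coarsens exact age, an indirect identifier, into ten-year bands. The column names and band width are hypothetical, and this is only an illustration of two basic techniques; whether a dataset is safe to share still depends on disclosure analysis and your IRB's guidance.

```python
# Hypothetical column names, for illustration only.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def age_band(age: int, width: int = 10) -> str:
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record: dict) -> dict:
    """Drop direct identifiers and coarsen age, an indirect identifier."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age"] = age_band(int(cleaned["age"]))
    return cleaned


record = {"name": "A. Person", "email": "a@example.org", "age": 34, "score": 7}
print(deidentify(record))  # {'age': '30-39', 'score': 7}
```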

Additional Resources

NIH: Protecting Privacy When Sharing Human Research Participant Data: This supplemental information was created to assist researchers in addressing privacy considerations when sharing human research participant data. It provides a set of principles, best practices, and points to consider for creating a robust framework for protecting the privacy of research participants when sharing data.

NIST de-identification tools: The National Institute of Standards and Technology has compiled a list of de-identification tools, with a description of each tool.

Cancer Imaging Archive: https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview

National Library of Medicine Scrubber: A freely available clinical text de-identification tool designed and developed at the National Library of Medicine. Watch this presentation to learn more.

FAIR Principles

The FAIR Guiding Principles for scientific data management and stewardship, published in 2016, outlined methods for broadening access to shared data, focusing particularly on better discovery and open access through data repositories, and better reuse through documentation and machine-readable metadata standards. FAIR Principles fit within the wider promotion of Open Science and reproducible research. Data sharing policies by funders often cite these principles as a goal for making publicly funded data more widely available. 

See also the JHU Data Services online training on Open Science and the Open Access Guide by the Sheridan Libraries.