20 Sharing your data the open way – Environmental Data Science: a Field Guide for Navigating the Age of Open Environmental Data

20.1 Why to share

Sharing the process and products of scientific analysis is a key component of the philosophy of open science. Openness is not always the best choice for a given project: some projects need to keep data or analysis output restricted because of privacy concerns (medical data or records, for example) or for reasons related to data sovereignty (tribal or international data ownership). In many cases, however, sharing openly brings clear benefits.

Before we get into the specifics, it’s worth taking the time to think through the reasons why you may want to share your code and data. The first benefit of a workflow geared towards openness is its effect on you as a researcher: if you assume from the start that your work is going to be publicly available, your mindset shifts as you work. Knowing that the code you write and the data you organize will be out there for the world to see and build on incentivizes you to hold them to a higher standard of quality, organization, and documentation. Remember, your most important and frequent collaborator is you in the future! The work you put in now to keep the project organized and well documented makes it that much easier to build on the work later, or to return to the project after stepping away from it for a while. If you supervise trainees in any capacity, whether they are undergraduate or graduate student researchers or interns at your company or organization, encouraging them to structure their work so that it is easy to share and hand off will pay huge dividends for the usability of their efforts even after they’ve moved on in their careers.

The second major benefit of assuming from the beginning that your work will be shared publicly is that it leads to practices (like those we’ve been advocating for throughout this book) that make your analyses much more readily reproducible. Anyone who has worked with scientific coding projects over time or who has tried to run code developed by another researcher knows the challenges of figuring out not only how the code itself works, but also how to get the computational environment working. The approaches we outlined in Chapter 10 are some of the ways that you can specify, in a machine-readable way, the software dependencies necessary to re-run your analyses.
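As a small illustration of what "machine-readable" can mean here, the sketch below records the interpreter version and the installed versions of a few packages in a structure that can be saved alongside an analysis. It is not the approach from Chapter 10, only a minimal example using Python's standard library; the function name `describe_environment` and the package names passed to it are assumptions made for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages, version_of=metadata.version):
    """Return a dict describing the interpreter and selected package versions.

    `packages` is a list of distribution names chosen by the analyst.
    `version_of` is injectable so the function can be exercised without
    any particular package being installed.
    """
    found = {}
    missing = []
    for name in packages:
        try:
            found[name] = version_of(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the report is still written.
            missing.append(name)
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": found,
        "missing": missing,
    }


if __name__ == "__main__":
    # Hypothetical package list; replace with the packages your analysis imports.
    report = describe_environment(["numpy", "pandas"])
    print(json.dumps(report, indent=2, sort_keys=True))
```

Saving this output next to your results gives a future reader (including future you) a starting point for rebuilding the environment, although a full environment specification or container image captures far more than a list of versions.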

FIXME Add sentences about literate programming approaches like quarto/rmarkdown/jupyter/org babel

The third major benefit is that it allows for better peer review at all levels. By preparing your work to be shared, you make it easier to get feedback from others. At the most local level, this could mean feedback from collaborators within your own organization, but it can scale up to formal peer review during the publication process, or even to post-publication review and feedback from researchers around the world.

The fourth major benefit circles back to where we started. While keeping your work well organized, reproducible, and well documented benefits you most directly, it also makes it substantially easier for others to build on your efforts. And in the end, isn’t that what science is all about?

20.2 How to share

Tools and tips
Sharing vs archiving, the value of a DOI
Licenses for text (Creative Commons) vs licenses for code (MIT/GPL/etc)
Tools for sharing code
- GitHub/GitLab/Codeberg
Tools for sharing computational environments
- Docker Hub/GitHub Container Registry
Publication in a journal
- Open access flavors
  - Diamond
  - Gold
  - Green

20.3 When to share

When should we share?
1. Right away? With publication?

Use examples of different open workflows from Hampton et al. in Ecosphere (Hampton et al. 2015): a project can be open throughout the entire process, from grant proposal to publication, or open only upon publication. Both are valuable and may be appropriate in different contexts.

As analyses become increasingly complex, it is important that reviewers can examine the analysis code and data underlying a study during peer review in order to assess the validity of its conclusions. This can also help researchers catch inadvertent errors before the results are published (potentially heading off a later retraction).

20.4 Where to share

Where should we share?
1. Dryad
2. Zenodo
3. Figshare
4. NCBI
5. PANGAEA
6. etc

There are lots of places online where datasets can be shared, and the choice of which to use for a particular project comes down to criteria that may differ in every case. These include journal or funder requirements, the size of the dataset being archived, and the community of researchers who use a particular repository (for example, PANGAEA for earth sciences, Dryad for ecology, NCBI for bioinformatics and sequencing, or a general-purpose repository such as Figshare).

Protocols.io
Code and text to repo then archive (e.g. GitHub -> Zenodo), data to DOI archive
Supplemental data/etc with published manuscript
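When code lives in a repository that is later archived, it helps to ship citation metadata with it. One common convention is a `CITATION.cff` file in the repository root. The sketch below writes a minimal one, assuming the Citation File Format 1.2.0 fields `cff-version`, `message`, `title`, `version`, `license`, and `authors`; the function name, the project title, and the author details are placeholders, and the ORCID shown is the well-known sample identifier rather than a real researcher's.

```python
import json


def _quote(value):
    # JSON string syntax is also a valid YAML double-quoted scalar,
    # so json.dumps handles quoting and escaping for us.
    return json.dumps(value)


def build_citation_cff(title, authors, license_id, version):
    """Return the text of a minimal CITATION.cff file.

    `authors` is a list of dicts with keys "family", "given",
    and optionally "orcid" (the bare 16-character identifier).
    """
    lines = [
        "cff-version: 1.2.0",
        f"message: {_quote('If you use this software, please cite it as below.')}",
        f"title: {_quote(title)}",
        f"version: {_quote(version)}",
        f"license: {_quote(license_id)}",
        "authors:",
    ]
    for author in authors:
        lines.append(f"  - family-names: {_quote(author['family'])}")
        lines.append(f"    given-names: {_quote(author['given'])}")
        if author.get("orcid"):
            lines.append(f"    orcid: {_quote('https://orcid.org/' + author['orcid'])}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    text = build_citation_cff(
        title="Example stream temperature analysis",
        authors=[{"family": "Carberry", "given": "Josiah",
                  "orcid": "0000-0002-1825-0097"}],
        license_id="MIT",
        version="0.1.0",
    )
    print(text)
```

Once the archive has assigned a DOI, it can be added to the same file so that the repository and the archived snapshot point at each other.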

20.5 Getting credit for sharing

ORCID
Google Scholar
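Credit only flows back to you if your identifier is recorded correctly in the metadata of what you share. An ORCID iD is 16 characters long, and its final character is a check character computed with the ISO 7064 MOD 11-2 algorithm, so a mistyped iD can often be caught before it reaches a repository. The sketch below is one way to perform that check; the function names are assumptions made for illustration, and it only confirms that an iD is well formed, not that it belongs to a registered person.

```python
import re

# Four groups of four characters; only the last character may be "X".
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if `orcid` is formatted correctly and its checksum matches."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]


if __name__ == "__main__":
    print(is_valid_orcid("0000-0002-1825-0097"))  # sample iD with a valid checksum
    print(is_valid_orcid("0000-0002-1825-0098"))  # last character altered
```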

20.6 Exercises