19 Sharing your code the open way – Environmental Data Science: a Field Guide for Navigating the Age of Open Environmental Data

19.1 Considerations for sharing

There are three things to consider when sharing code in an open format.

availability: is the code stored online at a website that can be accessed for a user?
citeable: is there a reference to the code for dissemination, or does it just refer to a website?
licensing: what are ways that the code can be modified?

We discuss each of these different considerations and the benefits and disadvantages to them.

19.1.1 Available: code storage through github and friends

Chapter 5 discussed how to utilize git, paired that with code repositories such as github. Whether you use github, gitlab, or bitbucket, the advantage these only code storage sites is version control for code development. This online version control is useful - and perhaps essential - for remote collaboration. These code storage sites have different tiers (and costs) for storage. We suspect that the free version for all sites provides enough storage to share your code.

Online code repositories are one way to build a professional presence. Chapter 18 provides guidelines and frameworks for documentation in a code repository. As you begin to build your professional presence, please do pay attention to the structure of your code. How you structure your github repository depends a lot on stylistic preferences that you have, but it is best to have some variation on this structure, as introduced in Chapter 5 (similar to Figure 5.2):

Data. May include raw data or processed data files.
Analysis. May include scripts to process data, models, or outputs
Outputs: Derived outputs from analyses and processing input data. Good to separate that these are your work
Reports: May include manuscripts, vignettes (short descriptions of code), or other documentation needed to understand the code.
README: An introductory file / landing page where someone would look first.

Setting up this file structure at the beginning of a project is preferred, rather than the kitchen sink approach and trying to make it all work together at the end. The README file (see Chapter 18), which can be a useful introduction to your space in the repository.

One of the issues with scripted code is reproducibility across different computers and operating systems. Each computer installation can be uniquely different, affecting processing time on code². Additionally, packages (or contributed code) used to generate results may be updated - or worse - depreciated, affecting reproducibility in the future. As a workaround, containers for cloud computing (see Chapter 10) aim to reproduce computational environments and results as the code creator intended it.

19.1.2 Citation: oh code, where art thou?

Sharing your code on code repositories is a good first step, although while it is archived, it may not be citable. Websites (and by association web addresses) change over time. Including a link to a code repository or website in a paper may be outdated when it is accessed in the future. The online tool zenodo (European Organization For Nuclear Research and OpenAIRE 2013) is a digital library of research artifacts for permanent storage. Notably, with zenodo you can create of a digital object identifier with your code and research products, allowing the work to be permanently citable in academic literature. In this way, digitial object identifiers is an example of one of the data feminism principles of giving credit where credit is due (D’Ignazio and Klein 2020).

19.1.3 Licensing: building the (re)mix tape

We imagine that you are familiar with the idea of copyright from media - works of art (which includes literature, film, art, music, and others) all have a copyright that provides legal protection in case the work (or parts of work) are sampled or redistributed with someone else claiming authorship.

In the same vein as copyright, choosing a license for your code, which specifies in what ways your research outputs can be reproduced. While one suggestion is to just “pick a license”, having an understanding of the nuances between them is helpful. Creative commons licenses are an answer to the rigid copyright structure. These licenses are structured so that you can retain the copyright license to your work, but also put the work in the public domain for others to use. There are six types of creative commons licenses across a spectrum. Licensing for code is different because of the way it might be used, however here is a rundown of some common terms and licenses you may see:

The most open is CC0, which means that you put the work in the public domain - essentially you are just posting the work to the world. In the United States works of art enter the public domain after a specified length of time - currently as of this writing, which is 95 years. We don’t recommend that CC0 as a go-to, as sometimes you get some strange mashups - such as Winnie the Pooh horror films.
Another license is CC BY means that others can use, adapt, and remix your work, but must give you credit for the work. This is the least restrictive license.
On the other hand, the most restrictive is CC BY NC ND, which means the work needs attribution, no commercial distribution, and no derivatives (ND). The additional suffix SA means that adaptations must be shared under that same terms (i.e. you if a copyright is CC BY SA, it cannot be changed to CC0).
The MIT license allows the secondary user to modify and distribute the code with minimal restrictions, but the original copyright needs to be carried forward. However, the original copyright notice and license text must be included in all copies or substantial portions of the software.
The Gnu Public License (GPL) is an example of copyleft licenses that allow the end user to share, modify, and redistribute the work freely. The caveat is that code licensed as GPL in derivative works cannot be modified to a more restrictive license, essentially keeping its openness in perpetuity.
The final option is not specifying a license, or putting the code into the public domain. In this way, you are essentially putting your code in the public domain for others to use. But, if you have spent the time in writing specialized code, we always recommend specifying a license.

19.2 Sharing is easy; maintaining is golden

Getting to the step of sharing out your project for the larger community is a momentous step; this is the first step in a journey to maintain and provide a useful resource to your community of users. Working with online code repositories also provide a level of interactivity with for the larger community.

One particular way your code interacts with the broader community is through issues. Both github or gitlab allows for users to leave notes, request features, or point out the inevitable error in the code. Issues are a vital partner to network with the broader community and understand the impact of your work. It is nice to be seen and know that your work is being used in the world (even if an issue oftentimes means something may be amiss with your code).

If you get an issue don’t panic - sometimes the suggestions are ways to make things better. Receive feeding from users is an introductory example of inclusive coding, or incoding (Buolamwini 2023). We also recommend periodic code review (either from independent reviewers or trusted colleagues) to provide feedback and suggest modifications (Peters 2025).

19.3 A cautionary tale of using AI

Large language models and assistive AI are good resources to help generate answers when troubleshooting errors, or as a helpful assistant to remind yourself about syntax. We urge you to trust, but verify any code. Here’s a cautionary tale from John:

John is the author of and R package called demodelr (Zobitz 2022), which provides tools and support for solving differential equations with R. As he was preparing a class lecture that utilized this package, he forgot how the syntax for a particular function worked. On a whim, John asked Google Gemini to tell him more about the package. Figure 19.1 shows what is the AI returned:

A series of text paragraphs from an AI prompt with highlighted text in green. — Figure 19.1: Gemini output from a prompt asking about a code package the co-author wrote.

Ok. So far so good. The initial response is in line with what he expected. But Gemini kept going, and that is where things became … interesting in Figure 19.2:

Yes, the demodelr package does have the functions euler and rk4, but gillepsie, not so much. It does have support for Markov Chain Monte Carlo parameter estimation, but as author, John was confident that gillepsie was not correct. Recognizing that AI can hallucinate, I decided to test who was the author of this package, with the output from a prompt about me in Figure 19.3:

A series of text paragraphs with a response from an AI prompt. — Figure 19.3: Gemini output from a prompt asking about the author of the `demodelr` package. It gave the incorrect author.

Now he started to get worried. Could there be two packages with the same name? John independently that L. Scott Mills is at the University of Montana. From what he could tell, he has not authored a similar package named demodelr. So Gemini needed correction (Figure 19.4):

No shade to Google gemini, but this was a cautionary tale to (cautiously) “trust, but verify”. AI is only as good as the data that trains the model.

19.4 Exercises

Investigate the websites github, gitlab, and bitbucket. Which makes the most sense for online code storage for your uses?
Review guidelines for Python packaging as well as the tidyverse style guide. What similarities and differences do you notice between them?
Work through the process of filing and resolving an issue on one of your own online code respositories.
Assign a license to a code project that you have developed.