4 Ethics in environmental data science
In early 2014 the journal Nature published a study that developed a new technique to manipulate stem cells into different cell types (Obokata et al. 2014). This study attracted more public interest than usual because the technical advancement would be a significant leap in the application of stem cells to treat disease. However, later that year, the journal retracted the study due to issues in replication and poor data management (“STAP Retracted” 2014). The issues surrounding the retracted study are complex but nonetheless serve as a case study for the need to discuss ethics in environmental data science.
Over 3 million science and engineering papers are published each year (National Science Board 2023), and the number of retracted papers each year continues to increase (Fang et al. 2012; Van Noorden 2023). Lievore et al. (2021) found that data issues are one of the main reasons for retractions, which cuts to the heart of much of the work that we do as environmental data scientists. We consider retraction of a paper to be a final step in a long process that could be avoided. We argue that consideration of ethics is a high priority at the start of a data science project, and this chapter begins that discussion.
4.1 Defining ethics
The Merriam-Webster dictionary defines ethics as “the principles of conduct governing an individual or a profession” (Merriam Webster 2025). Along those lines, professional societies have ethical codes of conduct (for examples, see statements by the American Geophysical Union, European Geosciences Union, or the Ecological Society of America). These codes of conduct are a good place to start, but they speak broadly to an entire profession.
In data science, ethics also extends to how data are collected, analyzed, and reported. Data science is often used to build prediction models and forecasts about people, which can unintentionally create a weapon of math destruction (O’Neil 2016) or algorithmic bias (Noble 2018). The proliferation of artificial intelligence also introduces new ways to enforce and maintain inequality (Benjamin 2022; Buolamwini 2023). While many of the measurements in environmental data science are of non-human subjects, that does not excuse one from considerations of ethics.
4.2 Embrace open workflows
An intentional shift towards an ethical environmental data science model is a commitment to practicing open science: open data, open workflows, open products. The trend towards open science is supported by funding agencies such as the National Science Foundation. Open science could mean:
- Data that are accessible to the broader community (not restricted to sole use by the principal investigator). Many environmental networks allow for access to their data (e.g. National Ecological Observatory Network, Environmental Data Initiative, and others). We explore how to access these data in Chapter 7.
- Workflows that are open and translatable to different computational environments. Science publication (as always) requires one to publish the methods used to produce a result; open workflows provide for outputs that are machine independent. Considerations of open workflows are discussed more in Chapter 17.
- Development of products that allow you to share your work with the broad scientific community. These products then magnify the impact of your work, and include sharing code (Chapter 19) and data (Chapter 20) broadly.
Transitioning to an open workflow might help avoid an unnecessary (and embarrassing) retraction. Figure 4.1 posits a workflow broken into three pieces: Collect → Analyze → Communicate, inspired in part by similar workflows proposed in Wickham and Grolemund (2017) and D’Ignazio and Klein (2020). Let’s discuss each of these steps separately.
4.2.1 Collect
Environmental networks such as the National Ecological Observatory Network (NEON) hold open data as a core value. In theory, because all NEON data are open, access barriers are reduced: anyone with an internet connection can access the data. Arguably, open data (and access to them) are what drive many environmental data science projects. Open data are also tied to data equity and, relatedly, data equity literacy.
While there isn’t a universal standard for open data (and perhaps Justice Potter Stewart’s “I know it when I see it” test may not entirely apply), at a minimum we posit that open data are data that can be read into any programming language with a minimum of user fiddling. A commonly used framework is the FAIR principles (Wilkinson et al. 2016). FAIR stands for:
- Findable
- Accessible
- Interoperable
- Reusable
The FAIR data principles apply at different stages of data collection and analysis. While not an exhaustive list, some guiding questions include:
- Where will the data be stored during each stage of the process?
- Will the data be accessible to everyone, or just people currently analyzing the data?
- Are the formats used for data storage (including model outputs) accessible across a variety of software programs (e.g., Excel versus CSV)?
- Will the analysis require a specific software program?
- How will the analysis of the data be structured?
- Will the code be commented directly, making it easier for someone else to reproduce the work?
- Will the outputs be stored in a format that can be used by others?
- How will the data be maintained should they be used in other studies?
For example, metadata standards for ecological data make it easier to synthesize data across studies, touching on all aspects of the FAIR data principles (Jones et al. 2019; Dietze et al. 2023). Later chapters (Chapter 17 through Chapter 22) all touch on ways to make the workflow better.
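As a minimal sketch of the storage-format and metadata questions above, the Python example below writes a table to CSV along with a JSON sidecar file describing each column. The function name `write_with_metadata`, the sidecar layout, and the example site and column names are all hypothetical and are not a formal metadata standard; treat this as an illustration rather than a recommended schema.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_with_metadata(rows, fieldnames, units, out_dir, stem):
    """Write rows to <stem>.csv plus a JSON sidecar describing each column.

    The sidecar layout is a hypothetical illustration, not a formal standard.
    """
    rows = list(rows)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data_path = out_dir / f"{stem}.csv"
    meta_path = out_dir / f"{stem}.metadata.json"

    # Plain-text CSV can be read by nearly any language or spreadsheet program.
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Record what each column means so the file can be reused later.
    metadata = {
        "data_file": data_path.name,
        "format": "text/csv",
        "n_rows": len(rows),
        "columns": [
            {"name": name, "units": units.get(name, "unknown")}
            for name in fieldnames
        ],
    }
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return data_path, meta_path


if __name__ == "__main__":
    # Made-up example measurements, written to a temporary directory.
    example_rows = [
        {"site": "SITE_A", "date": "2024-06-01", "soil_temp_c": 14.2},
        {"site": "SITE_B", "date": "2024-06-01", "soil_temp_c": 15.8},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        data_file, meta_file = write_with_metadata(
            example_rows,
            fieldnames=["site", "date", "soil_temp_c"],
            units={"date": "ISO 8601 date", "soil_temp_c": "degrees Celsius"},
            out_dir=tmp,
            stem="soil_temperature",
        )
        print(meta_file.read_text(encoding="utf-8"))
```

Keeping the description of the data next to the data file is one small step toward the Interoperable and Reusable parts of FAIR; a real project would likely adopt a community metadata standard instead of an ad hoc layout like this one.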
4.2.2 Analyze
Scientific papers universally require a methods section, where the authors should provide sufficient description of how they went about their analyses (Schimel 2011). We know a challenging constraint on any methods section is journal space, and it can become a balancing act to figure out what to include (or not). Taking a backwards design approach to sharing out the analysis is a helpful framework (Zelner et al. 2022).
Open environmental data science analyses are different from a tightly controlled laboratory experiment. Analysis could include more description of how data sets were accessed and wrangled to produce results. There is a spectrum of ways this can be done: including a folder for analysis (or better, a versioned archive on Zenodo or another repository), interactive Jupyter notebooks, or Docker containers that provide a self-contained computing environment to aid in reproducibility.
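One lightweight way to document how a data set was accessed and wrangled is to save a small provenance record alongside the analysis. The Python sketch below is an assumption about what such a record might contain (a file checksum, the stated source, the access time, the software environment, and a list of wrangling steps); the function names and field names are hypothetical, and this does not replace a versioned archive or container.

```python
import hashlib
import json
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(data_path, source, steps):
    """Build a dictionary describing where a data file came from.

    The field names are a hypothetical layout chosen for illustration.
    """
    return {
        "data_file": Path(data_path).name,
        # The checksum lets a reader confirm they have the same file.
        "sha256": sha256_of_file(data_path),
        "source": source,
        "accessed_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "wrangling_steps": list(steps),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in data file with made-up contents.
        data_file = Path(tmp) / "example.csv"
        data_file.write_text("site,value\nSITE_A,1.0\n", encoding="utf-8")

        record = provenance_record(
            data_file,
            source="hypothetical data portal download",
            steps=["dropped rows with missing values", "averaged to daily means"],
        )
        print(json.dumps(record, indent=2))
```

Saving the printed record as a JSON file in the analysis folder gives readers a concrete starting point for checking that they are working from the same inputs.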
4.2.3 Communicate
Open science is quickly becoming the default workflow. Funding agencies have implemented mandates for open science (Open Access Network 2024). Projects funded by the United States National Science Foundation are required to make data, publications, and other project materials available in a public repository immediately upon publication (National Science Foundation 2024).
Scientific journals also provide incentives or requirements for open data. In many cases data and models are to be archived with a digital object identifier (see examples from Geoscientific Model Development, journals published by the American Geophysical Union, or Springer). Chapter 5 and Chapter 19 introduce more aspects of version control for data and models.
A secondary benefit of the Communicate step is that data scientists learn from each other by sharing their work (such as through the TidyTuesday social data project). So rather than treating communication as an afterthought, the Communicate step provides an opportunity to leverage tools such as Jupyter, Quarto, or Colab notebooks to multiply your impact.
While we’ve presented the open workflow of Collect → Analyze → Communicate as uni-directional, in practice the progression is rarely linear. An initial analysis of data may necessitate further data collection to support research hypotheses. Collaboration in teams requires communication, which then spurs additional analyses.
4.3 Data, power, ethics
Environmental data science is, in many ways, about counting the world around us as a tool to create understanding. In some respects, data science is an ethical act that decides who gets counted and in what ways people are counted (D’Ignazio and Klein 2020). How we count, and the ways in which we decide if a result is valid, are not necessarily neutral. Some scholars argue that the scientific method of hypothesis testing is a colonized paradigm (Hira 2015; Thakur 2023) that ignores the contribution of Indigenous knowledge to advancing scientific understanding (see studies by Mazzocchi (2006); Wheeler and Root-Bernstein (2020); and Jessen et al. (2022) as places to begin to learn more).
Another consideration is who has access to science. Open science, even with the aim of increasing access, may privilege more well-resourced institutions (Ross-Hellauer 2022). How data are shared (and who has access to them) should be discussed at the outset of a project. We would be well positioned to center the voices of those impacted by open science (Sanjana 2021). We tackle questions of authorship in Chapter 22.
Oftentimes, conversations that seek to broaden understanding of the implicit biases rooted in the scientific method are framed as “difficult” conversations. These should not be “difficult” conversations to have; perhaps they could instead be considered unpracticed (Jett 2020). Reframing them in this way helps broaden understanding of how to address systemic barriers to access in the sciences.
Similar to developing understanding and knowledge across the lifecycle of a data project, ethics is not relegated to a single topic or chapter. Future chapters investigate walls that prevent understanding in environmental data science (Chapter 9), the spectrum of artificial intelligence in a data science project (Chapter 12), and facilitating equitable authorship models (Chapter 22). Many of the topics discussed in this chapter intersect with broader societal issues related to data, highlighting how ethics in environmental data science both shapes and is shaped by our understanding of the wider world.
4.4 Exercises
Review codes of conduct for two different professional societies (some suggestions are listed in Section 4.1). What commonalities do you notice between them? Anything different or novel? What surprised you about the codes of ethics?
Explore the different types of open-access publishing. How would they fit in with the FAIR data principles?
Compare and contrast access fees for two journals in your area of environmental data science. How are they committed to open access? What are the publication fees/costs associated with each?
If you work in a lab or research group, do an audit of the data practices. How well do they conform to the FAIR principles?
Review the following organizations that seek to promote equity in data science:
- Ecological Forecasting Initiative
- Institute for the Quantitative Study of Inclusion, Diversity, and Equity
- Data For Black Lives
- Design Justice
- Data Visualization and Human Rights
- Data Umbrella
- We All Count
What aspects of ethics for data science do you see in common with your work? What additional considerations or ideas did these organizations prompt you to think about?