Environmental Data Science: a Field Guide for Navigating the Age of Open Environmental Data

Authors

Naupaka Zimmerman and John Zobitz

Published

June 3, 2025

Introduction

Developing an understanding of the world around us is one of the first major tasks we take on as children. Through exploration and play, we learn about our environments and use that experience-based understanding to build mental models of how different places and organisms shape our lives. We are born to do this, and don’t need much more than our physical senses to gather sensory data and our minds to work out the patterns.

Modern technological progress has enabled humans to ‘see’ things that weren’t visible to us before, either because they are much too large (the cloud cover over an entire continent) or because they are much too small (the bacteria on the surface of plant roots). These expansions in data come from remote sensing by drones, planes, and satellites, from the massive capability of modern DNA and RNA sequencing methods, and from the enormous streams of data pouring in from continuous distributed environmental sensor networks. These measurement modalities have been used to enhance our understanding of natural, built, and agricultural ecosystems. They have enabled us to measure things much more precisely and to compare measurements from one place to another with a high degree of confidence. We have developed technologies that can now do the sensing for us and record what they’ve observed in the form of high-resolution digital data.

These technological shifts over the last few centuries have brought enormous insight and, for many, deepened our sense of wonder for the natural world. However, being able to use and properly interpret the data derived from these new modalities requires a much more complex set of skills than a childhood walk in the woods or swim in the lake.

Over the last couple of decades, a new field has emerged that encompasses the practices needed to work with the ever-growing size and complexity of environmental data sets. Environmental data science brings together domain knowledge about ecological processes with skills in statistics, computer programming, and ethical understanding. There has never been a more important time to engage with this new field. The scale and rate of change that ecosystems and our biosphere are experiencing are unprecedented in human history. We have developed the instrumentation to assess many of these changes in near real time, but working with these types of data requires environmental and ecological domain knowledge as well as a firm grasp of the technical skills needed to work fluently with very large, heterogeneous, and often messy datasets.

Why we wrote this book

We wrote this book to fill a niche we observed as university faculty teaching data science to ecologists and environmental scientists. We had trouble finding a succinct source to hand to advanced undergraduates or beginning graduate students that provided a ‘Hitchhiker’s Guide’ to the discipline, particularly one focused primarily on how to use data science tools and concepts to work with environmental data. A number of books in this area focus on the specifics of one particular aspect of this domain, like how to conduct GIS analyses with remote sensing data, or how to use the R programming language for data visualization. What we were looking for, and didn’t find, was a book that gave the 10,000-foot view of the many diverse areas required to become a skilled practitioner in environmental data science. Instead of doing a deep dive on the ins and outs of spatial statistics, or writing a treatise on methods for visualizing the outputs of atmospheric models, we have aimed to create a reference that helps students and professionals understand the major areas involved in working with large environmental data sets, the main software approaches and tools for each of these areas, and, at a high level, how these tools fit together as parts of an integrated data science workflow.

Who this book is for

We picture our readers as people with some disciplinary background in the earth, ecological, and environmental sciences who are just starting their journey into computational analyses. This includes advanced undergraduate students, who may have had a semester or two of statistics and some introductory science courses, but may not have had much prior exposure to programming or computational data analysis. It also includes graduate students who are diving into thesis or dissertation projects and wondering how to keep all the digital pieces working together and what direction they should take when analyzing the data they’ve just collected. Finally, it includes professionals in the environmental sector who are looking to build on their domain knowledge by adding computational skills to their field or laboratory expertise.

For example, we aim in this book to explain how and when we use an application called Docker to manage the software dependencies for an analysis project, not how the Docker program works on a technical level. Our goal is not to tell you how ssh works to encrypt your session when connecting you to a remote server in the cloud; our goal is to tell you how and when you might want to use ssh to connect to computational resources that allow you to process datasets that have become too big to fit on your laptop.

What this book is not

This book was written as a gateway for environmental scientists to engage with the computational skills and tools that are needed to work with modern environmental datasets. It is not meant to be an advanced treatment of algorithm development or complex statistical analysis for those who may already possess many of these skills. There are many existing books and journal articles that address those more technical aspects of this domain. However, while we target our content to the advanced beginner, we have also made a point throughout this book of highlighting particularly useful or relevant external sources for the reader who wants to take a deeper dive into a given topic.

Certainly, explanations for much of the content in this book could be obtained through structured prompts to large language models or other generative AI tools. Across our careers, we acquired much of our knowledge in environmental data science from deep reading of texts. We acknowledge that our understanding is imperfect. We aim to offer a narrative, cohesive in our minds at least, of how the different pieces and domain knowledge in environmental data science fit together, and we hope that it helps to jumpstart your own understanding. Each chapter introduction ends with the phrase “let’s begin”, which is an invitation for you to participate in and contribute to this collective knowledge sharing.

Getting your system up and running

We are assuming for the purposes of this book that you are using a Unix- or Linux-compatible system. This category of operating system is also often referred to as being POSIX-compliant. POSIX is a set of standards that helps ensure better compatibility across different operating systems. In practical terms, it means you are using a computer running macOS, some flavor of Linux/Unix, or Windows with the Windows Subsystem for Linux (WSL) installed. Standardizing in this way allows us to focus on approaches that can be applied across all of these platforms and not spend pages and pages describing how things happen differently across different operating systems. This has the additional benefit of standardizing on a set of tools and practices that are frequently available as part of free and open source software ecosystems, which means more people can use them in more types of situations.
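As a quick sanity check, the short Python sketch below reports which of these categories the current machine appears to fall into. It is only an illustration, not part of the book's setup instructions: the function name `describe_platform` is hypothetical, and detecting WSL by looking for "microsoft" in the kernel release string is a common heuristic rather than a guarantee.

```python
import os
import platform


def describe_platform(os_name: str, system: str, release: str) -> str:
    """Classify a platform from values such as os.name,
    platform.system(), and platform.release().

    The function name and the labels it returns are hypothetical,
    chosen only for this illustration.
    """
    # Python sets os.name to "posix" on macOS, Linux, and other Unix-like systems.
    if os_name != "posix":
        return "not POSIX-like (on Windows, consider installing WSL)"
    if system == "Darwin":
        return "macOS"
    if system == "Linux":
        # Heuristic (an assumption): WSL kernels usually include
        # "microsoft" in their release string.
        if "microsoft" in release.lower():
            return "Linux under WSL"
        return "Linux"
    return "other Unix-like system"


if __name__ == "__main__":
    # Describe the machine this script is running on.
    print(describe_platform(os.name, platform.system(), platform.release()))
```

Run inside the terminal you plan to use for the book; on Windows, that means running it from within the WSL shell rather than from a native Windows prompt.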

Acknowledgments

We would like to thank our editor, Lara Spieker at Taylor and Francis, for her unwavering support for this book and her understanding each time we asked for an additional extension. We know that without that support and trust we wouldn’t have been able to pull this off.

John would like to thank: his family, Shannon, Colin, Grant, and Phoebe, for their patience as this book was completed; Augsburg University for an institutional home that has helped build not just a career but a vocation; Naupaka for agreeing to take on this crazy idea; and all the students and colleagues who have helped shape his experiences.

This book has been informed by our participation in grants funded by public and private organizations.¹ Many of the datasets in this book are a product of public investments in science. We contend that investment in science is a public good for everyone. Our grants support not just our scientific research, but also undergraduate educational and research experiences. Many of our students are the “first”, as in the first in their family to attend university, or exploring scientific research outside of the classroom for the first time. We are privileged and humbled to witness their excitement at the wonder that comes from explorations in environmental data science. Public support of science makes that wonder possible. We offer this book in appreciation for ongoing public support of science.

Much like Senator Raphael Warnock’s declaration that “a vote is a kind of prayer for the world we desire for ourselves and for our children,” we hope the knowledge shared herein contributes to creating a world that we collectively desire. Let’s begin.

For John: NSF #2017829, #2303556, #2321958, #2519926; Fulbright Finland Foundation and Saastamoinen Foundation Grant in Health and Environmental Sciences. For Naupaka: