Skills

Now to the fun part! There are many concepts that an environmental data scientist needs to be competent with. This section provides a high-level overview of some of the approaches you can use to work with your data.

We begin by introducing a few more computer languages that you may want to consider developing some skill in as you progress in your career. Think of these as second languages that enable you to handle more specialized cases of analysis, for example using relational databases to summarize datasets too large to fit in the memory of even the largest cloud server, or a scripting language like bash, which lets you build computational workflows on *nix machines and better manage complex computational environments and software dependencies.
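To make the relational-database idea concrete, here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical air-quality table (the table, columns, and values are invented for illustration). The key point is that the aggregation runs inside the database engine, so only a small summary ever comes back into your session's memory:

```python
import sqlite3

# An in-memory database stands in for a large on-disk one; the same
# pattern applies when the table holds millions of rows that would
# not fit in memory.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (site TEXT, pm25 REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("A", 12.0), ("A", 14.0), ("B", 30.0), ("B", 34.0)],
)

# The GROUP BY aggregation happens in the database engine; Python
# receives only the per-site summary rows.
rows = con.execute(
    "SELECT site, AVG(pm25) FROM readings GROUP BY site ORDER BY site"
).fetchall()
print(rows)  # [('A', 13.0), ('B', 32.0)]
```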

In 12  AI for Modern Environmental Data Science, we give an overview of the broad category of ‘Artificial Intelligence’ and what it can mean for your work. Artificial Intelligence (AI) and Machine Learning (ML) have been available approaches for decades, but have reached a new level of uptake and awareness due to the remarkable advances in generative large language models like ChatGPT from OpenAI, Claude from Anthropic, or Gemini from Google. We suggest ways that these modern generative Large Language Model (LLM) tools can help you be a better environmental data scientist, and alert you to areas where caution is advised.

With that stage set, we move into a set of chapters that introduce you to several conceptual areas relevant to working with scientific datasets. We start with an introduction to some of the approaches used to manipulate (or “wrangle”) datasets to make them ready for analysis or visualization in 13  Wrangling and joining data. Following that, in 14  Iteration without tears, we introduce powerful approaches for applying the same processing code to a set of data objects – whether hundreds or thousands of individual files, or numerous parts of a single large dataset. While those with some basic programming background may be familiar with using sequential for loops to accomplish this type of task, we also highlight more functional approaches based on map-reduce concepts.
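The contrast between a sequential for loop and a functional map-reduce style can be sketched in a few lines of Python. The yearly "files" and the per-file summary function below are invented for illustration; real workflows would read actual files:

```python
from functools import reduce

# Hypothetical yearly "files", each holding daily temperature readings.
datasets = {
    "2021.csv": [1.0, 2.0, 3.0],
    "2022.csv": [2.0, 4.0, 6.0],
}

def annual_mean(values):
    """The per-file processing step, applied identically to every dataset."""
    return sum(values) / len(values)

# A sequential for loop...
means_loop = {}
for name, values in datasets.items():
    means_loop[name] = annual_mean(values)

# ...and the equivalent "map" step. Because each call is independent,
# this form parallelizes naturally (e.g. via multiprocessing.Pool.map).
means_map = dict(zip(datasets, map(annual_mean, datasets.values())))

# A "reduce" step then combines the per-file results into one number.
overall = reduce(lambda a, b: a + b, means_map.values()) / len(means_map)
print(means_loop == means_map, overall)  # True 3.0
```

Both styles produce the same per-file results; the functional form makes the independence of each step explicit, which is what lets it scale.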

With your data wrangled and processed, next up comes visualization. We devote 15  Approaches to visualizing data to a high-level conceptual discussion of how to make good data visualizations. Think beyond just scatterplots and bar charts and consider what point you’re trying to make before deciding how to communicate it through a visualization. We look at the same dataset several different ways to demonstrate the effect that visualization choices can have on the overall take-home message of a plot as well as its emotional effect on the viewer.

The last major component of most analyses is some sort of modeling. We cover the major types of modeling, from statistical to process-based, in 16  Approaches to empirical and process modeling. While this chapter could form the basis of an entire volume, our goal is to give you a sense of which category of modeling might be most appropriate for your project’s goals and objectives. Sometimes a simple linear regression is all you need, and other times it doesn’t quite do the job.
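As a taste of the simplest case, here is a hand-written ordinary least squares fit in plain Python. The rainfall and streamflow numbers are synthetic (constructed to lie exactly on a line so the result is easy to verify), and serve only to show the closed-form slope and intercept formulas:

```python
# Hypothetical data: fit streamflow = slope * rainfall + intercept
# by ordinary least squares.
rainfall = [10.0, 20.0, 30.0, 40.0]
streamflow = [25.0, 45.0, 65.0, 85.0]  # exactly 2*x + 5, so the fit is exact

n = len(rainfall)
mean_x = sum(rainfall) / n
mean_y = sum(streamflow) / n

# slope = covariance(x, y) / variance(x); intercept follows from the means
slope = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(rainfall, streamflow)
) / sum((x - mean_x) ** 2 for x in rainfall)
intercept = mean_y - slope * mean_x
print(slope, intercept)  # 2.0 5.0
```

In practice you would reach for a statistics library rather than writing this by hand, but the two-line core is a reminder of how little machinery a simple linear regression actually needs.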

We wrap up this section with 17  Data Science Workflows III: Workflow execution approaches and tools, which discusses ways to tie all the parts of your analysis together into a coherent workflow. We introduce some of the software tools that can be used to efficiently execute complex interdependent projects. These tools (e.g. the venerable make and the Python-based snakemake) help create machine-executable documentation of how the pieces of your project fit together and how outputs from one script become inputs to a later one.
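A minimal sketch of that machine-executable documentation, in make's own syntax: each rule names an output, the inputs it depends on, and the command that links them. The script and file names here (clean.py, plot.py, and the data paths) are hypothetical placeholders, not part of any real project:

```make
# Hypothetical two-step pipeline: clean the raw data, then plot it.
# make reruns only the steps whose inputs have changed since the
# output was last built.

figures/trend.png: data/clean.csv plot.py
	python plot.py data/clean.csv figures/trend.png

data/clean.csv: data/raw.csv clean.py
	python clean.py data/raw.csv data/clean.csv
```

Running `make figures/trend.png` executes both steps the first time; afterwards, editing only plot.py triggers just the plotting step, because the cleaned data is already up to date.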

By the end of this section, you should have a grasp of the major areas of analysis skills you’ll need to handle most scientific data analysis tasks.