10  Data Science Workflows II: Managing your computational environment

Now that we have discussed aspects of a data science project, let’s talk about managing your computational environment. This can be a deeply personal (and divisive) topic, but it gets at the heart of the idea of workflow versus product. An intentional focus on project management and workflows can increase your overall efficiency and ease the transition to working in collaborative teams. In this chapter we’ll provide some behind-the-scenes case studies of projects that we have worked on. Let’s begin.

10.1 We all care about the environment

Reproducibility can be challenging. A typical data science project may call in several external libraries or packages, each at its own particular version. Contributed code on CRAN is continuously updated, so it is not uncommon for a library to advance by one or more versions after you begin a data science project.

In his mapping project, John used the library MODISTools (Hufkens 2023), which provides an easy interface between R and remote sensing products from MODIS. After the initial library was released, access to MODIS remote sensing products required additional protocols, which broke the package when John went back to revisit the map several months later. In addition, he had updated his version of R, and his installed version of MODISTools was no longer supported. Only after the package was updated could he use it again.

John found a workaround in the meantime, but this sort of breakage is common when working with code that is meant to be reproducible. If he had shared the project, others might not have been able to run the code.

Fortunately there are several tools available for you to manage this, or provide a reproducible environment on your computer:

  • renv systematically tracks which versions of R packages you have incorporated into a project, and allows for easy sharing of all libraries used with collaborators.
  • poetry is a Python-specific package and environment management tool, similar in functionality to renv.
  • conda is a language-agnostic environment manager, which can be used with Python, R, or Jupyter notebooks. See also mamba.
  • pip (Python)
  • uv (Python)
  • virtualenv (Python)
  • pyenv, for managing different Python versions

Note: Python projects often list their dependencies in a requirements.txt file.
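As a minimal sketch of what such a file records, the snippet below lists the packages installed in the current Python environment and formats them as pinned `name==version` lines. It uses only the standard library module `importlib.metadata`; the function names `installed_packages` and `format_pins` are hypothetical and chosen for illustration. In practice a command such as `pip freeze` gives a similar listing.

```python
from importlib import metadata


def installed_packages():
    """Return a dict mapping each installed package name to its version."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip installs with no readable metadata
        name = meta.get("Name")
        version = meta.get("Version")
        if name and version:
            packages[name] = version
    return packages


def format_pins(packages):
    """Format packages as sorted 'name==version' lines (hypothetical helper)."""
    ordered = sorted(packages.items(), key=lambda item: item[0].lower())
    return [f"{name}=={version}" for name, version in ordered]
```

Calling `print("\n".join(format_pins(installed_packages())))` prints one pinned line per package, which could be saved as a starting point for a requirements.txt file.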

Other tools capture more than the packages of a single language:

  • burrito recognizes that lab work is more than just computer work, so it captures a more realistic workflow (Guo and Seltzer 2012).
  • docker is a container-based environment management system. A fundamental concept of docker is the container, which is an independent computing environment.
  • sumatra

At the start of a project, use one of the above tools to record your environment; it will then keep track of the specific versions of all the libraries you are using locally. At the very least, this makes it easy to export the list of libraries to a README file or other documentation.
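Once versions are recorded, the record can be compared against what is actually installed. The sketch below is one hedged way to do that in Python, assuming the record is plain text with one `name==version` pin per line. The helper names `parse_pins`, `installed_version`, and `find_drift` are hypothetical, and the parser deliberately ignores anything that is not a simple `==` pin.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines; skip comments, blanks, and other specifiers."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_drift(pins, lookup=installed_version):
    """Return {name: (recorded, current)} for packages whose version differs."""
    drift = {}
    for name, recorded in pins.items():
        current = lookup(name)
        if current != recorded:
            drift[name] = (recorded, current)
    return drift
```

An empty result from `find_drift` means every recorded package is installed at the recorded version. The `lookup` parameter is there so the comparison can be tried against a plain dictionary without touching the real environment.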

10.2 Managing the computational environment across many languages and tools

  • Docker
  • Singularity
  • Virtual machines, local and cloud (VirtualBox, Vagrant, VMware, EC2 images, etc.)
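Before choosing among these, it can help to write down what the host system itself provides. The following is a rough sketch in Python, using only the standard library, that records the operating system, the Python version, and whether a few external programs are on the PATH. The function name `system_snapshot` and the default tool list are assumptions for illustration; substitute whatever your project actually calls.

```python
import platform
import shutil


def system_snapshot(tools=("R", "python3", "docker")):
    """Collect basic host details and the location of selected executables."""
    snapshot = {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    # shutil.which returns the full path of an executable, or None if not found
    snapshot["tools"] = {tool: shutil.which(tool) for tool in tools}
    return snapshot
```

A snapshot like this does not replace a container or virtual machine, but it documents the system-level assumptions that such tools are meant to pin down.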

10.3 Exercises

  1. Choose any of your current computational projects that uses just a single language (R, Python, etc.) and make a list of all of the packages and versions you use.
  2. Now do the same with a project using more than one language.
  3. Now also include system dependencies required for your code to run.
  4. Make a project sustainability plan. How will you do your best to allow your code to be run in other contexts – e.g. by other researchers or by yourself in the future?