12  AI for Modern Environmental Data Science

AI covers a spectrum from logistic regression to general intelligence. What is AI for environmental data science? It spans LLMs as well as random forest models and support vector machines (traditionally thought of as machine learning). A modern data scientist needs an understanding of this spectrum of machine learning tools and where each can be applied. These tools offer enormous value. Given Moore’s Law for computing and the pace of ongoing innovation, the examples here are current at the time of publication.

12.1 Introduction

The use of machine learning and ‘Artificial Intelligence’ (AI) has exploded in recent years, but it can be challenging to keep up, not only because of rapid tool and method development, but also because definitions of what constitutes AI differ even among experts in the field.

Some say xyz, others pqr, etc

This chapter will give you an overview of some of the approaches and terminology in this rapidly evolving area and point you to some resources to learn more about the many tools and techniques available.

12.2 What do people mean when they say AI?

12.2.1 Statistics

12.2.2 Machine Learning

12.2.3 Deep Learning

12.2.4 Large Language Models (LLMs)

12.2.4.1 Strengths of LLMs

Summarizing text, lowering barriers to intercultural language translation, personalized feedback, help with coding

12.2.4.2 Weaknesses of LLMs

Hallucinations, environmental cost, data security/privacy issues

12.2.4.3 Types of LLMs

Model size, raw API vs chatbot, ‘thinking’ LLMs

12.2.4.4 Local and Open LLMs

Ollama, Hugging Face, etc

12.2.5 Case studies of Machine Learning and AI in the Environmental Sciences

12.2.5.1 Code translation from one language to another

Let’s examine differences between common environmental data science languages using the power of ChatGPT.

Generative AI is an offshoot of machine learning methods from data science, and so provides a good case study for examining differences between environments for data science. These artificial intelligence tools (e.g. ChatGPT) have rapidly transformed our daily lives (especially post 2023) and how we interact with the internet. For scientific research, disclosing the use of generative AI tools is recognized as part of maintaining scientific integrity (Bockting et al. 2023; American Journal Experts (AJE), n.d.; Bertolo and Antonelli 2024). Let’s use them here to contrast how the different environments in Table 3.1 produce output.

If we chose R, a prompt to ChatGPT might look like the one in Figure 12.1:

Screenshot of a prompt to ChatGPT saying write code to plot halfhourly temperature data using R with tidyverse syntax.
Figure 12.1: A prompt to generative AI (ChatGPT), asking for help in plotting data using R.

Figure 12.2 shows its response, which is (admittedly) a well-organized (and documented!) explanation of starter code:

ChatGPT output of a prompt saying write code to plot halfhourly temperature data using R with tidyverse syntax.
Figure 12.2: Snapshot of the response from generative AI on how to plot half-hourly temperature data using R and tidyverse syntax. Notice the well-documented code!

The provided code loads the correct library (tidyverse), converts the time to the POSIXct format (which makes working with dates and times easier), and generates a well-labeled plot. Not too shabby. Based on our knowledge of R, we would also award extra credit for using the tidyverse pipe (%>%) in the code, but perhaps not full credit, since base R now has its own native pipe (|>).

Now let’s give the same prompt for Python (Figure 12.3):

ChatGPT output of a prompt saying write code to plot halfhourly temperature data using Python.
Figure 12.3: Snapshot of the response from generative AI to construct code to plot half-hourly temperature data in Python.

The code reads like a beat-for-beat rehash of the R code, but now in Python (much like how Star Wars: The Force Awakens was reviewed; please don’t @ us!). There are some differences to note:

  • Similar to R, in Python libraries are loaded at the start of the code (with the command import). However, when a library is imported this way, its name must prefix any command you use from it (e.g. to_datetime is a function in the pandas library, so it is called as pandas.to_datetime). Thankfully, Python lets you define a shorter alias for a library, as in import pandas as pd, after which the call becomes pd.to_datetime; choose whichever alias makes sense for you. (NOTE: In R, if you only want to use a particular function from a package, refer to it with the double colon, ::, e.g. PACKAGE::FUNCTION.)
  • Python doesn’t have a native pipe operator like R’s (|>). The assignment operator is the equals sign (=) rather than R’s left-facing arrow (<-).
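
The points in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the code ChatGPT returned in Figure 12.3: the timestamps and temperatures are made-up values, all variable names are our own, and we assume only modules from Python’s standard library (datetime and statistics) so that the sketch runs without installing anything.

```python
# Libraries are imported at the top; `as` defines a short alias.
import datetime as dt
import statistics as stats

# Made-up half-hourly readings, for illustration only.
raw_times = ["2024-06-01 00:00", "2024-06-01 00:30",
             "2024-06-01 01:00", "2024-06-01 01:30"]
temps_c = [14.0, 13.5, 13.0, 13.5]

# Assignment uses `=` (R would use `<-`), and each function is
# prefixed with the alias of the module it comes from.
times = [dt.datetime.strptime(t, "%Y-%m-%d %H:%M") for t in raw_times]
step = times[1] - times[0]

# Python has no pipe operator, so function calls are nested instead.
# The R equivalent with the native pipe: temps_c |> mean() |> round(1)
mean_temp = round(stats.mean(temps_c), 1)

print(f"Time step: {step}, mean temperature: {mean_temp} degrees C")
```

With pandas installed, the same pattern applies: after import pandas as pd, the to_datetime function is called as pd.to_datetime.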

Finally, we gave ChatGPT the same prompt, this time for Julia (Figure 12.4):

ChatGPT output of a prompt saying write code to plot halfhourly temperature data using Julia.
Figure 12.4: Snapshot of the response from generative AI to construct code to plot half-hourly temperature data in Julia.

Can you spot the differences (and similarities) between the Julia output in Figure 12.4 and the R (Figure 12.2) or Python (Figure 12.3) output? For all practical purposes, the choice comes down to preference: which language you are most familiar with.

12.2.5.2 Image Processing/Classification/Annotation

12.2.5.3 Audio Processing/Classifying

12.2.5.4 Other

12.2.6 Environmental concerns with AI use

Training models, running models, power and water requirements, carbon footprint, what types of models use more or less energy and resources to run

12.3 Sources

Generative AI link: LINK

12.4 Exercises