16 Approaches to empirical and process modeling

16.1 Crunching the numbers

Chapter 15 highlighted approaches to visualize multivariate information through exploratory data analysis. Beyond visualization, you may also need to “crunch the numbers” with statistical modeling and regression. This chapter equips you with a broad overview of the varied approaches to analyze data. If you are looking for a more in-depth treatment of statistical concepts, Bruce and Bruce (2017) is a great go-to resource for quick explanations of statistical concepts and their connection to data science, Carter (2020) for more mathematical foundations, or Cannon et al. (2019) for a good overview of linear regression. Let’s begin.

16.2 Empirical modeling

The fundamental idea of empirical modeling is to establish if there is (or isn’t) linear association between different predictor variables. The variables may be quantitative or categorical. Empirical modeling aims to determine the parameters \(\vec{a}\) that describe the functional relationship \(\vec{y}=f(\vec{x},\vec{a})\) with data \((\vec{x},\vec{y})\). Two examples of empicial modeling include linear regression and logistic regression, conceptually shown in Figure 16.2.

A two level diagram with a generic scatterplot with variables x and y and a positive linear association with red dots. From this diagram are two arrows down to the left and right with replicate diagrams and additional annotations on each. The left bottom diagram includes light blue line that represents the best fit line and has the title Linear Regression. The right bottom diagram separates the y axis into to levels high and low, colored green and purple respectively with the title Logistic Regression. — Figure 16.1: Schematic diagram of two types of empirical modeling to establish empirical relationships between data (paired measurements of x and y). Linear regression establishes a functional relationship between the two variables (the so-called “best-fit line”), whereas logistic regression classifies the predictor (y) variable into two categories (represented as “high” or “low”).

16.2.1 Linear regression

For linear regression, the response variable (or the dependent variable or the one on the y axis) is quantitative. The response variable is expressed as a linear combination of the predictor variables, i.e. \(\vec{y}=a_{0} + a_{1} x_{1} + a_{2} x_{2} + ... a_{n} x_{n}\), the so called “best-fit line”. Each term of the linear regression (\(a_{0}\), \(a_{1}\), etc) is directly proportional to the corresponding predictor variable (\(x_{1}\), \(x_{2}\), \(x_{3}\), etc). The parameters of the linear regression \(a_{i}\) all measure the rate of change in the response variable for a corresponding unit increase in the corresponding predictor variable. The regression coefficients are sometimes characterized as the effect of the predictor variable.

Direct proportions

In mathematics when we say one variable is directly proportional to another variable (e.g. growth rate is directly proportional to temperature), the mathematical expression is growth rate = a\(\cdot\)temperature. We need the constant “a” here because it may not be a direct 1-1 relationship. Linear regression is used to help define that relationship. The key thing to note is that if the predictor variable increases by a factor (e.g. doubles) than the response variable will also increase by the same factor, no matter the value of the constant a. Let’s say we have \(y_{1} = a \cdot x_{1}\), and the value of \(x_{1}\) doubles, so \(x_{2} = 2 \cdot x_{1}\). Then \(y_{2} = a\cdot x_{2} = a\cdot (2 \cdot x_{1}) = 2 \cdot (a \cdot x_{1}) = 2 \cdot y_{1}\).

Linear regression can include a mixture of categorical predictor variables, or also effects of variables together (i.e. \(a_{ij} x_{i} \cdot x_{j}\), which \(x_{i} \cdot x_{j}\) could be considered a new variable \(y_{ij}=x_{i} \cdot x_{j}\)). Of note is to declare the variables as categorical or designate them as factor variables. If you have a mixture of categorical variables (i.e. not a simple binary), there may be some nuances and choices in coding these, which may also influence regression values.

Linear regression (and variants of it) are common in all computational software programs. Python uses the scikit-learn library, which also implements several other machine learning methods. Linear regression in R uses the function lm, which stands for linear models. The methods to compute linear regression also include statistical information (such as uncertainty estimates, significance, etc). Of note is the broom package for R for extracting these out in a data table friendly format.

16.2.2 Logistic regression

Moving from simple linear regression, logistic regression is used when the predictor variable is a binary response variable (such as indicating presence or absence of a species).

Logistic regression forms the basis of many classification algorithms in machine learning (Boehmke and Greenwell 2019; Carter 2020). Logistic regression predicts the probability (called \(\pi\)) of the predictor variable occurring, using the probability of 0.5 as a cutoff, as determined by the odds. The odds is defined as the ratio of the probability of success (\(\pi\)) of the variable occurring to failure (\(1-\pi\)), or expressed as \(\frac{\pi}{1-\pi}\). Because the odds has a domain of all positive numbers, in practice the log-odds is computed (\(\ln \left(\frac{\pi}{1-\pi} \right)\)), giving a range of all real numbers.

Logistic regression then models the log odds as a linear function, with the log odds being a response variable and (i.e. \(=a_{0} + a_{1} x_{1} + a_{2} x_{2} + ... a_{n} x_{n}\) as with linear regression). Algebraic manipulation then be transformed back into an equation that models the probability \(\pi\) as a logistic equation.. In R the engine that does logistic regression is glm, or in Python as LogisticRegression.

Logistic equations

A logistic equation is sometimes described as an S-curve, due to its shape. Logistic equations are often used to describe limiting values, as opposed to linear or exponential growth which increases without bound. The expression for a logistic equation is \[\displaystyle y = \frac{K}{1+B \cdot e^{-rx}}\], where the parameter \(K\) refers to the carrying capacity or limiting value, \(r\) is the growth rate, and \(B\) is a constant related to the y intercept, or where the graph crosses the vertical axis. For logistic regression, \(K = 1\).

A graph showing a curve that is in the shape of an elongated S. The x axis spans -10 to 10 and the y axis 0 to 10. A grey horizontal dashed line is present at 5 on the vertical axis The graph is titled with sample logistic function. — Figure 16.2: Sample graph of a logistic function, where \(K=5\), \(r=0.7\), and \(B=2\).

16.3 Process modeling

Another approach is dynamic modeling (Zobitz 2023). A fundamental building block for dynamic modeling is identifying (1) state variables and (2) processes that transform different state variables from one to another. One area where this could be examined is process-based ecosystem modeling. To provide a specific example, let’s focus on soil carbon and transformation. Figure 16.3 displays a conceptual diagram for a soil model, where state variables are represented with rectangular boxes and processes with arrows.

A brown box contains the text C~S~. On the top part of the box a black arrow points in, with the tail of the arrow with the text alpha NPP. A second arrow points from the box with the tip of the arrow with text kS CS. — Figure 16.3: Model diagram for a conceptual model of soil decomposition and transformation. Soil carbon (\(C_{S}\)) is transformed through net primary production (\(\alpha NPP\)) with soil microbial transformation \(k_{S} C_{S}\).

In this model we have a single pool, \(C_{S}\), that represents the amount of carbon in the soil (usually measured in grams carbon per square meter, or g C m^-2). One process that carbon accumulates in the soil is through net primary production \(NPP\), or excess carbon generated from photosynthesis after respiration from aboveground components (leaves, stems, etc). In this case a proportion of \(NPP\) enters the soil from root exudates, which we can represent mathematically as \(\alpha \; NPP\), where \(\alpha\) is a proportionality constant. In the soil, microbes utilize carbon and then respire carbon as carbon dioxide. One way to mathematically represent the transformation by soil microbes is \(k_{S} C_{S}\), where \(k_{S}\) is a rate constant.

The dynamic model then describes the rate of change of soil carbon, \(\displaystyle \frac{dC_{S}}{dt}\) with the following equation: \[\displaystyle \frac{dC_{S}}{dt} = \alpha \, NPP - k_{S}C_{S}.\]

Figure 16.3 is a heuristic model that is most likely not suitable for prediction. Other refinements may include more state variables and processes for specific soil components. As an example, Figure 16.4 shows a conceptual diagram for the Yasso model (Liski et al. 2003; Liski et al. 2005). This model has many more state variables than our heuristic model above, notably specifying different types of soil compounds (e.g. sugars, ethanol, lignin), as that affects microbial decomposition.

Three panel plot with the first column used to conceptually describe a model. The first column shows picture of a leaf, text of foliage, branches, coarse roots, stems, stumps, and a picture of a tree. The second column contains boxes of the different state variables of the Yasso model. Arrows point from the each box in the second column to the third with CO2. — Figure 16.4: Conceptual diagram of the Yasso model which has several state variables and processes that affect soil carbon transformation and decomposition. Reproduced with permission of the Finnish Meteorological Institute under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

One challenge with more complex models is parameterization. Utilization of model parameterization methods (Markov Chain Monte Carlo parameter estimation or other machine learning methods), as described next.

16.4 Bayesian modeling and machine learning

Another approach is Bayesian modeling. This approach is one that assumes that either the response variables, predictor variables, or regression parameters follow some probability distribution. This information about the distribution of variables or parameters is then incorporated into the statistical modeling, which may sharpen or define the resulting approaches. that you can incoporate that information into your workflow. For example, applying Bayeisan modeling to our heuristic model in Figure 16.3 to estimate the unknown parameters \(\alpha\) and \(k_{S}\) might assume distributions on each of the respective parameters. A dataset of soil respiration or measured soil carbon over time at a site could used to derive a data-informed estimate of the parameters \(\alpha\) and \(k_{S}\). Viskari et al. (2022) used such an approach to calibrate the Yasso model from a worldwide database of litter decomposition and soil organic carbon measurements.

Probability distributions

Saying that a value follows a probability distribution means that you are indicating a state of belief (or sometimes known as probabilities) about its value. Probabilities range from 0 to 1. A probability distribution may refer to one of two concepts: the probability density function, which specifies the value for a state of belief about a value, or the cumulative distribution function, which represents the accumulated probability across the range of possible values for a variable.

Bayesian modeling is implemented through the MCMC (Markov Chain Monte Carlo) approach, which is a sequential algorithm that randomly explore the parameter space, evaluating whether to accept or reject a function based on the likelihood function (Zobitz et al. 2011). There are many considerations when doing Bayesian parameter estimation (Clark 2007; Gelman et al. 2014; Dietze 2017) which can be computationally intensive.

We mentioned machine learning in this subsection title as Bayesian parameter analysis is essentially a type of machine learning: both use algorithms to determine an optimal set of parameters given a specific target set. Machine learning is broadly categorized into supervised algorithms, which use a defined function to associate datasets (e.g., linear regression), and unsupervised algorithms, which identify patterns or structures within data without prior labels. While a deeper dive into these evolving techniques is beyond our current scope, Boehmke and Greenwell (2019), Kuhn and Silge (2022), and Géron (2023) are excellent resources for further study.

16.5 Causal inference

An emerging topic in environmental sciences is causal inference or causation (Cunningham 2021; Byrnes and Dee 2025). Much of statistics concerns itself with association - i.e. increased temperature is associated with soil respiration. Causal inference is the study of attribution - in other words soil respiration increases as a result of increased temperatures.

Isn’t this just semantics? Not really. Causation between two variables X and Y can defined in 4 different ways (X causes Y; Y causes X, X causes Y and Y causes X, or no causality). The number of ways that causation between two variables can be determined equals \(4^{N\choose 2}\) for \(N\) variables. If you have a dataset of several variables, you need to test all these different associations.

There are several approaches in which to determine causation:

Granger causality (Granger 1969): relies on statistical associations between two timeseries variables (say \(x\) and \(y\). Briefly linear regression of \(y\) against lagged values of \(y\) are conducted. Then a second regression of \(y\) with lagged values of \(y\) and lagged values of \(x\) are conducted Causality is determined in a statistical sense to see if including lagged values of \(x\) increases explanatory power of \(y\)
Transfer entropy (Schreiber 2000): borrows concepts from information theory to measure the Kullback-Leibler divergence. In this case, the Kullback-Leilbler divergence compares distribution of \(y\) and \(x\) (both with lags) to the distribution of \(y\) by itself. Significance is often estimated through bootstrapping, which may slow down processing time.
Convergent cross mapping applies a dynamical systems approach to determine causality. It views two separate timeseries as a projection of a complex dimensional space called a manifold. The goal is to reconstruct the manifold using the lagged timeseries. A good video primer on this elegant theory is found here: LINK

While briefly outlined here, causal inference is quickly becoming a necessary tool (along with advances in machine learning) for processing large datasets that may be associated with each other (Ferraro et al. 2019; Ombadi et al. 2020; Siegel and Dee 2025).

16.6 Exercises

Chapter 2 introduced datasets of reported weather and streamflow data (Figure 2.4) for the Mississippi in St. Paul, Minnesota. Figure 16.5 displays the reported values in 2024 of maximum temperature (max_temp), minimum temperature (min_temp), and precipitation (precip) at each day. Use this information to answer the following questions:

Three panel one column plot of an annual timeseries in 2024 of maximum temperature, minimum temperature, and precipitation. Both the maximium and minimum temperature show a periodic pattern with the maximum occurring mid year, and values on the y axis ranging from 0 to 75. The precipitation is more variable, more common zero, with a range from 0 to 2. — Figure 16.5: Reported weather data at the Minneapolis/Saint Paul International Airport in Minnesota USA for 2024. `max_temp` is the maximum temperature (degrees Fahrenheit), `min_temp` the minimum tempreature (degrees Fahrenheit), and `precip` daily precipitation (inches).

a. What would be a causal diagram that might influence how the variables presented in the graph relate to each other?

b. What would be a possible scenario where you were to apply linear regression to these variables?

c. What would be a possible scenario where you were to apply logistic regression to these variables?

For logistic regression, the binary response variable is based on the probability \(P\) of the event happening, with \(P=0.5\) being the cutoff for success and failure of the binary response variable. The odds of the variable is defined as \(\displaystyle \frac{P}{1-P}\), and the log-odds as \(\displaystyle \ln \left( \frac{P}{1-P} \right)\). If \(P=0.5\), what are the values of the odds and log-odds? If \(P<0.5\)? \(P>0.5\)?

Boehmke, Brad, and Brandon M. Greenwell. 2019. Hands-on Machine Learning with R. Chapman and Hall/CRC.

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists: 50 Essential Concepts. 1st ed. O’Reilly Media.

Byrnes, Jarrett E. K., and Laura E. Dee. 2025. “Causal Inference With Observational Data and Unobserved Confounding Variables.” Ecology Letters 28 (1): e70023. https://doi.org/10.1111/ele.70023.

Cannon, Ann, George W. Cobb, Bradley A. Hartlaub, et al. 2019. STAT2. 2nd ed. Macmillan Learning.

Carter, Nathan, ed. 2020. Data Science for Mathematicians. Chapman and Hall/CRC. https://doi.org/10.1201/9780429398292.

Clark, James S. 2007. Models for Ecological Data: An Introduction. Princeton University Press.

Cunningham, Scott. 2021. Causal Inference: The Mixtape. Yale University Press.

Dietze, Michael. 2017. Ecological Forecasting. Princeton University Press.

Ferraro, Paul J., James N. Sanchirico, and Martin D. Smith. 2019. “Causal Inference in Coupled Human and Natural Systems.” Proceedings of the National Academy of Sciences 116 (12): 5311–18. https://doi.org/10.1073/pnas.1805563115.

Gelman, Professor in the Department of Statistics Andrew, John B. Carlin, and Hal S. Stern. 2014. Bayesian Data Analysis. Chapman & Hall.

Géron, Aurélien. 2023. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media.

Granger, C. W. J. 1969. “Investigating Causal Relations by Econometric Models and Cross-spectral Methods.” Econometrica 37 (3): 424–38. https://doi.org/10.2307/1912791.

Kuhn, Max, and Julia Silge. 2022. Tidy Modeling with R. O’Reilly Media.

Liski, Jari, Ari Nissinen, Markus Erhard, and Olli Taskinen. 2003. “Climatic Effects on Litter Decomposition from Arctic Tundra to Tropical Rainforest.” Global Change Biology 9 (4): 575–84. https://doi.org/10.1046/j.1365-2486.2003.00605.x.

Liski, Jari, Taru Palosuo, Mikko Peltoniemi, and Risto Sievänen. 2005. “Carbon and Decomposition Model Yasso for Forest Soils.” Ecological Modelling 189 (1): 168–82. https://doi.org/10.1016/j.ecolmodel.2005.03.005.

Ombadi, Mohammed, Phu Nguyen, Soroosh Sorooshian, and Kuo-lin Hsu. 2020. “Evaluation of Methods for Causal Discovery in Hydrometeorological Systems.” Water Resources Research 56 (7): e2020WR027251. https://doi.org/10.1029/2020WR027251.

Schreiber, Thomas. 2000. “Measuring Information Transfer.” Physical Review Letters 85 (2): 461–64. https://doi.org/10.1103/PhysRevLett.85.461.

Siegel, Katherine, and Laura E. Dee. 2025. “Foundations and Future Directions for Causal Inference in Ecological Research.” Ecology Letters 28 (1): 3–22. https://doi.org/10.1111/ele.70053.

Viskari, Toni, Janne Pusa, Istem Fer, Anna Repo, Julius Vira, and Jari Liski. 2022. “Calibrating the Soil Organic Carbon Model Yasso20 with Multiple Datasets.” Geoscientific Model Development 15 (4): 1735–52. https://doi.org/10.5194/gmd-15-1735-2022.

Zobitz, J., A. Desai, D. Moore, and M. Chadwick. 2011. “A Primer for Data Assimilation with Ecological Models Using Markov Chain Monte Carlo (MCMC).” Oecologia 167 (3): 599–611. https://doi.org/10.1007/s00442-011-2107-9.

Zobitz, John. 2023. Exploring Modeling with Data and Differential Equations Using R. Chapman and Hall/CRC.