8  Linear Regression and Curve Fitting

8.1 What is parameter estimation?

Chapters 1-7 introduced the idea of modeling with rates of change. There is much more to say about qualitative analyses of a differential equation (Chapter 5), a topic we will return to starting in Chapter 15. But for the moment, let’s pause and recognize that a key motivation for modeling with rates of change is to quantify observed phenomena.

Oftentimes we wish to compare model outputs to measured data. While that may seem straightforward, the comparison is complicated when models have parameters (such as \(k\) and \(\beta\) in Equation 6.1 from Chapter 6). Parameter estimation is the process of determining model parameters from data. Stated differently:

Parameter estimation is the process that determines the set of parameters \(\vec{\alpha}\) that minimizes the difference between the data \(\vec{y}\) and the output of the function \(f(\vec{x}, \vec{\alpha})\), taking into account the measurement error \(\vec{\sigma}\).
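One common way to make “minimize the difference” precise - a preview of the cost functions developed in later chapters - is to choose \(\vec{\alpha}\) to minimize a sum of squared residuals, scaled by the measurement error:

\[ \min_{\vec{\alpha}} \sum_{i} \left( \frac{y_{i} - f(x_{i}, \vec{\alpha})}{\sigma_{i}} \right)^{2}. \]

In this chapter we will let R handle the minimization for us.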

Over the next several chapters we will examine aspects of parameter estimation. Parameter estimation is sometimes synonymous with “fitting a model to data” and can also be called data assimilation or model-data fusion. We can approach the parameter estimation problem from several different mathematical areas: calculus (optimization), statistics (likelihood functions), and linear algebra (systems of linear equations). We will explore how we define the “best” fit over several chapters, but let’s first explore how this is done in R using simple linear regression.1 Let’s get started!

Note

This chapter uses the following libraries: ggplot2 and tibble (both part of the tidyverse), as well as the stats library (which is automatically loaded when R starts). See Section 2.3 on how to install packages. Prior to running any code in this section, include library(tidyverse).

8.2 Parameter estimation for global temperature data

Let’s take a look at a specific example. Table 8.1 shows anomalies in average global temperature since 1880, relative to 1951-1980 global temperatures.2 This dataset can be found in the demodelr package with the name global_temperature. To name our variables let \(Y=\mbox{ Year since 1880 }\) and \(T= \mbox{ Temperature anomaly}\).

Table 8.1: First few years of average global temperature anomalies. The anomaly represents the global surface temperature relative to 1951-1980 average temperatures.
year_since_1880        0.00   1.00   2.00   3.00   4.00   5.00   6.00
temperature_anomaly   -0.17  -0.08  -0.11  -0.18  -0.29  -0.33  -0.31

We will be working with these data to fit a function \(f(Y,\vec{\alpha})=T\). In order to fit a function in R we need three essential elements, distilled into a workflow of: Identify \(\rightarrow\) Construct \(\rightarrow\) Compute

  • Identify the data we will use to estimate parameters. For this example we will use the tibble (or data frame) global_temperature.
  • Construct the regression formula we will use for the fit. We want a linear regression of the form \(T = a+bY\), which we represent in R as temperature_anomaly ~ 1 + year_since_1880. Notice that the regression formula must use named columns from your data. Said differently, this formula defines a linear regression with two terms: a constant and one proportional to the predictor variable. It is helpful to assign the regression formula to a variable: regression_formula <- temperature_anomaly ~ 1 + year_since_1880. In Section 8.3 we will discuss other types of regression formulas.
  • Compute the regression with the command lm (which stands for linear model).

That’s it! So if we need to do a linear regression of global temperature against year since 1880, it can be done with the following code:

regression_formula <- temperature_anomaly ~ 1 + year_since_1880

linear_fit <- lm(regression_formula, data = global_temperature)

summary(linear_fit)

Call:
lm(formula = regression_formula, data = global_temperature)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35606 -0.13185 -0.03044  0.12449  0.45518 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.4880971  0.0296270  -16.48   <2e-16 ***
year_since_1880  0.0076685  0.0003633   21.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1775 on 140 degrees of freedom
Multiple R-squared:  0.7609,    Adjusted R-squared:  0.7592 
F-statistic: 445.6 on 1 and 140 DF,  p-value: < 2.2e-16

What is printed on the console (and shown above) is the summary of the fit results. This summary contains several interesting things that you would study in advanced courses in statistics, but here is what we will focus on:

  • The estimated coefficients of the linear regression (the block starting with Coefficients: above). The column Estimate lists the constants in front of the terms of our regression formula \(y=a+bx\), and the column Std. Error gives the statistical error for each estimate. The remaining columns concern statistical tests of the significance of the estimated parameters.
  • One helpful thing to look at is the Residual standard error (starting with Residual standard error above), which summarizes the overall size of the differences between the model-predicted values of \(\vec{y}\) and the measured values of \(\vec{y}\). The goal of linear regression is to minimize this model-data difference.
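As an aside, if you just want the numbers in the Estimate column, the function coef extracts them from a fitted model as a named vector:

# Extract only the estimated regression coefficients
coef(linear_fit)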

The summary of the statistical fit is a verbose readout; while coef helps with the coefficients, the summary does not make it easy to plot the fitted results. Thankfully R has a function called predict which creates a vector of the fitted values.3

First we use the function predict to create a vector of the predicted global temperatures from our model coefficients (called fit_data). After that we add that column to our data table using the function cbind, which stands for “column bind”. The following code shows how to do this on your own:

fit_data <- predict(linear_fit, newdata = global_temperature)
global_temperature_model <- cbind(global_temperature, .fitted = fit_data)


glimpse(global_temperature_model)
Rows: 142
Columns: 3
$ year_since_1880     <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
$ temperature_anomaly <dbl> -0.17, -0.08, -0.11, -0.18, -0.29, -0.33, -0.31, -…
$ .fitted             <dbl> -0.4880971, -0.4804286, -0.4727600, -0.4650915, -0…

I like appending _model to the original name of the data frame to signify that we are working with modeled components. There is a lot to unpack with this new data frame, but the important columns are year_since_1880 (the independent variable) and .fitted, which contains the fitted model values.
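As footnote 3 mentions, the broom package (if you have it installed) can do the predict and cbind steps at once; a one-line alternative sketch:

# broom::augment returns a tibble with the original data plus a .fitted column
broom::augment(linear_fit)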

Finally, Figure 8.1 compares the data to the fitted regression line (also known as the “best fit line”).

ggplot(data = global_temperature) +
  geom_point(aes(x = year_since_1880, y = temperature_anomaly),
    color = "red",
    size = 2
  ) +
  geom_line(
    data = global_temperature_model,
    aes(x = year_since_1880, y = .fitted)
  ) +
  labs(
    x = "Year Since 1880",
    y =  "Temperature anomaly"
  )
Figure 8.1: Global temperature anomaly data along with the fitted linear regression line.

A word of caution: this example is an exploratory data analysis illustrating parameter estimation with R. Global temperature is a measurement of many different types of complex phenomena integrated from the local, regional, continental, and global levels (and interacting in both directions). Global temperature anomalies cannot be distilled down to a simple linear relationship between time and temperature. What Figure 8.1 does suggest, though, is that the average global temperature has increased over time. I encourage you to study the pages at NOAA to learn more about the scientific consensus in modeling climate (and the associated complexities - it is a fascinating scientific problem that will need YOUR help to solve!).

8.3 Moving beyond linear models for parameter estimation

We can also estimate parameters for polynomial models such as \(y = a + bx + cx^{2} + dx^{3} + \ldots\) (here the estimated parameters are \(a\), \(b\), \(c\), \(d\), …). There is a key distinction here: the regression formula is nonlinear in the predictor variable \(x\), but linear with respect to the parameters. Incorporating these models in R modifies the structure of the regression formula. A few templates are shown in Table 8.2, with a worked example after the table:

Table 8.2: Comparison of model equations to regression formulas used for R. The variable \(y\) is the response variable and \(x\) the predictor variable. Notice the structure I(..) is needed for R to signify a factor of the form \(x^{n}\).
Equation                        Regression Formula
\(y=a+bx\)                      y ~ 1 + x
\(y=a\)                         y ~ 1
\(y=bx\)                        y ~ -1 + x
\(y=a+bx+cx^{2}\)               y ~ 1 + x + I(x^2)
\(y=a+bx+cx^{2}+dx^{3}\)        y ~ 1 + x + I(x^2) + I(x^3)
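For instance, a quadratic fit to the global temperature data from Section 8.2 follows the same Identify \(\rightarrow\) Construct \(\rightarrow\) Compute workflow (try this on your own):

# Construct a regression formula that is quadratic in year_since_1880
quadratic_formula <- temperature_anomaly ~ 1 + year_since_1880 + I(year_since_1880^2)

# Compute the quadratic fit and examine the results
quadratic_fit <- lm(quadratic_formula, data = global_temperature)
summary(quadratic_fit)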

8.3.1 Can you linearize your model?

We can estimate parameters for nonlinear models in cases where the function can be transformed mathematically to a linear equation. Here is one example: while the equation \(y=ae^{bx}\) is nonlinear with respect to the parameters, it can be made linear by a logarithmic transformation of the data:

\[ \ln(y) = \ln(ae^{bx}) = \ln(a) + \ln (e^{bx}) = \ln(a) + bx \]

The advantage of this approach is that the growth rate parameter \(b\) can be read directly from the fitted slope, and the value of \(a\) is found by exponentiating the fitted intercept. The disadvantage is that you need to log transform the \(y\) variable before doing any fits. A short sketch of this workflow follows.
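Here is a minimal sketch, using made-up data (the tibble exponential_data below is hypothetical, generated from \(y=2e^{0.3x}\) with a little noise):

# Hypothetical data following y = 2*exp(0.3*x), with small multiplicative noise
exponential_data <- tibble(
  x = seq(from = 0, to = 10, by = 1),
  y = 2 * exp(0.3 * x) * exp(rnorm(11, mean = 0, sd = 0.05))
)

# Fit the transformed equation: ln(y) = ln(a) + b*x
log_fit <- lm(log(y) ~ 1 + x, data = exponential_data)

# b is the fitted slope; a is the exponential of the fitted intercept
b_estimate <- coef(log_fit)["x"]
a_estimate <- exp(coef(log_fit)["(Intercept)"])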

Example 8.1 A common nonlinear equation in enzyme kinetics is the Michaelis-Menten law, which states that the rate of the uptake of a substrate \(V\) is given by the equation:

\[ V = \frac{V_{max} s}{s+K_{m}}, \]

where \(s\) is the amount of substrate, \(K_{m}\) is the half-saturation constant, and \(V_{max}\) is the maximum reaction rate. (Typically \(V\) is used to signify the “velocity” of the reaction.)

Suppose you have the following data (from Keener and Sneyd (2009)):

Table 8.3: Substrate (s) and reaction rate data (V) from Keener and Sneyd (2009).
s (mM)       0.1    0.2    0.5    1.0    2.0    3.5    5.0
V (mM / s)   0.04   0.08   0.17   0.24   0.32   0.39   0.42

Apply parameter estimation techniques to estimate \(K_{m}\) and \(V_{max}\) and plot the resulting fitting curve with the data.

Solution 8.1. The first thing that we will need to do is to define a data frame (tibble) for these data:

enzyme_data <- tibble(
  s = c(0.1, 0.2, 0.5, 1.0, 2.0, 3.5, 5.0),
  V = c(0.04, 0.08, 0.17, 0.24, 0.32, 0.39, 0.42)
)

Figure 8.2 shows a plot of \(s\) and \(V\):

Figure 8.2: Scatterplot of enzyme substrate data from Example 8.1.

Figure 8.2 definitely suggests a nonlinear relationship between \(s\) and \(V\). To dig a little deeper, try running the following code that plots the reciprocal of \(s\) and the reciprocal of \(V\) (do this on your own):

ggplot(data = enzyme_data) +
  geom_point(aes(x = 1 / s, y = 1 / V),
    color = "red",
    size = 2
  ) +
  labs(
    x = "1/s (1/mM)",
    y = "1/V (s / mM)"
  )

Notice how easy it was to plot the reciprocals of \(s\) and \(V\) inside the ggplot command. The reciprocal plot looks linear - here’s why, using the provided equation (and a little bit of algebra):

\[ V = \frac{V_{max} s}{s+K_{m}} \rightarrow \frac{1}{V} = \frac{s+K_{m}}{V_{max} s} = \frac{1}{V_{max}} + \frac{1}{s} \frac{K_{m}}{V_{max}} \]

In order to do a linear fit to the transformed data we will use the regression formulas defined above and the handy structure I(VARIABLE) and plot the transformed data with the fitted equation (do this on your own as well):4

# Define the regression formula
enzyme_formula <- I(1 / V) ~ 1 + I(1 / s)

# Apply the linear fit
enzyme_fit <- lm(enzyme_formula, data = enzyme_data)

# Show best fit parameters
summary(enzyme_fit)

# Add fitted data to the model
fit_data <- predict(enzyme_fit, newdata = enzyme_data)
enzyme_data_model <- cbind(enzyme_data, .fitted = fit_data)


# Compare fitted model to the data
ggplot(data = enzyme_data) +
  geom_point(aes(x = 1 / s, y = 1 / V),
    color = "red",
    size = 2
  ) +
  geom_line(
    data = enzyme_data_model,
    aes(x = 1 / s, y = .fitted)
  ) +
  labs(
    x = "1/s (1/mM)",
    y = "1/V (s / mM)"
  )

In Exercise 8.6 you will use the coefficients from your linear fit to determine \(V_{max}\) and \(K_{m}\). When plotting the fitted model values with the original data (Figure 8.3), we need to take the reciprocal of the column .fitted, because the response variable in the linear model is \(1 / V\) (confusing, I know!). For convenience, the code that does all the fitting is shown below:5

# Define the regression formula
enzyme_formula <- I(1 / V) ~ 1 + I(1 / s)

# Apply the linear fit
enzyme_fit <- lm(enzyme_formula, data = enzyme_data)

# Add fitted data to the model
fit_data <- predict(enzyme_fit, newdata = enzyme_data)
enzyme_data_model <- cbind(enzyme_data, .fitted = fit_data)

# Compare fitted model to the data
ggplot(data = enzyme_data) +
  geom_point(aes(x = s, y = V),
    color = "red",
    size = 2
  ) +
  geom_line(
    data = enzyme_data_model,
    aes(x = s, y = 1 / .fitted)
  ) +
  labs(
    x = "s (mM)",
    y = "V (mM / s)"
  )
Figure 8.3: Scatterplot of enzyme substrate data from Example 8.1 along with the fitted curve.

8.3.2 Weighted linear regression

Let’s return to Figure 8.3. The Lineweaver-Burk plot involves the reciprocals of the two variables \(s\) and \(V\). As \(s\) and \(V\) both increase, \(1/s\) and \(1/V\) both decrease, which may have the unintended consequence of biasing the fit at larger values of \(s\) and \(V\). This can be problematic because much of the biochemical activity occurs when \(s\) and \(V\) are low (Figure 8.2).

Another approach in linear regression is to assign weights to individual data points, adjusting the influence that a given data point has on the best fit line. This approach is called weighted linear regression, or weighted least squares. To account for this in R, the function lm has an additional option called weights:

# Define the regression formula
enzyme_formula <- I(1 / V) ~ 1 + I(1 / s)

# Apply the linear fit
enzyme_fit_weighted <- lm(enzyme_formula, 
                 data = enzyme_data,
                 weights = enzyme_data$s)

summary(enzyme_fit_weighted)

In Exercise 8.7 you will investigate what happens when the weights equal \(\sqrt{s}\). While the choice of weights may seem arbitrary, in Chapter 10 we explore the concept of cost functions, which define the model-data residual. As we will see, accounting for uncertainties in individual data points is an example of weighted linear regression.
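For reference, weighted least squares chooses the parameters to minimize

\[ \sum_{i} w_{i} \left( y_{i} - f(x_{i}, \vec{\alpha}) \right)^{2}, \]

so that setting \(w_{i} = 1/\sigma_{i}^{2}\) (larger weights for more precisely measured points) connects the choice of weights back to the measurement error \(\vec{\sigma}\) from Section 8.1. (This is the standard formulation; Chapter 10 develops such cost functions in detail.)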

8.4 Parameter estimation with nonlinear models

In many cases you will not be able to write your model in a linear form by applying functional transformations. Here’s the good news: you can still do a nonlinear curve fit using the function nls, which works like the command lm but requires some additional information. Let’s return to an example from Chapter 2 where we examined the weight of a dog named Wilson (Figure 2.2). One model for the dog’s weight \(W\) as a function of the days since birth \(D\) is a saturating exponential equation (Equation 8.1):

\[ W = f(D,a,b,c) = a - be^{-cD}, \tag{8.1}\]

where we have the parameters \(a\), \(b\), and \(c\). We can apply the command nls (nonlinear least squares) to estimate these parameters. Because nls is an iterative numerical method, it needs a starting value for each parameter (here we set \(a = 75\), \(b = 30\), and \(c = 0.01\)). Determining a starting point can be tricky - it does take some trial and error, although the model structure can guide you (for example, \(a\) should be near the dog’s adult weight, since \(W \to a\) as \(D\) grows).

# Define the regression formula
wilson_formula <- weight ~ a - b * exp(-c * days)

# Apply the nonlinear fit
nonlinear_fit <- nls(formula = wilson_formula,
  data = wilson,
  start = list(a = 75, b = 30, c = 0.01)
)

# Summarize the fit parameters
summary(nonlinear_fit)

Formula: weight ~ a - b * exp(-c * days)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a 7.523e+01  1.573e+00   47.84  < 2e-16 ***
b 9.077e+01  3.400e+00   26.70 1.07e-14 ***
c 6.031e-03  4.324e-04   13.95 2.26e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.961 on 16 degrees of freedom

Number of iterations to convergence: 11 
Achieved convergence tolerance: 6.889e-06

As before, we can use predict and cbind to add the fitted values when displaying the fitted curve.

Alternatively, once you have your fitted model, you can extract its coefficients and compute the fitted values yourself (try this out on your own).
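Here is a minimal sketch of the coefficient approach (the grid days_smooth and its range are assumptions for illustration):

# Extract the named coefficient vector (a, b, c) from the nls fit
wilson_coefficients <- coef(nonlinear_fit)

# Evaluate Equation 8.1 on a grid of days (range chosen only for illustration)
days_smooth <- seq(from = 0, to = 1500, length.out = 200)
weight_smooth <- wilson_coefficients["a"] -
  wilson_coefficients["b"] * exp(-wilson_coefficients["c"] * days_smooth)

The predict approach, shown below, accomplishes the same thing without writing out Equation 8.1 by hand: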

# Add fitted data to the model
fit_data <- predict(nonlinear_fit, newdata = wilson)
wilson_model <- cbind(wilson, .fitted = fit_data)



# Plot the data with the model
ggplot(data = wilson) +
  geom_point(aes(x = days, y = weight),
    color = "red",
    size = 2
  ) +
  geom_line(
    data = wilson_model,
    aes(x = days, y = .fitted)
  ) +
  labs(
    x = "Days since birth",
    y = "Weight (pounds)"
  )

8.5 Towards model-data fusion

The R language (and associated packages) has many excellent tools for parameter estimation and comparing fitted models to data. These tools are handy for first steps in parameter estimation.

More broadly, the technique of estimating model parameters from data can also be called data assimilation or model-data fusion. Whatever terminology you happen to use, you get the best of both worlds: combining observed measurements with what you expect should happen, given your understanding of the system at hand.

We are going to dig into data assimilation even more - and one key tool is understanding likelihood functions, which we will study in the next chapter.

8.6 Exercises

Exercise 8.1 Determine if the following equations are linear with respect to the parameters. Assume that \(y\) is the response variable and \(x\) the predictor variable.

  1. \(y=a + bx+cx^{2}+dx^{3}\)
  2. \(y=a \sin (x) + b \cos (x)\)
  3. \(y = a \sin(bx) + c \cos(dx)\)
  4. \(y = a + bx + a\cdot b x^{2}\)
  5. \(y = a e^{-x} + b e^{x}\)
  6. \(y = a e^{-bx} + c e^{-dx}\)

Exercise 8.2 Each of the following equations can be written as linear with respect to the parameters, through applying some elementary transformations to the data. Write each equation as a linear function with respect to the parameters. Assume that \(y\) is the response variable and \(x\) the predictor variable.

  1. \(y=ae^{-bx}\)
  2. \(y=(a+bx)^{2}\)
  3. \(\displaystyle y = \frac{1}{a+bx}\)
  4. \(y = c x^{n}\)

Exercise 8.3 Use the dataset global_temperature and the function lm to answer the following questions:

  1. Complete the following table, which represents various regression fits to global temperature anomaly \(T\) (in degrees Celsius) and years since 1880 (denoted by \(Y\)). In the table Coefficients represent the values of the parameters \(a\), \(b\), \(c\), etc. from your fitted equation; P = number of parameters; RSE = Residual standard error.
Equation Coefficients P RSE
\(T=a+bY\)
\(T=a+bY+cY^{2}\)
\(T=a+bY+cY^{2}+dY^{3}\)
\(T=a+bY+cY^{2}+dY^{3}+eY^{4}\)
\(T=a+bY+cY^{2}+dY^{3}+eY^{4}+fY^{5}\)
\(T=a+bY+cY^{2}+dY^{3}+eY^{4}+fY^{5}+gY^{6}\)
  2. After making this table, choose the polynomial that you believe fits the data best. Provide reasoning and explanation for why you chose the polynomial that you did.
  3. Finally, show the plot of your selected polynomial with the data.

Exercise 8.4 An equation that relates a consumer’s nutrient content (denoted as \(y\)) to the nutrient content of food (denoted as \(x\)) is given by: \(\displaystyle y = c x^{1/\theta},\) where \(\theta \geq 1\) and \(c>0\) are both constants.

  1. Use the dataset phosphorous to make a scatterplot with algae as the predictor (independent) variable and daphnia the response (dependent) variable.
  2. Show that you can linearize the equation \(\displaystyle y = c x^{1/\theta}\) with logarithms.
  3. Determine a linear regression fit for your new linear equation.
  4. Determine the value of \(c\) and \(\theta\) in the original equation with the parameters from the linear fit.

Exercise 8.5 Similar to Exercise 8.4, do a non-linear least squares fit for the dataset phosphorous to the equation \(\displaystyle y = c x^{1/\theta}\). For a starting point, you may use the values of \(c\) and \(\theta\) from Exercise 8.4. Then make a plot of the original phosphorous data with the fitted model results.

Exercise 8.6 Example 8.1 guided you through the process to linearize the following equation:

\[ V = \frac{V_{max} s}{s+K_{m}}, \]

where \(s\) is the amount of substrate, \(K_{m}\) is the half-saturation constant, and \(V_{max}\) is the maximum reaction rate. (Typically \(V\) is used to signify the “velocity” of the reaction.) When doing a fit of the reciprocal of \(s\) with the reciprocal of \(V\), what are the resulting values of \(V_{max}\) and \(K_{m}\)?

Exercise 8.7 Fit a linear model to the enzyme data (the data table enzyme_data) in Example 8.1 using a linear model with \(1/s\) and \(1/V\) in two ways:

  1. A regular linear model (lm)
  2. A linear model (lm) using the option weights=sqrt(enzyme_data$s) (\(\sqrt{s}\))

Generate a plot of both fits to the data and compare the parameter estimates for \(K_{m}\) and \(V_{max}\) in both cases. How does weighting the data affect the resulting fits?

Exercise 8.8 Following from Example 8.1 and Exercise 8.6, apply the command nls to conduct a nonlinear least squares fit of the enzyme data to the equation:

\[ V = \frac{V_{max} s}{s+K_{m}}, \]

where \(s\) is the amount of substrate, \(K_{m}\) is the half-saturation constant, and \(V_{max}\) the maximum reaction rate. As starting points for the nonlinear least squares fit, you may use the values of \(K_{m}\) and \(V_{max}\) that were determined from Example 8.1. Then make a plot of the actual data with the fitted model curve.

Exercise 8.9 Consider the following data which represent the temperature over the course of a day:

Hour Temperature
0 54
1 53
2 55
3 54
4 58
5 58
6 61
7 63
8 67
9 66
10 67
11 69
12 68
13 68
14 66
15 67
16 63
17 60
18 59
19 57
20 56
21 53
22 52
23 54
24 53
  1. Make a scatterplot of these data, with the hour on the horizontal axis.
  2. A function that describes these data is \(\displaystyle T = A + B \sin \left( \frac{\pi}{12} \cdot H \right) + C \cos \left( \frac{\pi}{12} \cdot H \right)\), where \(H\) is the hour and \(T\) is the temperature. Explain why this equation is linear for the parameters \(A\), \(B\), and \(C\).
  3. Define a tibble that includes the variables \(T\), \(\displaystyle \sin \left( \frac{\pi}{12} \cdot H \right)\), \(\displaystyle \cos \left( \frac{\pi}{12} \cdot H \right)\).
  4. Do a linear fit on your new data frame to report the values of \(A\), \(B\), and \(C\).
  5. Define a new tibble that has a sequence in \(H\) from 0 to 24 with at least 100 data points, along with a column of values of \(T\) (T_fitted) computed using your coefficients \(A\), \(B\), and \(C\).
  6. Add your fitted curve to the scatterplot. How do your fitted values compare to the data?

Exercise 8.10 Use the data from Exercise 8.9 to conduct a nonlinear fit (use the function nls) to the equation \(\displaystyle T = A + B \sin \left( \frac{\pi}{12} \cdot H \right) + C \cos \left( \frac{\pi}{12} \cdot H \right)\). Good starting points are \(A=50\), \(B=1\), and \(C=-10\).

Keener, James, and James Sneyd, eds. 2009. Mathematical Physiology. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-75847-3.

  1. We will use R a lot in this chapter to make plots - so please visit Chapter 2 if you need some reminders on plotting in R.↩︎

  2. Data provided by NASA: https://climate.nasa.gov/vital-signs/global-temperature/.↩︎

  3. An R package called broom also takes fitted model output in a tidy data format with the function augment. You can read more about broom from its documentation.↩︎

  4. The process outlined here forms a Lineweaver-Burk plot.↩︎

  5. You may notice that in Figure 8.3 the fitted curve seems to look like a piecewise linear function. This is mainly due to the distribution of data - if you have smaller gaps between measurements, the fitted curve looks smoother.↩︎