11.4 Exercises

Exercise 11.1 Histograms are an important visualization tool in descriptive statistics. Read the following essays on histograms, and then summarize two or three important points that you learned from reading them.

 

Exercise 11.2 Average October snow cover over Eurasia from 1970 to 1979 (in million km\(^{2}\)) was reported as the following:

\[\begin{equation*} \{6.5, 12.0, 14.9, 10.0, 10.7, 7.9, 21.9, 12.5, 14.5, 9.2\} \end{equation*}\]

  1. Create a histogram for these data.
  2. Compute the sample mean and median of this dataset.
  3. What would you report as a representative or typical value of snow cover for October? Why?
  4. The 21.9 measurement looks like an outlier. What is the sample mean excluding that measurement?
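One way to begin this exercise in R (a sketch, not the only approach) is to store the observations in a vector and use the built-in `hist`, `mean`, and `median` functions:

```r
# Average October snow cover over Eurasia, 1970-1979 (million km^2)
snow <- c(6.5, 12.0, 14.9, 10.0, 10.7, 7.9, 21.9, 12.5, 14.5, 9.2)

hist(snow,
     main = "October snow cover, Eurasia (1970-1979)",
     xlab = "Snow cover (million km^2)")

mean(snow)                 # sample mean: 12.01
median(snow)               # sample median: 11.35
mean(snow[snow != 21.9])   # sample mean excluding the apparent outlier
```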

 

Exercise 11.3 Consider the equation \(\displaystyle S(\theta)=(3-1.5^{1/\theta})^{2}\). This function is an idealized example for the cost function in Figure 11.1.

  1. What is \(S'(\theta)\)?
  2. Make a plot of \(S'(\theta)\). What are the locations of the critical points?
  3. Algebraically solve \(S'(\theta)=0\). Does your computed critical point match up with the graph?
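If you want to check your algebra graphically, one possibility is to approximate \(S'(\theta)\) numerically with a centered difference (a sketch; the step size `h` is an arbitrary choice):

```r
S <- function(theta) (3 - 1.5^(1 / theta))^2

# Centered-difference approximation to S'(theta)
Sprime <- function(theta, h = 1e-6) (S(theta + h) - S(theta - h)) / (2 * h)

# Plot S'(theta) and look for where it crosses zero
curve(Sprime(x), from = 0.1, to = 3, xlab = "theta", ylab = "S'(theta)")
abline(h = 0, lty = 2)
```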
 
Exercise 11.4 Repeat the bootstrap sampling for the precipitation dataset, first with 1000 and then with 10,000 bootstrap samples. Report the median and confidence intervals for the mean and the standard deviation of \(R\). What do you notice as the number of bootstrap samples increases?
 
Exercise 11.5 Using the data in Exercise 11.2, do a bootstrap sample with \(N=1000\) to compute a bootstrap estimate for the mean and the 95% confidence interval for October snow cover in Eurasia.
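A minimal sketch of such a bootstrap in base R (the call to `set.seed` is optional and only makes the resampling reproducible):

```r
snow <- c(6.5, 12.0, 14.9, 10.0, 10.7, 7.9, 21.9, 12.5, 14.5, 9.2)

set.seed(42)   # optional: reproducible resampling
N <- 1000

# Each bootstrap sample resamples the data with replacement and records its mean
boot_means <- replicate(N, mean(sample(snow, replace = TRUE)))

median(boot_means)                             # bootstrap estimate of the mean
quantile(boot_means, probs = c(0.025, 0.975))  # 95% confidence interval
```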

 

Exercise 11.6 We computed the 95% confidence interval using the quantile command. An alternative approach to summarize a distribution is with the summary command. Here is the output for the summary command for a dataframe:

knitr::include_graphics("figures/11-bootstrap/summary-output-11.png")

We call this command using summary(data_frame), where data_frame is the particular dataframe you want to summarize. The output reports the minimum, maximum, mean, and median of a dataset. The entries 1st Qu. and 3rd Qu. are the 25th and 75th percentiles.
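For example, applying summary to a small one-column dataframe reports all six of these values at once:

```r
df <- data.frame(x = c(1, 2, 3, 4, 5))
summary(df)
# Min.   :1
# 1st Qu.:2
# Median :3
# Mean   :3
# 3rd Qu.:4
# Max.   :5
```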

Do 1000 bootstrap samples using the data in Exercise 11.2 and report the output from the summary command.

 

Exercise 11.7 The dataset snowfall lists the snowfall data from a snowstorm that came through the Twin Cities on April 14, 2018.

  1. Make an appropriately sized histogram for the snowfall observations.
  2. What is the mean snowfall?
  3. Do a bootstrap estimate with \(N=100\) and \(N=1000\) and plot their respective histograms.
  4. For each of your bootstrap samples (\(N=100\) and \(N=1000\)), compute the mean and 95% confidence interval for the bootstrap distribution.
  5. What would you report for the mean and 95% confidence interval for this snowstorm?
 

Exercise 11.8 This question tackles the dataset global_temperature to determine plausible models for a relationship between time and average global temperature. For this exercise we are going to look at the variability in bootstrap estimates for models up to fourth degree.

Using the function bootstrap_model, generate a bootstrap sample of \(n=1000\) for each of the following functions.

  • Linear: \(T=a+bY\)
  • Quadratic: \(T=a+bY+cY^{2}\)
  • Cubic: \(T=a+bY+cY^{2}+dY^{3}\)
  • Quartic: \(T=a+bY+cY^{2}+dY^{3}+eY^{4}\)

Your solution should include graphs of the data with the bootstrap predictions and the prediction from the linear regression model. How does the variability in the parameters (\(a,b,c,d,e\)) change as more terms are added to the model? How does the variability in the bootstrap predictions change as more terms are added to the model?
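To see the idea underlying bootstrap_model, here is a rough sketch of a single bootstrap replication of the quadratic model using lm. The column names year and temperature are placeholders; substitute the actual column names in global_temperature:

```r
# One bootstrap replication of the quadratic fit T = a + b*Y + c*Y^2.
# 'year' and 'temperature' are placeholder column names.
boot_quadratic <- function(data) {
  resampled <- data[sample(nrow(data), replace = TRUE), ]
  fit <- lm(temperature ~ year + I(year^2), data = resampled)
  coef(fit)   # the bootstrap estimates of a, b, c
}

# Repeating this n = 1000 times gives a distribution for each parameter:
# coefs <- replicate(1000, boot_quadratic(global_temperature))
# apply(coefs, 1, quantile, probs = c(0.025, 0.5, 0.975))
```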

 

Exercise 11.9 Similar to the problems we have worked with before, the equation that relates a consumer’s nutrient content (denoted as \(y\)) to the nutrient content of food (denoted as \(x\)) is given by: \(\displaystyle y = c x^{1/\theta}\), where \(\theta \geq 1\) and \(c\) are constants. We will be using the dataset phosphorous.

  1. Do 1000 bootstrap samples for this dataset.
  2. To find \(c\) and \(\theta\) we can apply logarithms to express this equation as a linear equation (see Exercise \@ref(exr:log-linear-08)). Do a linear model fit for this log-transformed equation.
  3. Generate histograms for bootstrap-fitted parameters for your log-transformed equation.
  4. What are the median and 95% confidence intervals for the bootstrap-fitted parameters?
  5. Using the function bootstrap_model, generate a bootstrap sample of \(n=1000\) for the linear (log transformed) equation.
  6. Translate these bootstrap confidence intervals of your fitted slope and intercept back into the values of \(c\) and \(\theta\).
  7. These confidence intervals seem pretty large. What would be some strategies we could employ to narrow these confidence intervals?
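As a hint for the back-transformation, taking logarithms of \(y = cx^{1/\theta}\) gives \(\ln y = \ln c + \frac{1}{\theta}\ln x\), so the fitted intercept estimates \(\ln c\) and the fitted slope estimates \(1/\theta\). A sketch of the translation in R (the column names x and y are placeholders for the actual columns of phosphorous):

```r
# Fit the log-transformed linear model; x and y are placeholder column names
fit <- lm(log(y) ~ log(x), data = phosphorous)

b0 <- coef(fit)[1]   # intercept, which estimates log(c)
b1 <- coef(fit)[2]   # slope, which estimates 1/theta

c_est     <- exp(b0)  # back-transform: c = e^(intercept)
theta_est <- 1 / b1   # back-transform: theta = 1/slope
```

The same back-transformation applies to each endpoint of the bootstrap confidence intervals for the slope and intercept.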