2.4 Working with R: variables, data frames, and datasets

2.4.1 Creating variables

The next thing we will want to do is to define variables that are stored locally. This is pretty easy to do:

my_result <- 4 + 9

The symbol <- is assignment (you can also use equals (=), but it is good coding practice to use the arrow for assignment). Notice that I named the variable my_result. Generally I prefer descriptive variable names that fit the context at hand. (In other words, x would be an odd choice - too ambiguous.) I also used snake case to string together multiple words. In practice you can use snake case, camel case (myResult), or even dots (my.result), although that last one may not be preferred practice in the long run. However, you can’t use my-result because R reads it as subtraction between the variables my and result.

Once we have defined a variable, we can compute with it. For example 10*my_result should yield 130. Cool, no?
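Here is a minimal sketch of computing with a stored variable. The name doubled_result is just an illustrative choice and is not used elsewhere in the text.

```r
# Define the variable as in the text
my_result <- 4 + 9

# Use it in further calculations
10 * my_result                   # 130

# Store the outcome of a calculation in a new (illustrative) variable
doubled_result <- 2 * my_result
doubled_result                   # 26
```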

As an example, let’s define a sequence from 0 to 5 with a spacing of 0.05 and store it in a variable called my_sequence. To do this we use the seq command, which requires the starting value, ending value, and step size:

my_sequence <- seq(from = 0, to = 5, by = 0.05)

The format for the function seq is seq(from=start,to=end,by=step_size). The seq command is pretty flexible - there are alternative ways to generate a sequence, such as specifying the starting and ending values along with the number of points. If you want to know more about seq you can always use ? followed by the command - that will bring up the help page:

?seq

Once you get more comfortable with syntax in R, you will see that seq(0,5,0.05) gives the same result as seq(from=0,to=5,by=0.05), but naming the arguments makes it easier to understand what your code does.
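One of the alternative ways of calling seq is to give the number of points through the length.out argument in place of the step size. The sketch below compares the two approaches; the variable names by_step and by_count are illustrative only.

```r
# Step size given explicitly
by_step <- seq(from = 0, to = 5, by = 0.05)

# Same endpoints, but specify the number of points instead
by_count <- seq(from = 0, to = 5, length.out = 101)

length(by_step)                # 101
all.equal(by_step, by_count)   # TRUE (equal up to floating point rounding)
```

I used all.equal rather than == here because the two sequences are built with slightly different arithmetic, so tiny rounding differences are possible.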

2.4.2 Data frames

A key structure in R is the data frame, which allows different types of data to be collected together. A data frame is like a spreadsheet where each column is a variable and each row is an observation, as given in Table 2.1.

Table 2.1: A data frame
                     mpg  disp
Mazda RX4           21.0   160
Mazda RX4 Wag       21.0   160
Datsun 710          22.8   108
Hornet 4 Drive      21.4   258
Hornet Sportabout   18.7   360

Table 2.1 shows the miles per gallon in one column (the variable mpg) and the engine size in another (the variable disp) for different types of cars. The row names (such as Mazda RX4) just tell you the type of the car. Sometimes row names are not shown.
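The values in Table 2.1 match the first rows of mtcars, one of R's built-in datasets, so a table like it can be sketched as follows. This assumes mtcars is where the table came from; car_table is an illustrative name.

```r
# Keep two columns and the first five rows of the built-in mtcars data
car_table <- head(mtcars[, c("mpg", "disp")], n = 5)
car_table

# Row names hold the car types; a column is accessed with $
rownames(car_table)
car_table$mpg
```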

Another data frame may list solutions to a differential equation, like we did with our three infection models in Section 1 (Table 2.2).

Table 2.2: Model solutions
     time    model_1    model_2    model_3
 0.000000   5.000000     5.0000   5.000000
 6.060606   5.996981   669.1571   5.995486
12.121212   7.192755  1222.9000   7.188814
18.181818   8.626962  1684.5848   8.619147
24.242424  10.347145  2069.5158  10.333332

Data frames are well suited to tidy data, where each row is an observation and each column is a variable (which can be quantitative or categorical). There are several different ways to define a data frame in R. I am going to rely on the approach utilized by the tidyverse, which calls data frames tibbles. So for example, here I am going to define a data frame that stores values of the quadratic function \(y=3x^2-2x\) for \(-5 \leq x \leq 2\).

library(tidyverse) # Provides tibble()

x <- seq(from = -5, to = 2, by = 0.05)
y <- 3 * x^2 - 2 * x

my_data <- tibble(
  x = x,
  y = y
) # Notice I am specifically defining x and y

Notice that the data frame my_data uses the column (variable) names of x and y. You could have also used tibble(x,y), but it is helpful to name the columns in the way that you would like them to be named.
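Once a tibble exists, a few commands are handy for checking what it contains. The sketch below rebuilds my_data so that it runs on its own; it loads only the tibble package, which library(tidyverse) also loads.

```r
library(tibble)  # tibble() is also available after library(tidyverse)

x <- seq(from = -5, to = 2, by = 0.05)
my_data <- tibble(x = x, y = 3 * x^2 - 2 * x)

nrow(my_data)     # number of rows (one per x value)
names(my_data)    # column names: "x" "y"
head(my_data$y)   # first few values of the y column
my_data           # printing a tibble shows only the first rows
```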

2.4.3 Reading in datasets

R has a lot of built-in datasets! In fact, to see all the datasets, type data() at the console. This will pop up a new window in RStudio with the names. Take some time exploring them. So cool!

If you want to see the datasets for a specific package (such as demodelr) you type data(package = "demodelr") at the console.
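If you would rather work with the list of dataset names in code than read it in a window, here is one possible sketch. It uses the datasets package, which ships with R; dataset_info is an illustrative name.

```r
# data() returns an object whose $results matrix describes each dataset
dataset_info <- data(package = "datasets")$results

# The "Item" column holds dataset names, "Title" a short description
head(dataset_info[, c("Item", "Title")])

# Check whether a particular dataset is available
"mtcars" %in% dataset_info[, "Item"]
```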

Perhaps what is most important is being able to read in datasets provided to you. Data come in several different formats, but one of the more versatile ones is csv (comma separated values). What you need to do is the following:

  • In the folder where your .Rproj file is located, create a folder called data or datasets.
  • Save the file locally on your computer. Take note of where you saved it, and drag the file to your data folder.
  • To read in the file you will use the command read_csv, which has the following structure:
in_data <- read_csv(FILENAME)

The data gets assigned to the variable in_data (you can call this variable whatever you want). For example I have the following csv file of ebola data, which I read in via the following:

ebola <- read_csv("data/ebola.csv")

Notice the quotes around the FILENAME. Pro tip: If you have the data files in the data folder, in RStudio you can type “data” and it may start to autocomplete - this is handy (you can also use tab).
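To try the whole read-in workflow without a data file on hand, here is a self-contained sketch. It writes a tiny made-up table to a temporary csv file and reads it back with read_csv. The columns day and cases and their values are invented purely for illustration (they are not the ebola data), and example_file is a hypothetical name.

```r
library(readr)  # read_csv() and write_csv(); also loaded by library(tidyverse)

# Made-up example data written to a temporary file, standing in for a real csv
example_file <- tempfile(fileext = ".csv")
write_csv(data.frame(day = c(1, 2, 3), cases = c(10, 14, 21)), example_file)

# Check that the path points to a file, then read it into a tibble
file.exists(example_file)
in_data <- read_csv(example_file)

names(in_data)   # "day" "cases"
nrow(in_data)    # 3
```

Checking file.exists first is a quick way to catch a mistyped path, which is the most common reason a read fails.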