2.4 Working with R: variables, data frames, and datasets

2.4.1 Creating variables

The next thing we will want to do is to define variables that are stored locally. This is pretty easy to do:

my_result <- 4 + 9

The symbol <- is the assignment operator (you can also use the equals sign, =, but using the arrow for assignment is good coding practice). Notice that I named the variable my_result. I generally prefer descriptive variable names suited to the context at hand (in other words, x would be an odd choice: too ambiguous). I also used snake case to string together multiple words. In practice you can use snake case, camel case (myResult), or even periods (my.result), although the last may not be preferred practice in the long run. However, you can’t use my-result, because R reads it as the variable my minus the variable result.

Once we have defined a variable, we can compute with it. For example 10*my_result should yield 130. Cool, no?
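
As a quick sketch of computing with a stored variable, the lines below reuse my_result from above; the name ten_times is only an illustrative choice, not something defined elsewhere in this text.

```r
# Store the result of a calculation in a variable
my_result <- 4 + 9

# Use the stored value in a new calculation and keep that result too
ten_times <- 10 * my_result

# Typing a variable name at the console (or calling print) displays its value
print(my_result)
print(ten_times)
```

Storing intermediate results this way means you can change the first line and rerun the rest without retyping any numbers.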

As an example, let’s define a sequence from 0 to 5 with a spacing of 0.05 and store it in a variable called my_sequence. To do this we use the seq command, which requires the starting value, ending value, and step size:

my_sequence <- seq(from = 0, to = 5, by = 0.05)

The format for the function seq is seq(from=start,to=end,by=step_size). The seq command is pretty flexible: alternatively, you can generate a sequence by specifying the starting and ending values along with the number of points. If you want to know more about seq, you can always type ? followed by the command name, which will bring up the help page:

?seq

Once you get more comfortable with syntax in R, you will see that seq(0,5,0.05) gives the same result as seq(from=0,to=5,by=0.05), but naming the arguments makes it easier to understand what your code does.
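
Here is a short sketch comparing the ways of calling seq described in this section. It uses the length.out argument of seq to ask for a number of points instead of a step size; the variable names by_step, by_count, and positional are just illustrative.

```r
# Sequence defined by a step size, with named arguments
by_step <- seq(from = 0, to = 5, by = 0.05)

# Same endpoints, but specifying the number of points instead of the step
by_count <- seq(from = 0, to = 5, length.out = 101)

# Same call as by_step, relying on argument position instead of names
positional <- seq(0, 5, 0.05)

# How many values did we generate?
length(by_step)

# all.equal compares numbers up to a small tolerance,
# which is the safe way to compare decimal values
all.equal(by_step, by_count)
all.equal(by_step, positional)
```

Choosing between by and length.out is mostly a matter of which quantity you know in advance: the spacing or the number of points.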

2.4.2 Data frames

A key structure in R is the data frame, which allows different types of data to be collected together. A data frame is like a spreadsheet where each column is a variable and each row an observation, as shown in Table 2.1.

Table 2.1: A data frame
	mpg	disp
Mazda RX4	21.0	160
Mazda RX4 Wag	21.0	160
Datsun 710	22.8	108
Hornet 4 Drive	21.4	258
Hornet Sportabout	18.7	360

Table 2.1 shows the miles per gallon in one column (the variable mpg) and the engine size in another (the variable disp) for different types of cars. The row names (such as Mazda RX4) just tell you the type of car. Sometimes row names are not shown.
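
The values in Table 2.1 match the first five rows of R’s built-in mtcars dataset. Assuming that is where the table comes from, a minimal sketch of pulling out those rows and columns looks like this (the name car_data is just illustrative):

```r
# mtcars ships with R, so no package needs to be loaded.
# Keep the first five rows and only the mpg and disp columns.
car_data <- mtcars[1:5, c("mpg", "disp")]

# Display the small data frame
print(car_data)

# Row names hold the car models
rownames(car_data)

# The $ operator extracts a single column (variable) as a vector
car_data$mpg
```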

Another data frame may list solutions to a differential equation, like we did with our three infection models in Section 1 (Table 2.2).

Table 2.2: Model solutions
time	model_1	model_2	model_3
0.000000	5.000000	5.0000	5.000000
6.060606	5.996981	669.1571	5.995486
12.121212	7.192755	1222.9000	7.188814
18.181818	8.626962	1684.5848	8.619147
24.242424	10.347145	2069.5158	10.333332

Data frames are an example of tidy data, where each row is an observation and each column a variable (which can be quantitative or categorical). There are several different ways to define a data frame in R. I am going to rely on the approach used by the tidyverse, which calls data frames tibbles. For example, here I define a data frame that computes the quadratic function $y=3x^2-2x$ for $-5 \leq x \leq 2$.

library(tidyverse) # provides tibble(); skip if already loaded

x <- seq(from = -5, to = 2, by = 0.05)
y <- 3 * x^2 - 2 * x

my_data <- tibble(
  x = x,
  y = y
) # Notice I am specifically naming the columns x and y

Notice that the data frame my_data uses the column (variable) names of x and y. You could have also used tibble(x,y), but it is helpful to name the columns in the way that you would like them to be named.
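
To check what a tibble actually contains, a few inspection commands are helpful. The sketch below rebuilds the same quadratic data so it can run on its own, loading only the tibble package (which is part of the tidyverse); the names named and unnamed are illustrative.

```r
library(tibble)

x <- seq(from = -5, to = 2, by = 0.05)
y <- 3 * x^2 - 2 * x

# Columns named explicitly
named <- tibble(x = x, y = y)

# Columns named automatically after the variables passed in
unnamed <- tibble(x, y)

# Both versions end up with the same column names
names(named)
names(unnamed)

# Number of rows (observations) and a peek at the first few
nrow(named)
head(named, 3)
```

Explicit names matter most when a column comes from an expression or when you want a column name that differs from the variable holding the values.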

2.4.3 Reading in datasets

R has a lot of built-in datasets! In fact, to see all the datasets, type data() at the console. This will pop up a new window in RStudio with the names. Take some time exploring them. So cool!

If you want to see the datasets for a specific package (such as demodelr) you type data(package = "demodelr") at the console.
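
If you would rather work with the list of datasets in code than in a pop-up window, the object returned by data() can be stored. This sketch uses the datasets package that ships with R rather than demodelr, so that it runs without installing anything; the names available and dataset_names are illustrative.

```r
# data() returns an object whose results component is a table
# with one row per dataset
available <- data(package = "datasets")

# The "Item" column holds the dataset names
dataset_names <- available$results[, "Item"]

# Look at the first few names and check for a specific dataset
head(dataset_names)
"mtcars" %in% dataset_names

# Built-in datasets can be used directly by name
head(mtcars)
```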

Perhaps what is most important is being able to read in datasets provided to you. Data come in several different formats, but one of the more versatile ones is csv (comma separated values). What you need to do is the following:

1. In the folder where your .Rproj file is located, create a folder called data or datasets.
2. Save the file to your computer, note where it was saved, and move it into that data folder.
3. To read in the file you will use the command read_csv, which has the following structure:

in_data <- read_csv(FILENAME)

The data gets assigned to the variable in_data (you can call this variable whatever you want). For example, I have a csv file of ebola data, which I read in as follows:

ebola <- read_csv("data/ebola.csv")

Notice the quotes around the FILENAME. Pro tip: if you have the data files in the data folder, in RStudio you can type “data” inside the quotes and it may start to autocomplete (you can also press Tab). This is handy.
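
To try read_csv without having a data file on hand, here is a self-contained sketch that writes a tiny csv to a temporary location and reads it back. The column names day and cases and their values are made up for illustration and are not the ebola data. read_csv comes from the readr package, which is part of the tidyverse.

```r
library(readr)

# Create a small csv file in a temporary location (toy values only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("day,cases", "1,3", "2,5", "3,8"), csv_path)

# Read the file in; show_col_types = FALSE hides the column type message
toy_data <- read_csv(csv_path, show_col_types = FALSE)
print(toy_data)

# Before reading your own file, you can check that the path is correct.
# This returns FALSE if there is no such file relative to your project folder.
file.exists("data/ebola.csv")
```

If read_csv reports that a file does not exist, the usual cause is that the path in quotes does not match where the file sits relative to your .Rproj file.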