Introduction to R

Statistical Genetics Lab
Department of Genetics
Luiz de Queiroz College of Agriculture
University of São Paulo

2022-11-25


R is a language and environment for statistical computing and graphics. To download R, please visit the Comprehensive R Archive Network (CRAN). You do not need to be an expert in R to build your linkage map using OneMap.

Although we prefer and recommend the Linux version, in this tutorial it is assumed that the user is running Windows. Users of R under Linux or Mac OS should have no difficulty following this tutorial.

We recommend that new users, instead of using plain R, use it through the fantastic software RStudio. With RStudio, there is no noticeable difference between operating systems.

As advertised on the website, RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging, and workspace management. In other words, it offers a number of facilities for your convenience that will make your life easier, especially if you have never used R before.

So, go ahead and download and install R and RStudio. When you open RStudio, the window on the left (the console) is where you type the R commands you want.

Getting started

In the left window, you can see a greater-than sign (">"), which means that R is waiting for a command. We call this a prompt.

Let us start with a simple example of adding two numbers. Type 2 + 3 at the prompt, then press the Enter key. You will see the result directly on the screen.

2 + 3
#> [1] 5

You can store this result in a variable for future use, using the assignment operator <- (a less-than sign followed by a minus sign, typed together):

x <- 2 + 3

The result of the calculation was stored into the variable x. You can access this result by typing x at the prompt:

x
#> [1] 5

You can also use the variable x in other calculations. For example:

x + 4
#> [1] 9

So, play a little just to start understanding what is going on.

Functions

Another fundamental aspect in R is the usage of functions. A function is a predefined routine used to do specific calculations. For example, to calculate the natural logarithm of \(6.7\), we can use the function log:

log(6.7)
#> [1] 1.902108

The function log contains a group of internal procedures to calculate the natural logarithm of a positive real number. The input values of a function are called arguments.

In the previous example, we provided only one argument (\(6.7\)) to the function. Sometimes a function has more than one argument. For example, to obtain the logarithm of \(6.7\) to base \(4\), you can use:

log(6.7, base = 4)
#> [1] 1.372081
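
As a quick check of what the base argument does, the sketch below compares the call above with the change-of-base identity. The variable names direct and by_hand are only illustrative.

```r
# Logarithm of 6.7 to base 4, computed two ways
direct  <- log(6.7, base = 4)
by_hand <- log(6.7) / log(4)   # change of base: log_b(x) = ln(x) / ln(b)

# all.equal() compares numbers up to a small numerical tolerance
all.equal(direct, by_hand)
#> [1] TRUE

# Shortcut functions exist for two common bases
log10(1000)
#> [1] 3
log2(8)
#> [1] 3
```

Arguments can be given by position (the 6.7) or by name (base = 4). Naming them makes the call easier to read.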

It is possible to calculate the natural logarithm of a set of numbers by defining a vector and using it as the first argument of the function log. To do so we use the function c(), which combines a set of values into a vector. Thus, to calculate the logarithm of the numbers 6.7, 3.2, 5.4, 8.1, 4.9, 9.7, and 2.5, one can use:

y <- c(6.7, 3.2, 5.4, 8.1, 4.9, 9.7, 2.5)
log(y)
#> [1] 1.9021075 1.1631508 1.6863990 2.0918641 1.5892352 2.2721259 0.9162907

Notice that y is a vector, which is passed as the argument to the function log(). The function returns one logarithm for each element of y.
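
To make the element-by-element behavior concrete, the following sketch compares the single call log(y) with an explicit loop over the same vector. The loop is shown only for comparison; the names log_all and log_loop are illustrative.

```r
y <- c(6.7, 3.2, 5.4, 8.1, 4.9, 9.7, 2.5)

# Vectorized call: one call, one result per element of y
log_all <- log(y)

# Equivalent explicit loop over the positions of y
log_loop <- numeric(length(y))
for (i in seq_along(y)) {
  log_loop[i] <- log(y[i])   # y[i] extracts the i-th element
}

# Both approaches give the same values
all.equal(log_all, log_loop)
#> [1] TRUE
length(log_all)
#> [1] 7
```

In practice the vectorized form is preferred, since it is shorter and usually faster.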

Getting help

Every R function has a help page that can be accessed using a question mark before the name of the function. For example, to get help on function log, you would type:

?log

This command opens the help page for the function. In RStudio it appears in the Help pane, while in plain R it may open in the default web browser of your system. The help page contains important information about the function, such as its syntax, its arguments, and some usage examples.

There are many other ways of getting help, of course. For example, from RStudio, click Help on the menu. For internet searches, it is better to start at https://rseek.org/, since the single letter R is too common for general searches to work well.
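
A few base R functions also help from the console. The sketch below uses args(), apropos(), and example(), all of which ship with R. The choice of log and mean as targets is only for illustration.

```r
# Show the argument list of a function without opening its help page
args(log)

# List objects whose names start with "log"
log_names <- apropos("^log")
"log10" %in% log_names
#> [1] TRUE

# Run the examples found at the bottom of a help page
example(mean)

# To search all installed help pages for a keyword, type ??logarithm
```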

Packages

Although R has a huge number of built-in functions, very specific computations (like constructing genetic linkage maps) require extra functionality. This can be added by installing a package. A package is a collection of related functions, help files and example data files that have been bundled together (Adler, 2010).

For example, let us assume that you need to convert a set of recombination fractions into centimorgan distance using the Kosambi mapping function. One possible way to do this is by using basic R to write a function to calculate the distances. Another way is to use the OneMap package. To install it you can type:

setRepositories(ind = 1:2)
install.packages("onemap")

You can also use the RStudio menus. In the bottom-right window, select Packages, then Install, and finally type onemap as the package name (select CRAN as your repository). Yes, it is that easy!

Returning to the console, you need to load OneMap by typing:

library(onemap)

Some Linux users reported the error message below:

ERROR: dependency ‘tkrplot’ is not available for package ‘onemap’

To fix it, in a terminal (outside R), install r-cran-tkrplot:

sudo apt-get install r-cran-tkrplot

To finish our example, let us enter some recombination fractions, for example, 0.01, 0.12, 0.05, 0.11, 0.21, 0.07, and save them into a variable named rf:

rf <- c(0.01, 0.12, 0.05, 0.11, 0.21, 0.07)

Now, let us use OneMap’s function kosambi to do the calculation:

kosambi(rf)
#> [1]  1.000133 12.238706  5.016767 11.182805 22.384601  7.046279

You can also obtain help on the function kosambi using the question mark in the same way as done before:

?kosambi
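
For comparison, the other route mentioned earlier (writing the function yourself in basic R) could look like the sketch below. The name kosambi_cm is hypothetical and is not part of OneMap. It assumes the standard Kosambi mapping function, d = (1/4) ln((1 + 2r) / (1 - 2r)) Morgans, with the result multiplied by 100 to give centimorgans.

```r
# Hypothetical helper (not part of OneMap): Kosambi mapping function.
# Converts recombination fractions (0 <= rf < 0.5) into centimorgans.
kosambi_cm <- function(rf) {
  stopifnot(all(rf >= 0), all(rf < 0.5))
  # (1/4) * ln((1 + 2r) / (1 - 2r)) Morgans, times 100 for centimorgans
  25 * log((1 + 2 * rf) / (1 - 2 * rf))
}

rf <- c(0.01, 0.12, 0.05, 0.11, 0.21, 0.07)
kosambi_cm(rf)
```

The values should agree with those printed by kosambi(rf). Using the packaged function remains preferable, since it is documented and maintained along with the rest of OneMap.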

Importing and exporting data

So far, we have entered the variables in R by typing them directly into the console. However, in real situations, we usually read these values from a file or a database (including files on the internet).

To learn this procedure, copy and paste the following table into a text editor (for example, Notepad) and save it to a file called test.txt in any directory on your computer (such as My Documents).

    x       y
 2.13    4.50
 4.48    1.98
10.95    9.29
10.03   16.25
12.72   27.38
24.63   22.60
22.57   36.87
29.78   31.73
19.54   10.42
 7.86   14.68
11.75    8.68
23.71   37.39

To read this data set into R, you first have to set the working directory. Go to Session, then Set Working Directory, and Choose Directory, pointing to where you saved the file test.txt.
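
The working directory can also be set from the console with the base R functions getwd() and setwd(). In the sketch below, tempdir() is only a stand-in for the folder where you saved test.txt, so replace it with your own path.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# tempdir() stands in for the folder that contains test.txt
setwd(tempdir())
getwd()

# List the text files R can see in the working directory
list.files(pattern = "\\.txt$")

# Go back to the original directory
setwd(old_dir)
```

On Windows, write paths with forward slashes or doubled backslashes, because a single backslash starts an escape sequence inside an R string.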

Now let us read the file test.txt into R and store it in a variable named dat. To do this, we can use the R function read.table. The first argument is the name of the file; the second one indicates whether the file contains a header, that is, whether the first line of the file contains the names of the variables (which is true for our example):

dat <- read.table(file = "test.txt", header = TRUE)
dat

The second line, with dat, asks R to print the contents of the object dat (i.e., the data itself). Inspecting the object dat, you can see a table with 12 rows and two columns. The names of the columns are x and y. We can access the variables in the columns using the dollar sign followed by the column name:

dat$x
#>  [1]  2.13  4.48 10.95 10.03 12.72 24.63 22.57 29.78 19.54  7.86 11.75 23.71
dat$y 
#>  [1]  4.50  1.98  9.29 16.25 27.38 22.60 36.87 31.73 10.42 14.68  8.68 37.39
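
A few other base R functions are handy for inspecting a data frame. So that the sketch below runs on its own, it rebuilds dat with data.frame() instead of reading test.txt; the values are the same as in the table above.

```r
# Same values as in test.txt, entered directly
dat <- data.frame(
  x = c(2.13, 4.48, 10.95, 10.03, 12.72, 24.63,
        22.57, 29.78, 19.54, 7.86, 11.75, 23.71),
  y = c(4.50, 1.98, 9.29, 16.25, 27.38, 22.60,
        36.87, 31.73, 10.42, 14.68, 8.68, 37.39)
)

dim(dat)          # number of rows and columns
#> [1] 12  2
names(dat)        # column names
#> [1] "x" "y"
head(dat, n = 3)  # first three rows

# Three equivalent ways to extract column x
identical(dat$x, dat[["x"]])
#> [1] TRUE
identical(dat$x, dat[, "x"])
#> [1] TRUE
```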

It is also possible to use the function summary to extract some information about the object dat, or about each one of the columns separately:

summary(dat)
#>        x                y         
#>  Min.   : 2.130   Min.   : 1.980  
#>  1st Qu.: 9.488   1st Qu.: 9.137  
#>  Median :12.235   Median :15.465  
#>  Mean   :15.012   Mean   :18.481  
#>  3rd Qu.:22.855   3rd Qu.:28.468  
#>  Max.   :29.780   Max.   :37.390
summary(dat$x)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   2.130   9.488  12.235  15.012  22.855  29.780
summary(dat$y)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.980   9.137  15.465  18.481  28.468  37.390

The function summary provides some descriptive statistics about the variables in the dataset. If you want to export this information to a file you can use the function write.table:

write.table(x = summary(dat), file = "test_sum.txt", quote = FALSE)

The first argument is the output of the summary function. Note that it is possible to use the output of one function as an argument to another. The second argument is the name of the file in which the summary will be written. Notice that the file is created in the working directory, previously set through the RStudio menus. The third argument removes double quotes from the output file. After running the command, you can look for the file test_sum.txt in the working directory you defined before.
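
The same function can export the data themselves. The sketch below writes a small data frame to a temporary file and reads it back to confirm that nothing was lost. The file name test_copy.txt and the use of tempdir() are assumptions made only so the example runs anywhere.

```r
# A small data frame (first three rows of the example table)
dat <- data.frame(x = c(2.13, 4.48, 10.95), y = c(4.50, 1.98, 9.29))

# Write it to a temporary location, without quotes or row names
out_file <- file.path(tempdir(), "test_copy.txt")
write.table(dat, file = out_file, quote = FALSE, row.names = FALSE)

# Read it back; the header line holds the column names
dat2 <- read.table(out_file, header = TRUE)
all.equal(dat, dat2)
#> [1] TRUE
```

Setting row.names = FALSE keeps the file in the same layout as test.txt, so it can be read again with the same read.table call.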

Classes and methods

In R, every object belongs to a class. This is a simple concept that you must remember. For example, the dat object mentioned above belongs to class data.frame. We can obtain this information using the function class:

class(dat)
#> [1] "data.frame"

When we used the function summary, it automatically recognized the class of the object dat and applied a specific procedure developed for class data.frame, which in this case involves the computation of some descriptive statistics.

Such a class-specific procedure is called a method. Objects of other classes can also be used as arguments to the function summary, and the result will be different!

For example, let us fit a linear (regression) model using column y as the response variable and column x as the explanatory variable. This can be done with the function lm():

ft_mod <- lm(dat$y ~ dat$x)
ft_mod
#> 
#> Call:
#> lm(formula = dat$y ~ dat$x)
#> 
#> Coefficients:
#> (Intercept)        dat$x  
#>       1.803        1.111

This function is used to fit linear models and, by default, prints just a formula and the coefficients of the linear regression. Object ft_mod is of class lm:

class(ft_mod)
#> [1] "lm"
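
The same model can also be written with the data argument of lm(), which keeps the formula short and lets predict() accept new values of x. The sketch below rebuilds dat so that it runs on its own; the object names fit_dollar and fit_data are illustrative.

```r
dat <- data.frame(
  x = c(2.13, 4.48, 10.95, 10.03, 12.72, 24.63,
        22.57, 29.78, 19.54, 7.86, 11.75, 23.71),
  y = c(4.50, 1.98, 9.29, 16.25, 27.38, 22.60,
        36.87, 31.73, 10.42, 14.68, 8.68, 37.39)
)

fit_dollar <- lm(dat$y ~ dat$x)      # form used in the text
fit_data   <- lm(y ~ x, data = dat)  # columns looked up inside dat

# Same estimates; only the coefficient labels differ
all.equal(unname(coef(fit_dollar)), unname(coef(fit_data)))
#> [1] TRUE

# Predicted y for two new values of x
predict(fit_data, newdata = data.frame(x = c(5, 15)))
```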

So, if we use function summary to obtain more information about the fitted model, the result will be:

summary(ft_mod)
#> 
#> Call:
#> lm(formula = dat$y ~ dat$x)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -13.091  -5.144  -1.413   5.421  11.446 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)   1.8026     4.7689   0.378  0.71334   
#> dat$x         1.1110     0.2771   4.009  0.00248 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.075 on 10 degrees of freedom
#> Multiple R-squared:  0.6164, Adjusted R-squared:  0.5781 
#> F-statistic: 16.07 on 1 and 10 DF,  p-value: 0.002482

In this case, function summary recognizes ft_mod as an object of class lm and applies a method that shows information about the fitted model, such as the distribution of the residuals, regression coefficients, t-tests, and the coefficient of determination (\(R^2\)), etc.

Thus, it is possible to use the same function on different classes of objects to obtain different results. This concept is very important in OneMap and you must remember it to use the package. For example, in other vignettes, we will show that depending on the class of the dataset, which can be outcross, f2, backcross, riself and risib, a certain set of procedures will be applied. Not by coincidence, these classes correspond to all types of populations that can be analyzed. The advantage of this approach is that you do not need to change the function to do a specific analysis; it will recognize the object type and will adapt accordingly.
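
To see the mechanism at work on a small scale, the sketch below defines a toy generic function with one method per class. The function describe and its methods are hypothetical and exist only for this illustration; they are not part of R or OneMap.

```r
# Generic function: picks a method based on the class of its argument
describe <- function(obj, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(obj, ...) "an object with no specific method"

# Method for objects of class data.frame
describe.data.frame <- function(obj, ...) {
  paste("a data frame with", nrow(obj), "rows and", ncol(obj), "columns")
}

# Method for objects of class lm
describe.lm <- function(obj, ...) {
  paste("a linear model with", length(coef(obj)), "coefficients")
}

toy <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
describe(toy)
#> [1] "a data frame with 5 rows and 2 columns"
describe(lm(y ~ x, data = toy))
#> [1] "a linear model with 2 coefficients"
describe(3.14)
#> [1] "an object with no specific method"
```

The call is always describe(...), yet the code that runs depends on the class of the object. That is the same behavior described above for summary.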

Saving your work

Finally, you may need to save your work to come back to it in another working session. But before we explain how to do that, let us explain a few other concepts.

You can save your R Script, which is the file that has all R instructions you typed so far. You can later load them and run all instructions again to get the same results. This is easy: just click File, Save As, and choose a directory and a name (usually with the extension .R, such as Example1.R, etc).

A different thing is to save your R Session, with all the objects you created so far (called the R Workspace). This is not the same, because once you load the workspace, all the objects are already available and you do not need to redo everything, i.e., rerun your script. This will save you a lot of time, since some of the analyses required to build linkage maps are time-consuming.

To do so, click Session, then Save Workspace As and choose a directory and name. In your next session, open RStudio and then go to Session, Load Workspace.

Alternatively, you can do that using the R function save.image. For example, if you want to save your analysis in a file named myworkspace.RData, you should use:

save.image("myworkspace.RData")

To load:

load("myworkspace.RData")
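
If only one object needs to be kept, rather than the whole workspace, the base R functions saveRDS() and readRDS() are an alternative. In the sketch below, the file name rf.rds and the use of tempdir() are assumptions made only so the example runs anywhere.

```r
# An object worth keeping between sessions
rf <- c(0.01, 0.12, 0.05, 0.11, 0.21, 0.07)

# Save this single object to a file
rds_file <- file.path(tempdir(), "rf.rds")
saveRDS(rf, file = rds_file)

# Later (possibly in another session), read it back under any name
rf_restored <- readRDS(rds_file)
identical(rf, rf_restored)
#> [1] TRUE
```

Unlike load(), readRDS() returns the object instead of restoring it under its original name, so it cannot silently overwrite something already in your workspace.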

References

Matloff, N. 2011. The Art of R Programming. 1st ed. San Francisco, CA: No Starch Press, Inc. 404 pages.

Adler, J. R. 2009. R in a Nutshell. A Desktop Quick Reference. O’Reilly Media.