---
title: '*RMeDPower2*: helping design and analyses of data from repeated measures experiments'
author:
- name: "Min-Gyoung Shin"
  affiliation: Gladstone Institutes
  email: mingyoung.shin@gladstone.ucsf.edu
- name: Reuben Thomas
  affiliation: Gladstone Institutes
  email: reuben.thomas@gladstone.ucsf.edu
package: RMeDPower2
output:
  html_document:
    toc: true
    toc_float: true
    number_sections: true
  pdf_document:
    latex_engine: xelatex
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{RMeDPower2 Tutorial}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
abstract: "Biomedical research very often involves data generated from repeated measures
  experiments. RMeDPower2 is an R package that provides complete functionality to
  analyse data coming from repeated measures experiments, i.e., where one has repeated
  measures from the same biological/independent units or samples. RMeDPower2 helps
  test the modeling assumptions one makes, identify outlier observations and outlier
  units at different levels of the design, estimate statistical power or perform
  sample-size calculations, estimate parameters of interest, and visualize
  the association being tested. The functionality is limited to testing associations
  of one predictor (continuous or categorical, e.g., disease status or brain pathology)
  along with one additional covariate (e.g., gender status) in the context of hierarchical
  or crossed experimental designs. This vignette illustrates the use of this package
  in multiple contexts relevant to biomedical research. \n\nRMeDPower2 defines the
  experimental design for the data, probability model of the data generating distribution
  and necessary parameters required for sample size calculation using convenient S4
  class objects. It uses these objects in one framework that brings together the functionality
  implemented in multiple R packages - lme4 (implementation of linear mixed effects
  models), influence.ME (identification of outlier units), EnvStats (identification
  of outlier observations), DHARMa (testing of modeling assumptions for non-normal
  distributions), simr (sample-size calculations) and tidyverse (data manipulation
  and visualization).\n"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```

# Getting started
```{r echo=FALSE}
template_dir <- system.file("input_templates/cell_assay_data",
                            package = "RMeDPower2")
```
## Installing `RMeDPower2`
`RMeDPower2` can be installed using the following commands
```{r eval=FALSE}
install.packages("devtools")
library(devtools)
install_github('gladstone-institutes/RMeDPower2', build_vignettes=TRUE, force = TRUE)
```

## Quick tutorial
The workflow can be accomplished in five main steps, as described in the sections below (1.2.1-1.2.5).

### Load libraries and read input data
The first step is to load RMeDPower2 and read-in the input data as a data frame. This is accomplished by the following commands:

```{r results='hide'}
suppressPackageStartupMessages(library(RMeDPower2))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(magrittr))

data <- plate_assay_pilot_data
```

The example data has `experiment` and `line` as experimental columns, `classification` as the condition column, and cell size as the response column.

```{r}
knitr::kable(data[1:5, ])
```


### Specify design, model and power parameters

The following set of commands allows users to specify parameters for experimental design, assumed probability models and distributions, and power simulations. 

1. Use a text editor to modify the template JSON file capturing the design for the data, then read in this file.

```{r}
design <- readDesign(file.path(template_dir,"design_cell_assay.json"))
```

2. Use a text editor to modify the template JSON file capturing the error probability model, then read in this file.

```{r}
model <- readProbabilityModel(file.path(template_dir,"prob_model.json"))
```

3. Use a text editor to modify the template JSON file capturing the power parameters, then read in this file.

```{r}
power_param <- readPowerParams(file.path(template_dir,"power_param.json"))
```


### Diagnose data and model

The next set of commands allows the user to evaluate the data type, examine whether the model is appropriate, and identify outliers.

1. Diagnose the data and model: test the model assumptions and identify outlier observations and outlier biological/independent units.

```{r eval=FALSE}
diagnose_res <- diagnoseDataModel(data, design, model)
data_update <- diagnose_res$Data_updated
```

2. Modify design objects (e.g., remove outliers), model objects (e.g., change probability distribution assumption) if needed based on output of step 1 above.

```{r eval=FALSE}
diagnose_res_update <- diagnoseDataModel(data_update, design_update, model_update)
```

3. Repeat steps 1 and 2 until the model's assumptions are met. For example, users can choose to remove outlier experiments so that the distribution is close to normal.
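The updated objects used in step 2 above could be constructed along the following lines. This is a hypothetical sketch: the outlier batch label `"3"` and the particular slot changes are assumptions for illustration, to be adapted to the actual diagnostics output.

```{r eval=FALSE}
## Hypothetical sketch: drop an outlier experiment flagged in step 1 and
## change the distributional assumption before re-running the diagnostics.
data_update   <- dplyr::filter(data_update, experiment != "3")  # assumed outlier batch label
design_update <- design                   # design left unchanged in this example
model_update  <- model
model_update@error_is_non_normal <- TRUE  # e.g., no longer assume Normal errors
```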


### Statistical power estimation

This section of code will allow the user to run sample-size calculations from the pilot input data. 

```{r eval=FALSE}
power_res <- calculatePower(data, design, model, power_param)
```

### Estimate parameters of interest

Once we have run the models, the user can estimate and visualize parameters of interest in order to investigate the association between response variable and condition variables.

```{r eval=FALSE}
estimate_res <- getEstimatesOfInterest(data, design, model, print_plots = FALSE)
```


## Input data

Users will have to ensure that the input data are organized in a format that RMeDPower2 can read. The data should be structured as a table with rows representing the individual observations or responses, and columns representing not only the corresponding explanatory variables but also the variables that capture the hierarchical or crossed design under which the data were generated. It is important that the columns have names or headers; these column names will be used to define the relevant S4 class objects.

For example, the output below illustrates the input data from a cellular assay where each of the 2,588 observations represents a measurement of the size (`cell_size2`) of a single cell, from a given cell line (`line`) that is either from a control or disease condition (`classification`), measured in a given experimental batch (`experiment`).

```{r echo=FALSE}
suppressPackageStartupMessages(require(dplyr))
suppressPackageStartupMessages(library(RMeDPower2))
template_dir <- system.file("input_templates/cell_assay_data",
                            package = "RMeDPower2")

data <- plate_assay_pilot_data %>%
  select(experiment, line, classification, cell_size2) %>%
  mutate(experiment = as.factor(experiment), line = as.factor(line), classification = as.factor(classification))

str(data)
```


## Three key classes of `RMeDPower2`

Here, users will learn how to set up parameters for experimental designs, probability and distributional assumptions, and power simulations.


### `RMeDesign` S4 class

The underlying design for the data will be stored in an object (R data structure with variables) of this class. The slots of this class are: 

1. `response_column`: character that is the column name in the input data table that captures the responses or individual observations. This has to be set by the user.

2. `covariate`: character that is the column name in the input data table that captures the covariate information. This will be used as a fixed effect in the mixed effects model of the data. This slot is set to `NULL` if there are no covariates.

3. `condition_column`: character that is the column name in the input data table that captures the predictor/independent variable information. This will be used as a fixed effect in the mixed effects model of the data. This has to be set by the user.

4. `experimental_columns`: character vector that represents the column names in the input data table that captures the experimental variables of the design. These will be used as random effects in the mixed effects model of the data. The order of the names in this character vector is important. The first element captures the top hierarchical level of the design, the second element the next level of the hierarchy and so on. The hierarchy is explicitly modeled in the specification of the mixed effects models. The hierarchy will not be modeled for variables specified in the `crossed_columns` slot (see below). This has to be set by the user. 

5. `condition_is_categorical`: logical value equal to TRUE if the variable in the `condition_column` is to be considered as a categorical variable and is equal to FALSE otherwise. Defaults to TRUE.

6. `crossed_columns`: character vector that represents the column names in the input data table that are a subset of the values in `experimental_columns` representing crossed factors. Defaults to `NULL`.

7. `total_column`: character that is the column name in the input table that will be used to offset values in the mixed effects models. Useful for modeling count data. Defaults to `NULL`.
  
8. `outlier_alpha`: numeric value that is the p-value threshold used to identify outlier observations. Defaults to 0.05. 

9. `na_action`: character equal to _complete_ or _unique_. Used when the input data has multiple responses that are each modeled sequentially, for example when a user gathers several observed responses from a single data set. With _complete_, the outlier observations identified for one response are also treated as outliers for all the other responses; with _unique_, the outlier observations identified for one response are used only in the model of that particular response. Defaults to _complete_.

10. `include_interaction`: logical value indicating whether to include a condition * covariate interaction term.

11. `random_slope_variable`: character naming the variable to use for random slopes (typically the value of `condition_column`).

12. `covariate_is_categorical`: logical value specifying whether the covariate is categorical. TRUE: categorical, FALSE: continuous.


To create an object of class RMeDesign, the following code is used:

```{r}
design <- new("RMeDesign")
print(design)
```

In practice, for a given data set the design can be input from a JSON file. The user can use a text editor to modify the `design_template.json` file available with the package to enter the relevant information based on the column names of the input data. For example, the design for the cell-size data referred to above can be read using the function `readDesign`.

```{r echo=FALSE}
template_dir <- system.file("input_templates/cell_assay_data",
                            package = "RMeDPower2")
```
```{r}
design <- readDesign(file.path(template_dir, "design_cell_assay.json"))
print(design)
```
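For orientation, the contents of such a design file might look like the sketch below. This is a hypothetical example: the key names here simply mirror the slot names described above, and the `design_cell_assay.json` template shipped with the package is the authoritative reference.

```json
{
  "response_column": "cell_size2",
  "covariate": null,
  "condition_column": "classification",
  "experimental_columns": ["experiment", "line"],
  "condition_is_categorical": true,
  "crossed_columns": null,
  "total_column": null,
  "outlier_alpha": 0.05,
  "na_action": "complete",
  "include_interaction": false,
  "random_slope_variable": null,
  "covariate_is_categorical": false
}
```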

### `ProbabilityModel` S4 class
The error probability distribution is specified by two slots in objects of the `ProbabilityModel` class.

1. `error_is_non_normal`: logical value that is TRUE if the underlying probability distribution is not assumed to be Normal and FALSE otherwise.

2. `family_p`: This character value specifies the probability distribution. Accepted values are _poisson_(`=poisson(link="log")`), _binomial_(`=binomial(link="logit")`), _binomial_log_(`=binomial(link="log")`), _Gamma_log_(`=Gamma(link="log")`), _Gamma_(`=Gamma(link="inverse")`), _negative_binomial_. Defaults to `NULL`.

```{r}
model <- readProbabilityModel(file.path(template_dir, "prob_model.json"))
print(model)
```
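A probability-model file for Normally distributed errors might look like the following sketch. This is hypothetical: the key names mirror the slot names above, and the packaged `prob_model.json` template is the authoritative reference.

```json
{
  "error_is_non_normal": false,
  "family_p": null
}
```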

### `PowerParams` S4 class

The parameters described below are necessary for sample-size estimation and are stored in the following slots of the `PowerParams` class.

1. `target_columns`: character vector with the column names of the experimental variables for which sample-size estimation is to be performed.
2. `power_curve`: numeric 1 or 0. 1: power simulation over a range of sample sizes or levels; 0: power calculation at a single sample size or level. Defaults to 1.
3. `nsimn`: the number of simulations used to estimate power. Defaults to 10.
4. `levels`: numeric 1 or 0 for each target variable. 1: amplify the number of levels of the corresponding target variable; 0: amplify the number of samples drawn from the corresponding target variable. For example, if `target_columns = c("experiment","cell_line")` and you want to expand the number of experiments and sample more cells from each cell line, set `levels = c(1,0)`.
5. `max_size`: maximum levels or sample sizes to test. If set to `NULL`, defaults to the current level or the current sample size x 5. For example, if `max_size = c(10,5)`, up to 10 experiments and 5 cell lines will be tested.
6. `alpha`: Type I error rate for sample-size estimation. Defaults to 0.05.
7. `breaks`: levels/sample sizes of the variable at which power is evaluated along the power curve. If set to `NULL`, defaults to max(1, round(number of current levels / 5)).
8. `effect_size`: a 3-dimensional numeric vector giving the proposed effect sizes/parameter estimates for the condition column, the covariate, and the interaction term between these two variables, respectively, in that order. Depending on the model, not all elements of `effect_size` make sense; for example, when there is only a condition column, the second and third elements should be set to NA. If you know the effect size of your condition variable, it can be provided here. Defaults to `NULL`, i.e., if the effect size is not provided it will be estimated from your data.
9. `icc`: intra-class correlation coefficients for each of the experimental variables. Used only when `error_is_non_normal=FALSE` and the data do not allow variance-component estimation.

```{r}
power_param <- readPowerParams(file.path(template_dir, "power_param.json"))
print(power_param)
```
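A power-parameter file might look like the following sketch. This is hypothetical: the key names mirror the slot names above, and the packaged `power_param.json` template is the authoritative reference.

```json
{
  "target_columns": ["experiment"],
  "power_curve": 1,
  "nsimn": 10,
  "levels": 1,
  "max_size": null,
  "alpha": 0.05,
  "breaks": null,
  "effect_size": null,
  "icc": null
}
```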

## Three key functions of `RMeDPower2`

The following are functions for users to manipulate data, estimate power, and perform association analysis.

### `diagnoseDataModel(data, design, model)` function

This function tests the model assumptions using quantile-quantile plots, residuals-versus-fitted plots and residuals-versus-predicted plots. If needed, it also transforms the responses where feasible, and identifies outlier observations as well as outlier experimental units.

### `calculatePower(data, design, model, power_param)` function

This function performs sample-size calculations for a given experimental design and target variable.

### `getEstimatesOfInterest(data, design, model)` function
This function estimates the parameters of interest and plots the resulting association with the predictor of interest, using residuals from the model that removes the effects of the modeled experimental variables.

An overview of the anticipated usage of the functionality of this package is shown in the figure below.

```{r flowchart, echo=FALSE, fig.align = "center", fig.cap="Flowchart illustrating the functionality of the package", fig.small=TRUE}
knitr::include_graphics(system.file("figures/package_structure.png", package = "RMeDPower2"))
```

# Application examples
Here we illustrate the application of RMeDPower2 using data derived from in vitro assays. We further illustrate RMeDPower2 across other domains of biomedical research on our website: https://gladstone-institutes.github.io/RMeDPower2/index.html .
```{r applications, echo=FALSE, fig.align = "center", fig.cap="Application domains of the package", fig.small=TRUE}
knitr::include_graphics(system.file("figures/application_domains.png", package = "RMeDPower2"))
```


## Plate assays

iPSC lines from controls and from sALS patients were obtained through the Answer ALS (AALS) consortium [36]. AALS is the largest repository of ALS patient samples, and contains publicly available patient-specific iPSCs and multi-omic data from motor neurons derived from those iPSCs—including RNA-seq, proteomics and whole-genome sequence data—all of which is matched with longitudinal clinical data (https://dataportal.answerals.org/home) [36].


We used 6 control lines and 4 sALS lines and differentiated each line into motor neurons (iMNs). iMNs were transduced with a morphology marker and imaged on day ~25 using RM [37-45] as previously described [36]. Images were processed and the soma size of the neurons was captured using a contour ellipse function [46] adapted for use in Python (https://www.python.org/).

The next step is to read the data and perform data distribution diagnostics, association analysis, and power simulation.

First, we will read the data from the package and json files with specified parameters:

```{r}
template_dir <- system.file("input_templates/cell_assay_data", package = "RMeDPower2")
data <- plate_assay_pilot_data
design <- readDesign(file.path(template_dir,"design_cell_assay.json"))
model <- readProbabilityModel(file.path(template_dir,"prob_model.json"))
power_param <- readPowerParams(file.path(template_dir,"power_param.json"))
```

A visualization of the distribution of `cell_size2` across experiments and cell lines, as box plots and also as empirical cumulative distribution plots, suggests that the observations have an extremely long tail. We will therefore filter out the top 5 percent of the observations to ensure the results are not skewed by these extreme values. We also see that the observations in experiment 3 are in general lower than those in the other two experiments.
```{r echo=FALSE}
print(ggplot(data, aes(x=experiment, y=cell_size2, color = line)) +
  geom_boxplot() +
  theme_bw() +
  theme(text = element_text(size = 16)))

print(ggplot(data, aes(x=cell_size2, color = experiment)) +
  stat_ecdf() +
  theme_bw() +
  theme(text = element_text(size = 16)))
```

```{r}
data %<>% filter(cell_size2 < quantile(cell_size2, 0.95))
```

We would like to estimate the difference in mean cell size, across all observations derived from all experimental batches, between the ALS disease lines and the control cell lines (modeled by the _classification_ variable). The first step is to run diagnostics on the data and model using the `diagnoseDataModel` function. Assuming that the data-generating distribution is based on the Normal distribution, the Q-Q plot of the residuals of the LMM fit to the log-transformed cell sizes suggests a better fit than the one fit to the cell sizes on the natural scale. Therefore, we will hereafter model the log-transformed cell sizes. The reasonably horizontal best-fit lines for the residuals versus the fitted values, and for the square root of the absolute values of the residuals versus the fitted values, suggest that the LMM assumptions of linearity and homoscedasticity are both reasonable. No individual outlier observations were identified by applying Rosner's test to the distribution of the residuals.


```{r}
diagnose_res <- diagnoseDataModel(data, design, model)
```
The *diagnoseDataModel* function fits at least one model, on the natural scale of the response of interest. If individual observations are inferred to be outliers by Rosner's test applied to the residuals of the fitted model, another model is fit without these outlier observations. In addition, if the assumed probability distribution of the residuals is normal, a further model is fit using logarithm (log) transformed responses, and yet another one if there are outliers in the log-transformed model fit.

To list all the models fit to the data:
```{r}
print(diagnose_res$models)
```
We see that two models were fit to this data: one on the natural scale of cell size and the other on the logarithmic scale. No individual observations were identified as outliers in either model.

Let us look at the basic QC plots for the models fit on the natural scale of the responses.
```{r output-plots1, results='asis'}
plot_no <- 1
plots <- diagnose_res$diagnostic_plots$natural_scale$plots
captions <- diagnose_res$diagnostic_plots$natural_scale$captions
for (i in seq_along(plots)) {
  print(plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:** %s. %s\n\n", plot_no,names(plots)[i], captions[i]))
  plot_no <- plot_no + 1
}

cat("\n\n\n")
cooks_plots <- diagnose_res$cooks_plots$natural_scale
for(i in 1:length(cooks_plots)) {
  print(cooks_plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:**\n\n", plot_no))
  plot_no <- plot_no + 1
}

print("Inferred outlier groups are")
print(diagnose_res$inferred_outlier_groups$natural_scale)
```

Now, let us look at the same set of QC plots for models on the log-transformed responses.
```{r output-plots2, results='asis'}
plots <- diagnose_res$diagnostic_plots$log_scale$plots
captions <- diagnose_res$diagnostic_plots$log_scale$captions
for (i in seq_along(plots)) {
  print(plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:** %s. %s\n\n", plot_no,names(plots)[i], captions[i]))
  plot_no <- plot_no + 1
}


cooks_plots <- diagnose_res$cooks_plots$log_scale

cat("\n\n\n")
for(i in 1:length(cooks_plots)) {
  print(cooks_plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:**\n\n", plot_no))
  plot_no <- plot_no + 1
}

print("Inferred outlier groups are")
print(diagnose_res$inferred_outlier_groups$log_scale)
```

We see that the normality assumption is better met with the log-transformed model (Figure 6) than with the model on the natural scale (Figure 1).

The code also outputs Cook's distance measures for each experiment and cell line, intended to capture how influential each of these was in determining the final parameter estimates of the LMM.


The observations in experimental batch 3, and those for cell lines 4 and 9, were identified as outliers using a 4/n threshold (where n is the number of replicates; n=3 for experimental batches and n=10 for cell lines) applied to the Cook's distance estimates for the model parameters. This warranted further investigation into whether there was something unusual about the observations made in experimental batch 3 (for example, the images from the microscope displayed a low signal-to-noise ratio and may have skewed the measurements) or about cell lines 4 and 9 (for example, the differentiation runs or genetic background differed from the other lines) that would suggest observations linked to this experiment and these cell lines should be removed from the analysis. Our investigation did not indicate anything untoward, so we ran a sensitivity analysis in which we dropped all observations linked to experiment 3, cell line 4 or cell line 9, refit the model, and visualized the resulting change in the ALS-status-associated mean soma perimeter. There were only relatively slight changes in the estimates, so the conclusion of an increase in soma perimeter in the ALS lines holds irrespective of whether or not we drop the observations from the experimental batch or the two cell lines.
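A sensitivity analysis of this kind can be sketched as follows. This is hypothetical code: it assumes the outlier batch is labelled `"3"` in the `experiment` column, and reuses objects and functions introduced earlier in this section.

```{r eval=FALSE}
## Hypothetical sketch: refit after dropping observations from experimental
## batch 3 and compare the condition estimate with the full-data fit.
data_sens <- dplyr::filter(diagnose_res$Data_updated, experiment != "3")
design_sens <- design
design_sens@response_column <- "cell_size2_logTransformed"
sens_res <- getEstimatesOfInterest(data_sens, design_sens, model,
                                   print_plots = FALSE)
print(sens_res[[1]])  # compare the fixed-effect estimate with the full-data fit
```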

We will use the log-transformed `cell_size2` data for parameter estimation using the following commands.

```{r}
design_update <- design
data_update <- diagnose_res$Data_updated
print(colnames(data_update))
design_update@response_column <- "cell_size2_logTransformed"
estimate_res <- getEstimatesOfInterest(data_update, design_update, model, print_plots = FALSE)
```

```{r, results="asis", echo = FALSE}
print(estimate_res[[2]]$plots)
knitr::asis_output(sprintf("\n\n\n**Figure %d:**\n\n", plot_no))
plot_no <- plot_no + 1

```

The median residual boxplot shows the differences in the distribution of the response variable for the two conditions.

```{r}
##print model summary output
print(estimate_res[[1]])
```

The linear mixed model-based association analysis finds the response variable to be significantly associated with the condition variable, with an estimate of 0.03.

We can also estimate the statistical power to detect the observed difference in mean perimeter across up to 15 experimental batches:

```{r}
power_param@max_size <- 15
power_res <- calculatePower(data_update, design_update, model, power_param)
```

```{r,results="asis", echo = FALSE}
print(power_res$plots)
knitr::asis_output(sprintf("\n\n\n**Figure %d:**\n\n", plot_no))
plot_no <- plot_no + 1
```

The power curve shows that at least eight or more experimental batches are required to achieve a power of 80% or greater. The points in the plot represent the mean estimate of power over the simulations while the error bars represent the 95% confidence intervals.
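The power at each design size is estimated as the proportion of simulations in which the condition effect is significant at level `alpha`, and the error bars can be obtained from the binomial distribution of that count. A minimal base-R sketch of this idea follows; the p-values here are illustrative stand-ins, not output of the package.

```{r}
set.seed(1)
alpha <- 0.05
p_values <- runif(100, 0, 0.2)       # stand-in for p-values from 100 simulations
n_signif <- sum(p_values < alpha)    # simulations that reached significance
power_hat <- n_signif / length(p_values)               # plotted point estimate
ci <- binom.test(n_signif, length(p_values))$conf.int  # 95% CI for error bars
print(c(power_hat, ci))
```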

Accordingly, we will now load data from 8 experimental batches and estimate our parameter of interest.

```{r}
full_data <- plate_assay_full_data

## filter out the top 5% of the observations
full_data %<>% filter(perim_2th_effect < quantile(perim_2th_effect, 0.95))

design@response_column <- "perim_2th_effect"
design@condition_column <- "classif"
full_diag_res <- diagnoseDataModel(full_data, design, model)
full_data_update <- full_diag_res$Data_updated
full_design <- design
full_design@response_column <- "perim_2th_effect_logTransformed"
```

Note that now none of the experimental batches or the 13 cell lines are considered outliers or deemed to unduly influence the final parameter estimates.

```{r, results='asis'}
cooks_plots <- full_diag_res$cooks_plots$log_scale
cat("\n\n\n")
for(i in 1:length(cooks_plots)) {
  print(cooks_plots[[i]])
  cat(sprintf("\n\n\n**Figure %d:**\n\n", plot_no))
  plot_no <- plot_no + 1
}

print("Inferred outlier groups are")
print(full_diag_res$inferred_outlier_groups$log_scale)

```

Using the data from 8 experimental batches, we derive an estimate of 0.027 for the mean difference in perimeter, with an associated significance of 0.006.

```{r}
full_res <- getEstimatesOfInterest(full_data_update, full_design, model, print_plots = FALSE)
print(full_res[[1]])
```

```{r, results="asis", echo = FALSE}
print(full_res[[2]]$plots)
knitr::asis_output(sprintf("\n\n\n**Figure %d:**\n\n", plot_no))
plot_no <- plot_no + 1
cat("\n\n\n")
```

# Website

Please check https://gladstone-institutes.github.io/RMeDPower2/index.html for more detailed tutorials on other applications involving single-cell RNA, mouse behavior and electrophysiology experiments. These include illustrations of the application of RMeDPower2 to non-normal count data and an explanation of the simulation setup for sample-size calculations. There is also a guide to defining and explaining the classes required by the package; the guide includes an R Shiny app to help users define the classes appropriate for their experimental design and question of interest.



# Session info

```{r sessionInfo, echo=FALSE}
sessionInfo()
```
