sensiPhy: R-package for sensitivity analysis in phylogenetic comparative methods.

Gustavo Paterno, Caterina Penone, Gijsbert Werner

2020-04-02

##Introduction to sensiPhy The sensiPhy package provides simple functions to perform sensitivity analyses in phylogenetic comparative methods. It uses several simulation methods to estimate the impact of different types of uncertainty on Phylogenetic comparative methods:

  1. Species Sampling uncertainty (sample size; influential species and clades)
  2. Phylogenetic uncertainty
  3. Data uncertainty (intraspecific variation and measurement error)

Functions for sensitivity analysis

sensiPhy functions use a common syntax that combines the type of uncertainty and the type of analysis:

where “xxx” can be one of the 5 sensiPhy methods (see Figure below):

sensiPhy workflow & functions

Additional functions

1. Sampling uncertainty

1.1 Sensitivity analysis for influential species

The influ method performs leave-one-out deletion analysis and detects influential species on parameter estimates.

1.1.3 influ_continuous() and influ_discrete: Trait Evolution Continuous and Discrete Characters

Continuous characters

Different evolutionary models from fitContinuous can be used: BM,OU, EB, trend, lambda, kappa, delta and drift.

Discrete characters

Different character models from fitDiscrete can be fit by changing the model argument. These include ER (equal-rates), SYM (symmetric), ARD (all-rates-different) and meristic (stepwise fashion). Similarly, all transformations to the phylogenetic tree from fitDiscrete can be used: none, EB, lambda, kappa and delta.

1.2. Sensitivity analysis for influential clades

The clade method can be used to estimate and test the influence of specific clades on parameter estimates.

1.2.1 clade_phylm(): Phylogenetic linear regression

Additional arguments: clade.col: The name of a column in the provided data frame with clades specification (a character vector with clade names). n.species: Minimum number of species in the clade in order to include this clade in the leave-one-out deletion analysis. Default is 5. n.sim: The number of repetitions for the randomization test

sensi_plot(clade, "Cebidae")

1.3. Sensitivity analysis for sample size

The samp method performs analyses of sensitivity to species sampling by randomly removing species from the dataset and detecting the effects on parameter estimates.

1.3.1 samp_phylm: Phylogenetic Linear regression

1.3.2 samp_physig: Phylogenetic signal (k or lambda)

1.3.3 samp_continuous() and samp_discrete: Trait Evolution Continuous and Discrete Characters

Continuous characters

Discrete characters

2. Phylogenetic uncertainty

The tree method performs sensitivity analysis to estimate the influence of phylogenetic uncertainty on parameter estimates.

Additional arguments:
n.tree: Number of times to repeat the analysis with n different trees picked randomly in the multiPhylo file.

2.1 tree_phylm: tree method for phylogenetic linear regression

2.2 tree_physig(): tree method for phylogenetic signal

2.3 tree_continuous() and tree_discrete: tree method for trait evolution of continuous and discrete characters

Continuous characters

Discrete characters

3. Data uncertainty

The intra method performs Sensitivity analysis for intraspecific variation and measurement error

3.1 intra_phylm(): intra method for phylogenetic linear regression

Additional arguments:
Vy: Name of the column containing the standard deviation or the standard error of the response variable. When information is not available for one taxon, the value can be 0 or NA.
Vx: Name of the column containing the standard deviation or the standard error of the predictor variable. When information is not available for one taxon, the value can be 0 or NA
x.transf: Transformation for the response variable (e.g. log or sqrt). Please use this argument instead of transforming data in the formula directly.
y.transf: Transformation for the predictor variable (e.g. log or sqrt). Please use this argument instead of transforming data in the formula directly.
distrib: A character string indicating which distribution to use to generate a random value for the response and/or predictor variables. Default is normal distribution: “normal” (function rnorm). Uniform distribution: “uniform” (runif) Warning: we recommend to use normal distribution with Vx or Vy = standard deviation of the mean.
n.intra: Number of times to repeat the analysis with n different trees picked randomly in the multiPhylo file. If NULL, times = 2

3.2 intra_physig(): intra method for phylogenetic signal

Additional arguments:
V: Name of the column containing the standard deviation or the standard error of the trait variable. When information is not available for one taxon, the value can be 0 or NA.

4. Interacting uncertainties

The ‘sensiPhy’ package can perform sensitivity analysis by interacting multiple types of uncertainty (described above).

‘sensiPhy’ implements functions to study interactions of both phylogenetic uncertainty (tree-functions) and data uncertainty (intra-functions) with sampling uncertainty (clade, influ, and samp), as well as interactions between data and phylogenetic uncertainty (tree_intra).

Both intra and tree methods can be used together with all other uncertainties to evaluate the interaction between two types of uncertainty at the same time.

Examples:

5. Additional functions

5.1 Combine data and phylogeny

The function match_dataphy combines phylogeny and data to ensure that tips in phylogeny match data and that observations with missing values are removed.

This function uses all variables provided in the ‘formula’ to match data and phylogeny. To avoid cropping the full dataset, ‘match_dataphy’ searches for NA values only on variables provided by formula. Missing values on other variables, not included in ‘formula’, will not be removed from data.

5.2 Phylogenetic signal for missing data

Background

The supplementary function miss.phylo.d tests if there is a phylogenetic signal in missing data, using the D framework developed in Fritz & Purvis (2010). A phylogenetic signal in missing data can bias comparative analyses, because it means that absence of data points is not independent of the phylogeny. This function can help alert users to this problem, suggesting that more balanced data collection would be required.

The function calculates D statistic for missing data. Missingness is recoded into a binary variable (1 = missing, 0 = non missing).

5.3 Diversification and speciation rates

Background

The funcion tree_bdestimates diversification and speciation rates evaluating uncertainty in tree topology. It estimates net diversification rate using geiger::bd.ms Magallon and Sanderson (2000) method or speciation rate using geiger::bd.km Kendall-Moran method for n trees, randomly picked from a multiPhylo file.

6. Using sensiPhy with other packages (caper, phytools, gls)

sensiPhy leverages comparative methods implemented in specific R packages:
1. phylolm for pgls regressions
2. phytools for phylogenetic signal estimates
3. geiger for macroevolutionary models (continuous and discrete trait evolution)

Nevertheless, users can still use sensiPhy to perform sensitivity analysis when they had performed their initial comparative analyses in another package.

For instance, if you have used caper, phytools or gls to fit your full models, you can still use sensiPhy to test the sensitivity of those models to multiple types of uncertainty. In order to do this, users will need to set arguments that mirror the specific parameters from their full analysis. For example, to compare results one needs to use the same evolutionary model in pgls (e.g “lambda”, “BM”, etc), or the same metric for phylogenetic signal tests (K or lambda). See below for some examples:

6.1 PGLS regression

### prepare dataset for caper pgls:
library(caper)
library(phytools)
library(phylolm)
library(sensiPhy)
# selected variables from dataset:
alien.data <- alien.data[c("gestaLen", "adultMass")]

# create variable with species names:
alien.data$sp <- rownames(alien.data)

# caper
# prepare comparative dataset:
comp.dat <- comparative.data(data = alien.data, phy = alien.phy[[1]], 
                             names.col = "sp")
# check comparative dataset:
print(comp.dat)

### Run PGLS analysis (full model)
# using caper (lambda)
fit.caper.lam <- pgls(log(gestaLen)~ log(adultMass), comp.dat, lambda="ML")
coef(fit.caper.lam)

# using caper (delta)
fit.caper.del <- pgls(log(gestaLen)~ log(adultMass), comp.dat, delta="ML")
coef(fit.caper.del)

# using phylolm (lambda)
fit.phylo.lam <- phylolm(log(gestaLen)~ log(adultMass), comp.dat$data, comp.dat$phy,
                 model = "lambda")
coef(fit.phylo.lam)

# using phylolm (del)
fit.phylo.del <- phylolm(log(gestaLen)~ log(adultMass), comp.dat$data, comp.dat$phy,
                         model = "delta")
coef(fit.phylo.del)

### run sensitivity analysis with sensiPhy (influential species):
library(sensiPhy)
# run analysis (lambda):
lam <- influ_phylm(log(gestaLen) ~ log(adultMass), phy = comp.dat$phy, 
                     data = comp.dat$data, model = "lambda")
# check full model estimates and compare to initial estimates:
lam$full.model.estimates
# test for influential species:
summary(lam)

# run analysis (delta):
del <- influ_phylm(log(gestaLen) ~ log(adultMass), phy = comp.dat$phy, 
                     data = comp.dat$data, model = "delta")
# check full model estimates and compare to initial estimates:
del$full.model.estimates
# test for influential species:
summary(del)

7. How long does it take?

Although most functions implemented in sensiPhy are reasonably fast (runtime less them 1 min), it is important to be aware that some functions might take a long time to run. This is more likely when analyzing models with a high number of species (>1000) or number of repetitions, and also when interacting two types of uncertainties (e.g. functions: tree_clade and tree_samp). To keep track of the time while running functions, we recommend users to set the argument “track = TRUE” which prints a progress bar and gives a good idea of how long the analysis will take before it finishes. As an indication, some examples of elapsed time on a standard desktop computer [Linux, Intel i7, 2.7GHz, 12GB ram] are presented below:

Analyses with single uncertainty: tree_phylm

function N species N trees time
tree_phylm 100 30 ~ 2 sec
tree_phylm 100 100 ~ 5 sec

Analyses with single uncertainty: clade_phylm

function N species N sim time
clade_phylm 100 100 ~ 7 sec
clade_phylm 100 300 ~ 31 sec
clade_phylm 100 500 ~ 61 sec

Analyses with single uncertainty: samp_phylm

function N species N sim time
samp_phylm 100 100 ~ 11 sec
samp_phylm 100 300 ~ 25 sec
samp_phylm 100 500 ~ 43 sec
samp_phylm 500 100 ~ 22 sec
samp_phylm 1000 100 ~ 35 sec

Analyses with single uncertainty: influ_phylm

function N species time
influ_phylm 100 ~ 4 sec
influ_phylm 500 ~ 43 sec
influ_phylm 1000 ~ 148 sec

8. How to support sensiPhy

Thanks for you interest in contributing with sensiPhy. First you need to register at Github. You can help developing sensiPhy in different ways:

8.1 Report bugs or errors

If you have any problem running sensiPhy functions, please report the problem as a new issue. To do this, follow the steps below:

8.2 Small changes in sensiPhy documentation

If you find a typo on sensiPhy documentation or if you want to make a small change on any file, you can do it online. Search for the file you want to change at the master branch, then:

8.3 Contribute with code

If you want to make a bigger contribution and help us with sensiPhy code:

  1. Fork it (https://github.com/paternogbc/sensiPhy/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

8.4 New ideas or suggestions?

If you need to ask something via email, send to paternogbc@gmail.com

8.5 Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.