Spatial Phylogenetics in R using ‘phylospatial’

This tutorial guides you through the workflow of conducting a spatial phylogenetics analysis in R, using the very recently developped R package phylospatial. This tutorial relies heavily on the vignettes of this package written by Matthew Kling. You can find more details about the functions and the package by visiting the repository https://matthewkling.github.io/phylospatial/. One of the most important developments of this package is the fact that it is able to deal with different types of input occurrence data, from presence/abscence, to abundances, to probabilities of occurrence derived from SDMs (Species distribution models).

Hands on!

Let’s install the packages we will need.

# Let's make sure we have the libraries we need
# if you haven't installed them, uncomment the line below
# install.packages('phylospatial')
library(phylospatial)
library(terra)
library(ape)
library(sf)
library(tmap)

1) The data

We will use the dataset that is distributed with the library phylospatial. First let’s read a raster data that contains the community data

moss_comm <- rast(system.file("extdata", "moss_comm.tif", package = "phylospatial"))

# check the structure of this object by uncommenting and running the line below
# moss_comm

# plot it!
plot(moss_comm)

Now we read the phylogeny and plot it.

moss_tree <- read.tree(system.file("extdata", "moss_tree.nex", package = "phylospatial"))

# check what the object looks like by uncommenting the line below
# moss_tree

# plot it
plot(moss_tree, type = "fan", cex = 0.15)

Now let’s use the function phylospatial() to create an phylospatial object. This function requires two arguments: the community data set, and a phylogeny of the terminals (usually species) in such dataset.

ps <- phylospatial(moss_comm, moss_tree)
plot(ps, "comm")

Questions A

What type of objects are comm, moss_tree, and ps?
What is the structure of the phylospatial object ps? Use the console to explore what information is stored un this object.

Important notes regarding data structure

Phylospatial objects support the use of community data as probabilities (for example the output of species distribution models), abundances, or binary (presence-abscence) data. The phylospatial function will identify the type of data provided, but you can explicitly specify it by declaring the argument data_type= when constructing the phylospatial object.
Notice the tree tips and taxa in the community dataset must be identical.

2) Patterns of alpha phylogenetic diversity

Now that we have our spatiaphylo object that contains information on taxa occurrences in each pixel and a phylogeny with information on the phylogenetic relantionships of those taxa, we might naturally want to know what is the PD of each pixel. The package phylospatial computes all the alpha-phylogenetic diversity metrics with the function ps_diversity(). In the argument metric= we can specify which metrics we want to compute. Today, we will compute taxon diversity (TD), phylogenetic diversity (PD), and phylogenetic endemism (PE) as describe below (but notice that there are many others metrics that can be calculated, you can explore that by looking up the function documentation ?ps_diversity()):

TD—Terminal Diversity, i.e. richness of terminal taxa.
TE—Terminal Endemism, i.e. total endemism-weighted diversity of terminal taxa, a.k.a. “weighted endemism”.
PD—Phylogenetic Diversity, i.e. total branch length occurring in a site.
PE—Phylogenetic Endemism, i.e. endemism-weighted PD.

# compute the alpha diversity metrics
div <- ps_diversity(ps, metric = c("TD","TE", "PD", "PE"))

Use the console to explore the structure of the object div. And let’s plot it!

# plot diversity metrics

#PD
TD_plot <- tm_shape(div$TD) + 
      tm_raster(palette = "inferno", style = "cont") +
      tm_layout(legend.outside = TRUE) + tm_title("Taxon diversity, TD")

TE_plot <- tm_shape(div$TE) + 
      tm_raster(palette = "inferno", style = "cont") +
      tm_layout(legend.outside = TRUE) + tm_title("Taxon endemism, TE")

PD_plot <- tm_shape(div$PD) + 
      tm_raster(palette = "inferno", style = "cont") +
      tm_layout(legend.outside = TRUE) + tm_title("Phylogenetic diversity, PD")


PE_plot <- tm_shape(div$PE) + 
      tm_raster(palette = "inferno", style = "cont") +
      tm_layout(legend.outside = TRUE) + tm_title("Phylogenetic endemism, PE")

# plots
tmap_arrange(TD_plot, TE_plot, PD_plot, PE_plot, nrow = 2, ncol = 2)

Questions B

How do you interpret each metric?
What do the scales represent? Look up the documentation of the function to find the formulas for calculating each metric
Are the different metrics related?

In the figure above, we observe that the phylogenetic diversity is highly correlated with the taxon diversity. The more terminals occur in a pixel, the more branch lengths we add when computing PD. If we are interested in knowing wether a pixel has relatively longer branches while correcting for the number of taxa in that pixel, we can use “relative metrics”. Let’s visualize how PD and relative phylogenetic diversity look like for our dataset.

# calculate pd an rpd
pd_rpd <- ps_diversity(ps, metric = c("PD","RPD"))

# plot them

pd_plot  <- tm_shape(pd_rpd$PD) + 
            tm_raster(palette = "inferno", style = "cont") +
            tm_layout(legend.outside = TRUE) + tm_title("Phylogenetic diversity, PD")

rpd_plot <- tm_shape(pd_rpd$RPD) + 
            tm_raster(palette = "inferno", style = "cont") +
            tm_layout(legend.outside = TRUE) + tm_title("Relative phylogenetic diversity, RPD")

tmap_arrange(pd_plot, rpd_plot, ncol = 2)

Questions C

What differences do you notice with these two metrics in the spatial patterns of California moss?
How do you interpret those differences?

3) Significance testing and categorical analyses of neo- and paleo-endemism

So far we have explored the spatial patterns of diversity metrics. Our observations allow us to compare the diversity across pixels. Nevertheless, we would like to identify regions that have significantly different values than what we would expect, i.e., we want to assign a significance value to our observations. For that purpose, we will use a randomization test. A randomization test involves comparing the observed data to a null expectation. So in this case, we will need to generate a null expectation of what the values of the diversity metrics would be for each pixel. The function ps_rand() internally applies the quantize() algorithm to generate a null hypothesis based on running multiple stratified randomizations.

To run this function, we need to pass on a phylospatial object, indicate the number of randomizations we want to use to create our null expectation, and indicate the metrics that we want to compare. Notice this step might take some time.

# Run the randomizations for our moss dataset, you can increase the number of randomizations modifying the argument n_rand
rand <- ps_rand(ps, n_rand = 50, progress = F,
                metric = c("PD", "PE", "CE", "RPE"))

If you investigate the output of this function, you will observe that it consists of one raster per metric. Let’s plot one of these rasters.

tm_shape(rand$qPE) + 
      tm_raster(palette = "inferno", style = "cont") +
      tm_layout(legend.outside = TRUE)

The values that are plotted indicate, for each cell, the proportion of the randomizations in which the observed value of the metric is larger than the randomization value. For example in the figure above, if a cell has a value of 0.8 this means that in 80% of the randomizations the observed value of PE was greater than the simulated value. In science we often use significance values of alpha = 0.5. If we were to apply such criterium in this scenario, we would only consider the cells with values greater than 0.95 as having significantly higher PE than expected.

CANAPE

Researches may be interested in testing the significance of particular metrics of phylodiversity, and they can use variants of the code above to do so. But one analysis that has become standard in spatial phylogenetics is CANAPE (a categorical analysis of neo- and paleo-endemism, described in Mishler et al. 2014. CANAPE consists on two steps. First, the pixels that have significantly high phylogenetic endemism (PE) are identified. And second, each of these pixels with significantly high PE are assigned into categories of endemism based on RPE, which as we discussed earlier reflects the relative length of the branches in a pixel in comparison with other pixels.

# run CANAPE
# notice we can modify the value of alpha
cp <- ps_canape(rand, alpha = .05)

# plot the results
terra::plot(cp)

The categories of significant endemism to which the pixels may be assigned are:

neoendemism: pixels that have significantly high endemism due to significantly relative short branches as measured in RPE.
paleoendemism: pixels that have significantly high endemism due to significantly relative long branches indicated by RPE
mixed-endemism: pixels that have significantly high endemism but no significance is detected in RPE, which is interpreted as a mix of short and long branches.
super-endemism: its a subcategory of mixed-endemism with particularly high significant endemism.

Questions D

What’s the importance of using randomization tests?
How do you interpret each type of endemism category? Can you think of any evolutionary process that might explain finding such significance values?

4) Patterns of beta phylogenetic diversity

Another interesting aspect of diversity studies is to understand the spatial distribution of individual taxa, answering questions like how similar are communities across space? Is there an evident turnover of communities?. We will also address this question from the phylogenetic diversity perspective.

For this, we will start by computing a pairwise phylogenetic dissimilarity index based on the Bray-Curtis distance, also known as quantitative Sorensen’s dissimilarity. The function ps_add_dissim() will add a dissimilarity matrix to the spatialphylo object.

ps <- ps_add_dissim(ps, method = "sorensen", endemism = TRUE, normalize = TRUE)
ps

## `phylospatial` object
##   - 884 lineages across 1116 sites
##   - community data type: probability 
##   - spatial data class: SpatRaster 
##   - dissimilarity data: sorensen

In the code above setting endemism = TRUE will scale every lineage’s occurrence values to sum to 1 across all sites, giving greater weight to narrowly distributed taxa. Setting normalize = TRUE scales every site’s total occurrence value to sum to 1 across taxa, which results in a distance matrix that emphasizes proportional differences in composition rather than alpha diversity gradients.

Once we have computed the phylogenetic dissimilarity matrix, we might want to visualize how phylogenetically similar are the pixels. For this, we will use the function ps_rbg(). Internally, this function will perform an ordination analysis—in this case we will use a CMDS (classical multidimensional scaling)—that reduces the variation in the communities to three components, which are then used to define colors (using the RBG bands of color space). As a result, the more similar the colors, the more phylogenetically similar are the communities.

ps %>% # take the phylospatial object
      ps_rgb(method = "cmds") %>% # declare what ordination algorithm we want to use
      tm_shape() + # and make a plot
      tm_rgb(max.value = 1, interpolate = FALSE) +
      tm_layout(title = "Phylogenetic community ordination")

While this ordination visualization helps to visualize similarity, it might be difficult to decide where to divide the map if we are trying to do a regionalization of the geographic. For that purpose, we can use a cluster analysis with the function ps_regions(). This function will require us to decide the number of regions that we want to cluster the space in. While there is no perfect answer to define the number of regions and the decision will be somewhat arbitrary, a common approach in clustering methods is to select the least amount of categories k that explain most of the data. To find out that value for our data we will use the function ps_regions_eval().

ps_regions_eval(ps, k = 1:15, plot = TRUE, method = "average")

The plot above suggests that K=3 might be a sensitive value for our clustering analyses. So now we can run a regionalization with k=3, and subsequently plot it.

ps %>%
      ps_regions(k = 3, method = "average") %>% # regionalization
      tm_shape() + # plot
      tm_raster(style = "cat", palette = "Dark2",
                title = "phylogenetic region") +
      tm_layout(legend.outside = TRUE)

It is important to mention that there are multiple algorithms that can be used to generate the clusters, and the regionalization might vary depending on which one is selected, we recommend to read more about clustering methods before making a decision on which algorithm to use.

Questions E

Play with the different values of K. Given the fact that different regionalization algorithms might output different results, how would you approach doing a regionalization for your study system? 2

Final note

By now, I hope you have an idea of the type of analyses you can do with this R package. This tutorial is in no way exhaustive, as there are many other functionalities to explore. I encourage you to use the vignettes of the packages to learn more. In particular an aspect that might be of interest to many is the direct application in conservation with the included prioritization algorithm.