A Bayesian total-evidence dating (TED) analysis simultaneously infers the topology and the divergence times of a set of terminals. This approach has several advantages. First, it doesn’t require us to place the fossils in a specific clade, as node dating does; instead, the fossil specimens are placed in the phylogeny based on their own morphological and temporal information. As a result, we obtain posterior probabilities for the fossil placements, as opposed to having to assume their phylogenetic position. Second, with this approach we can include more than one fossil putatively belonging to a group, whereas in node dating we usually pick only the oldest fossil to calibrate a node of interest. By including all the fossils available for a group, we may obtain a more complete picture of its evolutionary history.
But as interesting and promising as Bayesian TED analyses are, they are complex. As we analyze more empirical data under this model, we are starting to better understand how the different parts of the model interact, and it has become clear that we need to understand the implications of our modeling decisions. This tutorial was written by an empiricist with an interest in TED methods, so it is intended as a practical guide to running a TED analysis rather than a theoretical treatment. Its objective is to serve as an introduction to the Bayesian total-evidence dating model, using the software RevBayes. I will focus on discussing the different components of the model, highlighting the way we express our prior beliefs about our study system, and showing how these components and priors translate into RevBayes scripts. An equally important objective for me is to provide an introduction to the lingo of Bayesian TED analyses and the structure of RevBayes. Throughout the tutorial, I will point out several references and other tutorials where you can dig deeper into the topics we only cover superficially. The code for this tutorial can be easily adapted to run a total-evidence analysis on your own data. For a more advanced tutorial on Bayesian TED analyses, including code for comparing different models and assessing the adequacy of the model, visit this page. Finally, once you have completed this introductory workshop, I invite you to check out the more extensive lecture recordings of our Time Divergence Estimation Workshop.
The type of data
Before diving into the details of the model, it is worth mentioning the type of data you need in order to use the Bayesian TED model. When the study group includes extant and extinct terminals, the dataset typically includes both molecular and morphological data. But a TED analysis can also be performed using only morphological traits; this is particularly relevant when all the terminals are fossils. There is no restriction on the number of fossils used, but a common recommendation is that the fossils be spread widely in time and across lineages.
The Bayesian TED model
A Bayesian Total Evidence Dating model can be understood as having several components that are themselves models of specific processes (Fig. 1). All these components fit together in a general TED analysis, where they have the potential to inform each other through their interactions. As we can see in Figure 1, we have models of character evolution that describe the way in which characters (morphological or molecular) change across lineages. We have clock models that describe the rate at which changes (morphological or molecular) occur along lineages. And we also have a timetree model that describes how lineages change (appear and disappear) through time. So let’s walk through an example and review each of the five components of a Bayesian TED analysis!
In this tutorial we will work through an example of a total-evidence dating analysis using a very small dataset of extant and extinct species in the tomato family, Solanaceae. You can download all the data files and scripts for this tutorial from this repository. We will be working with a set of 12 terminals, of which two are fossils. The molecular data are the alignments of three molecular markers: ITS, trnLF, and waxy. The morphological matrix contains 16 traits.
The files provided with this tutorial have a specific structure that facilitates clarity and organization. To avoid any problems, don’t modify the provided structure. The main folder is called tutorial_evolution_2022. All the scripts are written to be run from this main directory. You can run the complete analysis by opening your terminal in the folder tutorial_evolution_2022 and running the following command:
rb scripts/TED_tutorial_evolution.Rev
But the objective of the tutorial is to explain some details of the code and how it reflects our modeling decisions and priors. So it is worth taking your time and running the analysis step by step.
The main directory contains three sub-directories: data, scripts, and output. The folder data contains the molecular alignments, the morphological data, and the list of terminals with age information for all the terminals, represented by two columns: the maximum and the minimum stratigraphic age. Feel free to explore the structure of each of these files by opening them in your text editor. The folder scripts contains all our code, and the folder output is where all our result files will be stored.
As we mentioned before, a Bayesian TED analysis is modular: there are natural components, models by themselves, that come together into a larger and more complex model. The organization of the scripts reflects this modularity. There is a main script, TED_tutorial_evolution.Rev, that puts together the different components by calling them with the function source():
# example, don't run this code
# I am running another script called "my_other_script.Rev"
source("my_other_script.Rev")
So let’s get started in the main script. In your text editor, open the script TED_tutorial_evolution.Rev. You can follow along by copying and pasting the pieces of code into your terminal. Make sure you run the code from the folder tutorial_evolution_2022.
We will start by reading all the data: the sequence data, the morphological data, and the taxon data. We will also extract some useful information from them, like the number of alignments, the number of taxa, the number of branches in the tree, and the number of fossils in the dataset.
# read the sequence data
# get the names of our files
markers <- v(
    "data/alignments/ITS_aligned.fasta",
    "data/alignments/trnLF_aligned.fasta",
    "data/alignments/waxy_aligned.fasta"
)
# how many alignments do we have?
naln <- markers.size()
# read the alignments
for(i in 1:naln) {
    moledata[i] = readDiscreteCharacterData(markers[i])
}
# read the morphological data
morphdata = readDiscreteCharacterData("data/morphology.nexus")
# read the taxon data
taxa = readTaxonData("data/taxa.tsv", delimiter=TAB)
ntax = taxa.size()     # number of taxa
nbranch = 2 * ntax - 2 # number of branches in a rooted phylogeny
# count the number of fossils using a for loop
num_fossils = 0
for(i in 1:taxa.size()) {
    if (taxa[i].getMinAge() != 0) {
        num_fossils = num_fossils + 1
    }
}
When we use more than one molecular marker, as in this example, it is often the case that some terminals are not represented in all the alignments, but RevBayes requires that all the alignments contain the same taxa. The next for loop adds any missing terminals to each alignment, coding them as missing data.
# the alignments do not have the total number of taxa,
# so we need to manually add missing taxa to the sequence data
for(i in 1:naln) {
    x = moledata[i]
    x.addMissingTaxa(taxa)
    moledata[i] = x
}
Next, we will prepare the morphological data. Unlike molecular data, which always have the same four possible character states (A, C, G, T), morphological characters may have different numbers of states. Later in the tutorial we will use an Mk model of morphological evolution, which requires that our morphological data be partitioned by number of states. We will do this with a for loop that iterates over the possible numbers of states and retains only the characters that have a given number of states.
# first, we add missing taxa to the morphological data
morphdata.addMissingTaxa(taxa)
# partition morphological data by number of states
# the maximum number of states in my dataset
n_max_states = 10 # consider changing this for your own data, if necessary
# start a counter
idx = 0
# for each number of character states
for (i in 2:n_max_states) {
    # make a temporary character alignment
    morph_tmp <- morphdata
    # retain characters with i states
    morph_tmp.setNumStatesPartition(i)
    # if there are more than zero characters with this many states,
    # store the data partition
    if ( morph_tmp.nchar() > 0 ) {
        morph_partitions[++idx] = morph_tmp
        nc[idx] = i
    }
}
# compute the number of morphological partitions
num_morph_parts = morph_partitions.size()
Now our data is ready!
An important part of writing code in RevBayes is specifying the moves that the Markov chain Monte Carlo (MCMC) algorithm will use to propose new values for the parameters we are estimating. We always specify at least one move per parameter, and sometimes we use moves that operate on more than one parameter at a time (e.g., increasing the value of one parameter while decreasing the value of another). We usually specify the moves right after we specify the prior distribution of a parameter. We will keep track of all the moves used in our analysis by storing them in a single container: a vector of moves:
# create empty vector of moves
# we will fill it as we specify moves
moves = VectorMoves()
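As a purely hypothetical sketch of the pattern you will see throughout the tutorial (my_rate is an invented parameter, not part of this analysis), a prior and its move are typically specified together like this:
# hypothetical example, don't run this code
# "my_rate" is an invented parameter used only to illustrate the pattern
my_rate ~ dnExponential(10)                     # prior on the parameter
moves.append( mvScale(my_rate, weight = 1.0) )  # proposal (move) on its value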
As the MCMC explores the parameter space, it will keep track of the values of the parameters we are estimating using monitors (functions that usually record variables to a file). As with the moves, we need a container for the monitors:
# monitors vector
monitors = VectorMonitors()
The next piece of code stores in memory the number \(H\), a constant with a convenient property: it is the standard deviation of a lognormal distribution whose 95% probability interval spans exactly one order of magnitude. We will use it below to set the width of our priors on rates.
# convenient constants
H = ln(10) / (qnorm(0.975) - qnorm(0.025))
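If you want to convince yourself of this property: for a lognormal distribution with log-scale standard deviation \(H\), the ratio between the 97.5% and the 2.5% quantiles is \(\exp\big(H \times (\text{qnorm}(0.975) - \text{qnorm}(0.025))\big) = \exp(\ln 10) = 10\), i.e., exactly one order of magnitude.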
We are now ready to start specifying our model! If you want to learn more about moves, you can follow the tutorial Introduction to MCMC using RevBayes.
The first component of the model is the tree model. In the main script we call it with:
source("scripts/time_tree_model.Rev")
To see the script, go to tutorial_evolution_2022/scripts/time_tree_model.Rev. We will use a tree based on a Fossilized Birth Death Process (commonly called an FBD tree). In a few words, the simplest Fossilized Birth Death Process has three parameters: the speciation rate \(\lambda\), the extinction rate \(\mu\), and the fossilization rate \(\psi\).
Our decision to use an FBD process as the basis of our timetree implies that we believe this is a good representation of how lineages evolve through time.
Notes for your thoughts: What do you think? Is this a satisfactory model?
In today’s tutorial we will use a constant-rate FBD process, where the speciation, extinction, and fossilization rates are the same through time and across lineages. But this model can be extended to accommodate variation in the rates. If you are interested in learning more about the FBD process, I recommend visiting this tutorial and this tutorial.
Let’s start by specifying some dataset-specific information. First, because we haven’t sampled all the extant taxa, we will specify a sampling-fraction parameter, rho:
# first we specify some dataset-specific parameters
extant_taxa = 9   # number of extant taxa in the tree
total_taxa = 2800 # total number of extant taxa in the group
rho <- extant_taxa / total_taxa # sampling fraction
The timetree model requires that we specify a parameter for the time of the start of the FBD process. In this case, we will specify an origin time for the stem lineage that gave rise to our focal group (Solanaceae) and our outgroup (Convolvulaceae, represented in our dataset by Ipomoea batatas). We will specify the age of that clade based on the work of Magallón et al. (2015), who inferred the 95% HPD crown age of Solanaceae to be between 85.91 and 100.34 Mya. Without any other available information, we will use a uniform prior on origin_time extending across this age range. And we will add a move for this parameter.
origin_min = 85.91
origin_max = 100.34
origin_time ~ dnUniform(origin_min, origin_max) # Prior
moves.append( mvSlide( origin_time, weight = 5) ) # Moves
Notes for your thoughts: Can you think of a different way of specifying a starting point for the FBD process? Is there a different source of information you would use?
As we mentioned before, we will use the simplest FBD tree model: one where the rates don’t vary through time or across lineages. We will start by specifying an empirical prior (i.e., a prior that is informed by our data) on the diversification rate such that, on average, it gives rise to the total number of taxa we observe today in Solanaceae. The standard deviation of this distribution is \(H\), which expresses a prior belief that the true diversification rate lies within one order of magnitude around the prior mean.
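A note on the parameterization in the code below: a lognormal distribution with log-scale parameters \(m\) and \(s\) has mean \(\exp(m + s^2/2)\), so setting \(m = \ln(\text{prior mean}) - s^2/2\), as the code does, makes the mean of the prior exactly equal to the empirical value we just computed.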
# empirical prior on the diversification rate
diversification_prior_mean <- ln(total_taxa) / origin_time
diversification_prior_sd <- H
# the diversification rate
diversification ~ dnLognormal( ln(diversification_prior_mean) - diversification_prior_sd * diversification_prior_sd * 0.5, diversification_prior_sd)
# moves on the diversification rate
moves.append( mvScale(diversification, weight = 1.0) )
We will draw the relative extinction rate (extinction divided by speciation) from a uniform prior that ranges between 0 and 1: the rate cannot be negative, and if it were greater than 1, extinction would outpace speciation and we probably wouldn’t have any lineages in the present. We then add a move on the relative extinction rate.
# the relative extinction rate
relext ~ dnUniform(0, 1)
moves.append( mvSlide(relext, weight = 1.0) )
In this example we chose to parametrize our model by placing priors on the diversification rate and the relative extinction rate, because it is easier to put a prior on diversification than on speciation. We then transform the diversification and relative extinction rates into speciation and extinction rates.
# transform to speciation and extinction
lambda := abs(diversification / (1 - relext))
mu := abs(lambda * relext)
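To see why this transformation works: writing the diversification rate as \(d = \lambda - \mu\) and the relative extinction rate as \(r = \mu / \lambda\), substituting \(\mu = \lambda r\) into the first expression gives \(d = \lambda (1 - r)\), so \(\lambda = d / (1 - r)\) and \(\mu = \lambda r\), exactly as in the code above. (The abs() calls simply tell RevBayes to treat the results as positive real numbers.)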
Notes for your thoughts: what kind of evolutionary events could we accommodate if we allow these rates to vary? Are those events relevant for all the empirical cases?
Once again, we are using a constant-rate model, so we assume that the fossilization rate is constant through time and across lineages. We will use an empirical prior that centers the fossilization rate such that, on average, we expect to see the number of fossils we observe in our empirical dataset:
# empirical prior on the fossilization rate
fossilization_prior_mean <- num_fossils * diversification_prior_mean / (exp(diversification_prior_mean * origin_time) - 1)
fossilization_prior_sd <- 2 * H
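The logic behind this prior mean: the expected number of fossils is \(\psi\) times the total branch length of the tree, and for a lineage diversifying at net rate \(d\) over a time span \(T\), the expected total branch length is roughly \((e^{dT} - 1)/d\). Solving \(\psi \, (e^{dT} - 1)/d = \text{number of fossils}\) for \(\psi\) gives the expression in the code.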
We then draw the fossilization rate \(\psi\) from a lognormal distribution, and add some moves on this parameter:
# the fossilization rate
psi ~ dnLognormal( ln(fossilization_prior_mean) - fossilization_prior_sd * fossilization_prior_sd * 0.5, fossilization_prior_sd)
# add moves
moves.append( mvScale(psi, weight = 1.0) )
Now that we have defined all the elements of an FBD tree (the starting time, and the speciation, extinction, and fossilization rates), we can create our FBD distribution of trees, dnFBDP, from which we will sample our timetrees:
# make the FBD tree
timetree ~ dnFBDP(
    originAge = origin_time,
    lambda    = lambda,
    mu        = mu,
    psi       = psi,
    condition = "survival",
    taxa      = taxa,
    rho       = rho
)
As with any parameter that is being estimated, we need to set up some moves on the tree. In fact, the tree topology is hard to estimate, so we will place many moves on it. Let’s start with the topology and node ages:
# proposals on the tree topology and node ages
moves.append( mvSubtreeScale(timetree, weight=ntax/5.0) )
moves.append( mvFNPR(timetree, weight = ntax ) )
moves.append( mvNarrow(timetree, weight = 5 * ntax ) )
moves.append( mvNodeTimeSlideUniform(timetree, weight = 5 * ntax ) )
moves.append( mvSubtreeScale(timetree, weight = 5 * ntax ) )
moves.append( mvRootTimeSlideUniform(timetree, weight = ntax, origin = origin_time) )
Because under this model fossils can be sampled ancestors (i.e., fossils placed directly along a branch of the tree rather than at a tip), we will add a move that proposes collapsing and expanding fossil branches, and we will keep track of the number of sampled ancestors (i.e., store that variable in the memory of our computer).
# MCMC proposals on whether fossils are sampled ancestors
moves.append( mvCollapseExpandFossilBranch(timetree, weight = num_fossils, origin = origin_time) )
# keep track of number of sampled ancestors
num_sampled_ancestors := timetree.numSampledAncestors()
Next, we will add some moves on the ages of the fossils to accommodate the stratigraphic uncertainty that we expressed in our initial taxa.tsv.
# MCMC proposals on the fossil ages
fossils = timetree.getFossils()
for(i in 1:fossils.size()) {
    # get the fossil age
    t[i] := tmrca(timetree, clade(fossils[i]))
    # specify the age constraint
    a = fossils[i].getMinAge()
    b = fossils[i].getMaxAge()
    F[i] ~ dnUniform(t[i] - b, t[i] - a)
    F[i].clamp( 0 )
    # specify a proposal on the fossil age
    moves.append( mvFossilTimeSlideUniform(timetree, origin_time, fossils[i], weight = 1.0) )
}
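A brief note on the trick in this loop: clamping \(F[i]\) to 0 means the uniform likelihood is nonzero only when \(t[i] - b \le 0 \le t[i] - a\), that is, only when the fossil’s age \(t[i]\) falls within its stratigraphic interval \([a, b]\). This effectively places a uniform prior on each fossil age over its observed age range.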
And our last move is a joint move on the tree and the branch rates. Joint moves are useful when we have parameters that are correlated.
# make a joint proposal for the tree (and later the branch rates)
moves.append( mvUpDownScale(weight=10, tuneTarget=0.234, lambda=0.01) )
tree_scale_move_index = moves.size()
moves[tree_scale_move_index].addVariable(timetree, up=false)
moves[tree_scale_move_index].addVariable(origin_time, up=false)
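The rationale for this joint move, in case it seems mysterious: node ages and evolutionary rates trade off against each other (an older tree with slower rates explains the data about as well as a younger tree with faster rates), so proposals that change both quantities together allow the MCMC to travel along this ridge in the posterior much more efficiently than single-parameter moves can.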
Finally, we will keep track of the tree length (TL) throughout our analysis, so we will add it as a deterministic variable:
# keep track of the tree length
TL := timetree.treeLength()
We are now done specifying our timetree model!
Notes for your thoughts: The fossilization rate is a variable that has probably been misnamed. Many paleontologists will argue that a better name for this rate would be fossil recovery rate. What do you think? Does this have implications for our modeling decisions?
Back to the main script, we will specify our molecular evolution model (i.e. the way in which molecular changes occur) with the next piece of code:
source("scripts/mol_evol_model.Rev")
Let’s walk through the code. Go to tutorial_evolution_2022/scripts/mol_evol_model.Rev. We will use a standard GTR + Gamma molecular evolution model, partitioned by marker. In our model, we will specify a Q matrix (an instantaneous-rate matrix) for each marker. Q matrices contain the rates at which each nucleotide changes to a different nucleotide at a given instantaneous (i.e., very short) point in time. And because we expect different positions within our alignment to change at different rates, we will use a discretized gamma distribution to accommodate this variation.
Because we have three molecular markers, and we need a Q matrix (mole_Q) and the molecular site rates (mole_site_rates) for each of them, we will use a for loop in our code. Notice that we will also add some moves for each parameter inside the loop.
# Q matrix per partition
# for each alignment
for (i in 1:naln) {
    # the stationary frequencies
    mole_pi[i] ~ dnDirichlet(rep(1,4))
    moves.append( mvBetaSimplex(mole_pi[i], weight = 1.0) )
    # the exchangeability rates
    mole_kappa[i] ~ dnDirichlet(rep(1,6))
    moves.append( mvBetaSimplex(mole_kappa[i], weight = 1.0) )
    # create the Q matrices
    mole_Q[i] := fnGTR(mole_kappa[i], mole_pi[i])
    # the site rates drawn from a discretized gamma distribution
    alpha[i] ~ dnExponential( 1.0 )
    moves.append( mvScale(alpha[i], lambda=0.01, weight=1.0) )
    moves.append( mvScale(alpha[i], lambda=0.1,  weight=1.0) )
    moves.append( mvScale(alpha[i], lambda=1,    weight=1.0) )
    mole_site_rates[i] := fnDiscretizeGamma( alpha[i], alpha[i], 4 )
}
Next, we expect the rates of molecular evolution to vary between markers too! We will accommodate this by creating a rate multiplier (mole_relative_rates) that we will use later to scale the substitution rate across loci.
# we allow different markers to evolve at different rates
# relative-rate multipliers
mole_proportional_rates ~ dnDirichlet(rep(1, naln))
moves.append( mvBetaSimplex(mole_proportional_rates, weight = naln) )
# rescale the rates so the mean is 1
mole_relative_rates := abs(mole_proportional_rates * naln)
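Why multiply by naln? Because mole_proportional_rates is a simplex (its elements sum to 1), its elements average \(1/\text{naln}\); multiplying by the number of alignments therefore gives relative rates with mean 1, so the marker-specific rates remain identifiable relative to the overall clock rate.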
I didn’t go into much detail about the molecular evolution model, but if you want to learn more, you can visit this tutorial for more information on the molecular evolution models, or this tutorial to learn more about why and how we partition molecular data.
Notes for your thoughts: As for all the components of the Bayesian TED model, you can choose between different models and ways to parametrize them. In practical terms, the molecular evolution parameters of a GTR + Gamma model are often some of the easiest to estimate within this framework.
Now let’s switch gears and think about our morphological evolution model. As we did for the molecular data, we will now specify a model that describes how the morphological changes occur in our lineages.
It is worth taking a minute to acknowledge that morphological characters are harder to model because they are not easy to generalize. For example, unlike molecular characters, different morphological characters may have different numbers of character states, and it is unclear what the best partitioning scheme is. Accurately modeling morphological character evolution requires expertise both in the study group and in statistical modeling. The truth is that modeling morphological evolution is an active and promising field of research. If you want to know more about it, you can start with this tutorial.
Back in the main script, we will run our Mk morphological evolution model by sourcing the script:
source("scripts/morph_evol_model.Rev")
You can check the code for this component of the main Bayesian TED model by going to scripts/morph_evol_model.Rev.
Today, we will use the Mk model for our morphological traits. This model describes the change from one character state to another using a matrix (similar to the Q matrix), such that we need a 2x2 matrix to model the evolution of a trait with 2 character states, a 3x3 matrix to model the evolution of a trait with 3 states, and so on. That is the practical reason why we needed to partition our morphological matrix by number of character states.
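For example, for a character with 3 states, the (unnormalized) Mk instantaneous-rate matrix is \( Q = \begin{pmatrix} -2 & 1 & 1 \\ 1 & -2 & 1 \\ 1 & 1 & -2 \end{pmatrix} \): every change between states occurs at the same rate, and each diagonal entry is the negative sum of the other entries in its row.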
Regardless of the number of states, the Mk model assumes that all the changes among character states occur at the same rate. It is an extension of the JC model of character evolution, and that is why the function we use is fnJC. Once more, notice that we do this in a for loop, so we have one morph_Q per partition of morphological data.
# for each morphological partition
for (i in 1:num_morph_parts) {
    morph_Q[i] <- fnJC(nc[i])
}
We don’t expect that different characters evolve at the same rate. We will relax this assumption of the Mk model by adding a rate multiplier drawn from a discretized gamma distribution. This is analogous to the + Gamma model for molecular evolution.
# Specifying the gamma parameter
# number of categories
num_rate_cats_morph = 4
# prior
morph_site_rate_alpha ~ dnUniform(0, 10)
# moves
moves.append( mvScale(morph_site_rate_alpha) )
# site rates
morph_site_rates := fnDiscretizeGamma(morph_site_rate_alpha, morph_site_rate_alpha, num_rate_cats_morph)
Notes for your thoughts: What other particularities do morphological datasets have? Perhaps in the way we, biologists, collect the data. Yes! We tend to focus only on the characters that vary between terminals! I invite you to read about this ascertainment bias here.
Back again in the main script, we will specify our morphological clock model by sourcing the following script:
source("scripts/morph_clock_model.Rev")
In that script, we will describe the rate at which morphological traits evolve along the lineages.
We will assume that the average rate of morphological evolution is drawn from a loguniform prior. So we will set the prior and add some moves on it as follows:
# draw the log of the rate from a uniform distribution
morph_clock_rate_log ~ dnUniform(-10, 1)
moves.append( mvSlide(morph_clock_rate_log) )
morph_clock_rate_log.setValue(-7)
# exponentiate
morph_clock_rate := exp(morph_clock_rate_log)
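In case the extra exp() step seems odd: placing a uniform prior on the log of the rate is equivalent to placing a log-uniform prior on the rate itself, here spanning roughly \(e^{-10} \approx 4.5 \times 10^{-5}\) to \(e^{1} \approx 2.7\) changes per character per unit time; the setValue(-7) call simply gives the MCMC a reasonable starting value within that range.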
In this tutorial we will assume that the morphological clock and the molecular clock are correlated. That is, we will assume that a lineage whose sequences evolve faster also has a morphology that evolves faster, relative to other lineages. The way we will parametrize this is by setting the individual morphological branch rates (morph_branch_rates) as a function of the molecular branch rates (mole_branch_rates), scaled by the average rate of morphological evolution relative to the average rate of molecular evolution.
# the branch-specific morphological rates
morph_branch_rates := morph_clock_rate/mole_branch_rate_mean * mole_branch_rates
Notes for your thoughts: Do you think that molecular and morphological rates of change are correlated? How would you test it?
As we did for the molecular side of the model, we now have to specify the CTMC (continuous-time Markov chain) that connects our morphological data to the model. Back in the main script, we will use a for loop to do this for each partition:
# make the CTMC for the morphological data
for (i in 1:morph_Q.size()) {
    morph[i] ~ dnPhyloCTMC(timetree, morph_Q[i], branchRates = morph_branch_rates,
                           siteRates = morph_site_rates, coding = "variable", type = "Standard")
    morph[i].clamp( morph_partitions[i] )
}
Notice that this time we add the argument coding = "variable" to accommodate the fact that we are only analyzing characters that vary, and type = "Standard" to communicate to RevBayes that this is categorical morphological data rather than DNA.
The last thing we need to do is store our fully specified model, with its attached data, in an object called my_model.
my_model = model(timetree)
Now that we have fully specified our Bayesian TED model (yay!), we are ready to run our analysis. The code that follows gives RevBayes directions for how to run our MCMC: for example, for how long (i.e., how many generations), how often and where to store the output, and whether or not to print information to the screen as the analysis runs.
Before starting to set these parameters, it is worth mentioning that you can check this tutorial if you need a refresher on what an MCMC is and how it works.
From an empiricist’s point of view, you need to remember that an MCMC (Markov chain Monte Carlo) analysis uses an algorithm (here, the Metropolis-Hastings algorithm) to propose changes (yes, this is where all those moves you set up come into play) to the parameters of a model. The Metropolis-Hastings algorithm is set up so that a proposed change is always accepted when it increases the posterior probability of the parameter, and only sometimes accepted when it doesn’t. Most importantly, the MCMC visits parameter values proportionally to their posterior probability (PP), so values with higher PP will be visited more often. Thus, when we visualize the density of the values that the MCMC visited (in Tracer, for example), we get an estimate of the parameter value and the uncertainty associated with it!
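For the curious, in its simplest form the acceptance probability of a proposed value \(\theta'\) given the current value \(\theta\) is \(\alpha = \min\!\left(1,\; \frac{P(\theta' \mid D)}{P(\theta \mid D)} \times \frac{q(\theta \mid \theta')}{q(\theta' \mid \theta)}\right)\), where \(q\) is the proposal density of the move: proposals that increase the posterior are always accepted, and downhill proposals are accepted with probability equal to this (Hastings-corrected) posterior ratio.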
Now let’s start by specifying where we want to store the results of the analysis. The files will be saved in a directory called output/run_1. Having the extra folder (run_1) becomes especially handy when you do multiple runs of one analysis; for example, maybe you are trying slightly different versions of one analysis that you want to compare later.
# specifying the output directory
output_dir = "output"
output_extra = "run_1" # change the name as you wish
output_filename = output_dir + "/" + output_extra + "/"
The MCMC will usually run for thousands of generations. We can decide whether or not to store every single parameter value the MCMC visits. Here, we will store the values every 10 generations:
# sample (and print) every 10 generations
printgen <- 10
We also need to set up monitors (i.e., the things that the computer will keep track of). In principle, the MCMC can run without storing any of the states it visits, so we need to tell it what to record. First, to print the states to the screen of our computer:
monitors.append( mnScreen(printgen = printgen) )
To save the sampled parameter values to a file:
monitors.append( mnModel(filename = output_filename + "params.log", printgen = printgen, exclude = ["F"]) )
And we also want a separate file for the trees:
monitors.append( mnFile(filename = output_filename + "tree.trees", printgen = printgen, timetree) )
Next, we will tell the MCMC how many generations to run. One generation roughly corresponds to making each proposal a number of times proportional to its weight. In this example we will run the MCMC for 20000 generations.
ITERATIONS = 20000
RevBayes has the option of running a burn-in chain. This preliminary chain tunes the moves on the parameters to improve mixing, and its samples are discarded.
BURNIN = 1000
Next, we store all the information about our MCMC analysis in one object called my_mcmc. Notice that the arguments are the model, the monitors, and the moves.
my_mcmc = mcmc(my_model, monitors, moves)
We first run our burnin chain:
my_mcmc.burnin(BURNIN, tuningInterval=5)
And now we run our MCMC (YEY!):
my_mcmc.run(ITERATIONS)
At this point, our analysis will start running, printing the values of our parameters to the screen every 10 generations. If you started your analysis at the beginning of this tutorial by running the main script, it might be done by now. If not, be patient; it may take some time.
The last little piece of code in our main script makes summary trees. Once the MCMC has finished, we will read in the file that contains the trees. Notice we specify that these trees are timetrees with the argument "clock".
# Making a summary tree:
# Read trace
full_trees = readTreeTrace(output_filename + "tree.trees", "clock")
We then calculate the Maximum Clade Credibility (MCC) tree. This is a summary tree: the tree from our posterior sample (yes, from our 2000 sampled trees!) whose clades have the highest combined support.
# Get the Maximum credibility tree (MCC)
mccTree(full_trees, output_filename + "MCC_tree.tre", conditionalAges = TRUE)
We might also want to see the Maximum a Posteriori (MAP) tree: the tree in the sample that has the highest posterior probability. This is not a consensus tree, but a single tree from our sample.
# get the Maximum a posteriori tree (MAP)
mapTree(full_trees, output_filename + "MAP_tree.tre")
You can now visualize your trees in FigTree and your output in Tracer. In the scripts folder, you will also find an R script that uses the R package RevGadgets to produce a plot like the following:
Important note: Remember that before making inferences using the trees that result from your analysis, you need to make sure that the MCMC has converged. If you are not sure what that means, check out this tutorial. Furthermore, even if your MCMC has converged, that only means your results are accurate given your model and your priors; it says nothing about the adequacy of the model itself. For that, you need to perform model comparison and model adequacy analyses. This statement is not meant to discourage you from applying this type of model to your data! On the contrary, it is meant to encourage you to keep exploring these methods in more advanced tutorials, where you will learn how to compare models and assess their adequacy for your data.
Final notes for your thoughts: For many of the previous notes, which model is best depends on our data. The way to approach these questions is through model comparison (to determine which of several competing models is better) and model adequacy (to determine whether our model assumptions are fair).
Acknowledgments: Thanks to Mike May and Carl Rothfels for their continuous mentorship as I navigate phylogenetics!