A number of different Markov models of DNA sequence evolution have been proposed. These substitution models differ in terms of the parameters used to describe the rates at which one nucleotide replaces another during evolution. These models are frequently used in molecular phylogenetic analyses. In particular, they are used during the calculation of likelihood of a tree (in Bayesian and maximum likelihood approaches to tree estimation) and they are used to estimate the evolutionary distance between sequences from the observed differences between the sequences. == Introduction == These models are phenomenological descriptions of the evolution of DNA as a string of four discrete states. These Markov models do not explicitly depict the mechanism of mutation nor the action of natural selection. Rather they describe the relative rates of different changes. For example, mutational biases and purifying selection favoring conservative changes are probably both responsible for the relatively high rate of transitions compared to transversions in evolving sequences. However, the Kimura (K80) model described below only attempts to capture the effect of both forces in a parameter that reflects the relative rate of transitions to transversions. Evolutionary analyses of sequences are conducted on a wide variety of time scales. Thus, it is convenient to express these models in terms of the instantaneous rates of change between different states (the Q matrices below). If we are given a starting (ancestral) state at one position, the model's Q matrix and a branch length expressing the expected number of changes to have occurred since the ancestor, then we can derive the probability of the descendant sequence having each of the four states. The mathematical details of this transformation from rate-matrix to probability matrix are described in the mathematics of substitution models section of the substitution model page. By expressing models in terms of the instantaneous rates of change we can avoid estimating a large numbers of parameters for each branch on a phylogenetic tree (or each comparison if the analysis involves many pairwise sequence comparisons). The models described on this page describe the evolution of a single site within a set of sequences. They are often used for analyzing the evolution of an entire locus by making the simplifying assumption that different sites evolve independently and are identically distributed. This assumption may be justifiable if the sites can be assumed to be evolving neutrally. If the primary effect of natural selection on the evolution of the sequences is to constrain some sites, then models of among-site rate-heterogeneity can be used. This approach allows one to estimate only one matrix of relative rates of substitution, and another set of parameters describing the variance in the total rate of substitution across sites. == DNA evolution as a continuous-time Markov chain == === Continuous-time Markov chains === Continuous-time Markov chains have the usual transition matrices which are, in addition, parameterized by time, t {\displaystyle t} . Specifically, if E 1 , E 2 , E 3 , E 4 {\displaystyle E_{1},E_{2},E_{3},E_{4}} are the states, then the transition matrix P ( t ) = ( P i j ( t ) ) {\displaystyle P(t)={\big (}P_{ij}(t){\big )}} where each individual entry, P i j ( t ) {\displaystyle P_{ij}(t)} refers to the probability that state E i {\displaystyle E_{i}} will change to state E j {\displaystyle E_{j}} in time t {\displaystyle t} . Example: We would like to model the substitution process in DNA sequences (i.e. Jukes–Cantor, Kimura, etc.) in a continuous-time fashion. The corresponding transition matrices will look like: P ( t ) = ( p A A ( t ) p A G ( t ) p A C ( t ) p A T ( t ) p G A ( t ) p G G ( t ) p G C ( t ) p G T ( t ) p C A ( t ) p C G ( t ) p C C ( t ) p C T ( t ) p T A ( t ) p T G ( t ) p T C ( t ) p T T ( t ) ) {\displaystyle P(t)={\begin{pmatrix}p_{\mathrm {AA} }(t)&p_{\mathrm {AG} }(t)&p_{\mathrm {AC} }(t)&p_{\mathrm {AT} }(t)\\p_{\mathrm {GA} }(t)&p_{\mathrm {GG} }(t)&p_{\mathrm {GC} }(t)&p_{\mathrm {GT} }(t)\\p_{\mathrm {CA} }(t)&p_{\mathrm {CG} }(t)&p_{\mathrm {CC} }(t)&p_{\mathrm {CT} }(t)\\p_{\mathrm {TA} }(t)&p_{\mathrm {TG} }(t)&p_{\mathrm {TC} }(t)&p_{\mathrm {TT} }(t)\end{pmatrix}}} where the top-left and bottom-right 2 × 2 blocks correspond to transition probabilities and the top-right and bottom-left 2 × 2 blocks corresponds to transversion probabilities. Assumption: If at some time t 0 {\displaystyle t_{0}} , the Markov chain is in state E i {\displaystyle E_{i}} , then the probability that at time t 0 + t {\displaystyle t_{0}+t} , it will be in state E j {\displaystyle E_{j}} depends only upon i {\displaystyle i} , j {\displaystyle j} and t {\displaystyle t} . This then allows us to write that probability as p i j ( t ) {\displaystyle p_{ij}(t)} . Theorem: Continuous-time transition matrices satisfy: P ( t + τ ) = P ( t ) P ( τ ) {\displaystyle P(t+\tau )=P(t)P(\tau )} Note: There is here a possible confusion between two meanings of the word transition. (i) In the context of Markov chains, transition is the general term for the change between two states. (ii) In the context of nucleotide changes in DNA sequences, transition is a specific term for the exchange between either the two purines (A ↔ G) or the two pyrimidines (C ↔ T) (for additional details, see the article about transitions in genetics). By contrast, an exchange between one purine and one pyrimidine is called a transversion. === Deriving the dynamics of substitution === Consider a DNA sequence of fixed length m evolving in time by base replacement. Assume that the processes followed by the m sites are Markovian independent, identically distributed and that the process is constant over time. For a particular site, let E = { A , G , C , T } {\displaystyle {\mathcal {E}}=\{A,\,G,\,C,\,T\}} be the set of possible states for the site, and p ( t ) = ( p A ( t ) , p G ( t ) , p C ( t ) , p T ( t ) ) {\displaystyle \mathbf {p} (t)=(p_{A}(t),\,p_{G}(t),\,p_{C}(t),\,p_{T}(t))} their respective probabilities at time t {\displaystyle t} . For two distinct x , y ∈ E {\displaystyle x,y\in {\mathcal {E}}} , let μ x y {\displaystyle \mu _{xy}\ } be the transition rate from state x {\displaystyle x} to state y {\displaystyle y} . Similarly, for any x {\displaystyle x} , let the total rate of change from x {\displaystyle x} be μ x = ∑ y ≠ x μ x y . {\displaystyle \mu _{x}=\sum _{y\neq x}\mu _{xy}\,.} The changes in the probability distribution p A ( t ) {\displaystyle p_{A}(t)} for small increments of time Δ t {\displaystyle \Delta t} are given by p A ( t + Δ t ) = p A ( t ) − p A ( t ) μ A Δ t + ∑ x ≠ A p x ( t ) μ x A Δ t . {\displaystyle p_{A}(t+\Delta t)=p_{A}(t)-p_{A}(t)\mu _{A}\Delta t+\sum _{x\neq A}p_{x}(t)\mu _{xA}\Delta t\,.} In other words, (in frequentist language), the frequency of A {\displaystyle A} 's at time t + Δ t {\displaystyle t+\Delta t} is equal to the frequency at time t {\displaystyle t} minus the frequency of the lost A {\displaystyle A} 's plus the frequency of the newly created A {\displaystyle A} 's. Similarly for the probabilities p G ( t ) {\displaystyle p_{G}(t)} , p C ( t ) {\displaystyle p_{C}(t)} and p T ( t ) {\displaystyle p_{T}(t)} . These equations can be written compactly as p ( t + Δ t ) = p ( t ) + p ( t ) Q Δ t , {\displaystyle \mathbf {p} (t+\Delta t)=\mathbf {p} (t)+\mathbf {p} (t)Q\Delta t\,,} where Q = ( − μ A μ A G μ A C μ A T μ G A − μ G μ G C μ G T μ C A μ C G − μ C μ C T μ T A μ T G μ T C − μ T ) {\displaystyle Q={\begin{pmatrix}-\mu _{A}&\mu _{AG}&\mu _{AC}&\mu _{AT}\\\mu _{GA}&-\mu _{G}&\mu _{GC}&\mu _{GT}\\\mu _{CA}&\mu _{CG}&-\mu _{C}&\mu _{CT}\\\mu _{TA}&\mu _{TG}&\mu _{TC}&-\mu _{T}\end{pmatrix}}} is known as the rate matrix. Note that, by definition, the sum of the entries in each row of Q {\displaystyle Q} is equal to zero. It follows that p ′ ( t ) = p ( t ) Q . {\displaystyle \mathbf {p} '(t)=\mathbf {p} (t)Q\,.} For a stationary process, where Q {\displaystyle Q} does not depend on time t, this differential equation can be solved. First, P ( t ) = exp ( t Q ) , {\displaystyle P(t)=\exp(tQ),} where exp ( t Q ) {\displaystyle \exp(tQ)} denotes the exponential of the matrix t Q {\displaystyle tQ} . As a result, p ( t ) = p ( 0 ) P ( t ) = p ( 0 ) exp ( t Q ) . {\displaystyle \mathbf {p} (t)=\mathbf {p} (0)P(t)=\mathbf {p} (0)\exp(tQ)\,.} === Ergodicity === If the Markov chain is irreducible, i.e. if it is always possible to go from a state x {\displaystyle x} to a state y {\displaystyle y} (possibly in several steps), then it is also ergodic. As a result, it has a unique stationary distribution π = { π x , x ∈ E } {\displaystyle {\boldsymbol {\pi }}=\{\pi _{x},\,x\in {\mathcal {E}}\}} , where π x {\displaystyle \pi _{x}} corresponds to the proportion of time spent in state x {\displaystyle x} after the Markov chain has run for an infinite amount of time. In DNA evo
Read more →