Jump to content

Minimum evolution

From Wikipedia, the free encyclopedia

Minimum evolution is a distance method employed in phylogenetics modeling. It shares with maximum parsimony the aspect of searching for the phylogeny that has the shortest total sum of branch lengths.[1][2]

The theoretical foundations of the minimum evolution (ME) criterion lay in the seminal works of both Kidd and Sgaramella-Zonta (1971)[3] and Rzhetsky and Nei (1993).[4] In these frameworks, the molecular sequences from taxa are replaced by a set of measures of their dissimilarity (i.e., the so-called "evolutionary distances") and a fundamental result states that if such distances were unbiased estimates of the true evolutionary distances from taxa (i.e., the distances that one would obtain if all the molecular data from taxa were available), then the true phylogeny of taxa would have an expected length shorter than any other possible phylogeny T compatible with those distances.

Relationships with and comparison with other methods

[edit]
Comparison Table for Phylogenetic Tree-Building Methods
Principle Usage Case Time Complexity
Minimum Evolution (ME) Searches for the shortest total sum of branch lengths Large datasets where character-based methods would be too demanding NP-hard without utilizing heuristics
Maximum Parsimony Searches for the minimal number of evolutionary steps (i.e. mutations) Similar sequence comparisons NP-hard for large datasets
Neighbor Joining Iteratively joins the shortest branch lengths to construct a single tree Large datasets / Datasets with few informative sites O(N3), can be improved using heuristics
Maximum Likelihood Looks for the most probable function possible given data parameters Large datasets where low bias is important O(N), this improves with larger datasets
Unweighted Pair Group Method with Arithmetic Mean(UPGMA) Creates a cluster of clusters and finds the arthmetic mean of the clusters Initial examination of data O(N2)

Maximum parsimony

[edit]

It is worth noting here a subtle difference between the maximum-parsimony criterion and the ME criterion: while maximum-parsimony is based on an abductive heuristic, i.e., the plausibility of the simplest evolutionary hypothesis of taxa with respect to the more complex ones, the ME criterion is based on Kidd and Sgaramella-Zonta's conjectures that were proven true 22 years later by Rzhetsky and Nei.[4] These mathematical results set the ME criterion free from the Occam's razor principle and confer it a solid theoretical and quantitative basis.

Similarly to ME, maximum parsimony becomes an NP-hard problem when trying to find the optimal tree[5] (that is, the one with the least total character-state changes). This is why heuristics are often utilized in order to select a tree, though this does not guarantee the tree will be an optimal selection for the input dataset. This method is often used when very similar sequences are analyzed, as part of the process is locating informative sites in the sequences where a notable number of substitutions can be found. [6]

Maximum-parsimony criterion, which uses Hamming distance branch lengths, was shown to be statistically inconsistent in 1978. This led to an interest in statistically consistent alternatives such as ME.[7]

Neighbor joining

[edit]

Neighbor joining may be viewed as a greedy heuristic for the balanced minimum evolution (BME) criterion. Saito and Nei's 1987 NJ algorithm far predates the BME criterion of 2000. For two decades, researchers used NJ without a firm theoretical basis for why it works.[8]

While neighbor joining shares the same underlying principle of prioritizing minimal evolutionary steps, it differs in that it is a distance method as opposed to maximum parsimony, which is a character-based method. Distance methods like neighbor joining are often simpler to implement and more efficient, which has led to its popularity for analyzing especially large datasets where computational speed is critical. Neighbor joining is a relatively fast phylogenetic tree-building method, though its worst-case time complexity can still be O(N3) without utilizing heuristic implementations to improve on this.[9] It also considers varying rates of evolution across branches, which many other methods do not account for.

Neighbor joining also is a rather consistent method in that an input distance matrix with little to no errors will often provide an output tree with minimal inaccuracy. However, using simple distance values rather than full sequence information like in maximum parsimony does lend itself to a loss of information due to the simplification of the problem.[10]

Maximum Likelihood

[edit]

Maximum likelihood contrasts itself with Minimum Evolution in the sense of Maximum likelihood is a combination of the testing of the most likely tree to result from the data. However, due to the nature of the mathematics involved it is less accurate with smaller datasets but becomes far less biased as the sample size increases, this is due to due to the error rate being 1/log(n). Minimal evolution is similar but it is less accurate with very large datasets. It is similarly powerful but overall much more complicated compared to UPGMA and other options.[11]

UPGMA

[edit]

UPGMA is a clustering method. It builds a collection of clusters that are then further clustered until the maximum potential cluster is obtained.  This is then worked backwards to determine the relation of the groups. It specifically uses an arithmetic mean enabling a more stable clustering. Overall while it is less powerful compared to any of the other listed comparisons it is far simpler and less complex to create. Minimal Evolution is overall more powerful but also more complicated to set up, and is also NP hard. [12]

Statistical consistency

[edit]

The ME criterion is known to be statistically consistent whenever the branch lengths are estimated via the Ordinary Least-Squares (OLS) or via linear programming.[4][13][14] However, as observed in Rzhetsky & Nei's article, the phylogeny having the minimum length under the OLS branch length estimation model may be characterized, in some circumstance, by negative branch lengths, which unfortunately are empty of biological meaning.[4]

To solve this drawback, Pauplin[15] proposed to replace OLS with a new particular branch length estimation model, known as balanced basic evolution (BME). Richard Desper and Olivier Gascuel[16] showed that the BME branch length estimation model ensures the general statistical consistency of the minimum length phylogeny as well as the non-negativity of its branch lengths, whenever the estimated evolutionary distances from taxa satisfy the triangle inequality.

Le Sy Vinh and Arndt von Haeseler[17] have shown, by means of massive and systematic simulation experiments, that the accuracy of the ME criterion under the BME branch length estimation model is by far the highest in distance methods and not inferior to those of alternative criteria based e.g., on Maximum Likelihood or Bayesian Inference. Moreover, as shown by Daniele Catanzaro, Martin Frohn and Raffaele Pesenti,[18] the minimum length phylogeny under the BME branch length estimation model can be interpreted as the (Pareto optimal) consensus tree between concurrent minimum entropy processes encoded by a forest of n phylogenies rooted on the n analyzed taxa. This particular information theory-based interpretation is conjectured to be shared by all distance methods in phylogenetics.

Francois Denis and Olivier Gascuel[19] proved that the Minimum Evolution principle is not consistent in weighted least squares and generalized least squares. They showed that there was an algorithm that could be used in OLS models where all weights are equal called EDGE_LENGTHS. In this algorithm the lengths of two edges, 1u and 2u can be computed without using distances δij(i,j≠1,2). This property does not hold in WLS models or in the GLS models. Without this property the ME principle is not consistent in the WLS and GLS models.

Algorithmic aspects

[edit]

The "minimum evolution problem" (MEP), in which a minimum-summed-length phylogeny is derived from a set of sequences under the ME criterion, is said to be NP-hard.[20][21] The "balanced minimum evolution problem" (BMEP), which uses the newer BME criterion, is APX-hard.[20]

A number of exact algorithms solving BMEP have been described.[22][23][24][25] The best known exact algorithm[26] remains impractical for more than a dozen taxa, even with multiprocessing.[20] There is only one approximation algorithm with proven error bounds, published in 2012.

In practical use, BMEP is overwhelmingly implemented by heuristic search. The basic, aforementioned neighbor-joining algorithm implements a greedy version of BME.[27]

FastME, the "state-of-the-art",[20] starts with a rough tree then improves it using a set of topological moves such as Nearest Neighbor Interchanges (NNI). Compared to NJ, it is about as fast and more accurate.[28]

FastME operates on the Balanced Minimum Evolution principle, which calculates tree length using a weighted linear function of all pairwise distances. The BME score for a given topology is expressed as:

where represents the evolutionary distance between taxa and , and is a topology-dependent weight that balances each pair’s contribution. This approach enables more accurate reconstructions than greedy algorithms like NJ.

The algorithm improves tree topology through local rearrangements, primarily Subtree Prune and Regraft (SPR) and NNI operations. At each step, it checks if a rearranged tree has a lower BME score. If so, the change is retained. This iterative refinement enables FastME to converge toward near-optimal solutions efficiently, even for large datasets.

Simplified pseudocode of FastME:

Input: Distance matrix D for n taxa
Output: Tree T minimizing BME score

1. Construct initial tree T using Neighbor Joining
2. Repeat until no improvement:
    a. For each possible SPR or NNI move:
        i. Apply the move to produce new tree T'
        ii. Compute BME score of T'
        iii. If score(T') < score(T), set T = T'
3. Return final tree T

Simulations reported by Desper and Gascuel demonstrate that FastME consistently outperforms NJ in terms of topological accuracy, particularly when evolutionary rates vary or distances deviate from strict additivity. It has also been successfully used on datasets with over 1,000 taxa.[29]

Like most distance-based methods, BME assumes that the input distances are additive. When this assumption does not hold—due to noise, unequal rates, or other violations—the resulting trees may still be close to optimal, but accuracy can be affected. Metaheuristics have also been used.[30]

See also

[edit]

References

[edit]
  1. ^ Catanzaro, Daniele (2010). Estimating phylogenies from molecular data, in Mathematical approaches to polymer sequence analysis and related problems. Springer, New York.
  2. ^ Catanzaro D (2009). "The minimum evolution problem: Overview and classification". Networks. 53 (2): 112–125. doi:10.1002/net.20280. S2CID 6018514.
  3. ^ Kidd KK, Sgaramella-Zonta LA (1971). "Phylogenetic analysis: Concepts and methods". American Journal of Human Genetics. 23 (3): 235–252. PMC 1706731. PMID 5089842.
  4. ^ a b c d Rzhetsky A, Nei M (1993). "Theoretical foundations of the minimum evolution method of phylogenetic inference". Molecular Biology and Evolution. 10: 21073–1095.
  5. ^ Day WH (1987). "Computational complexity of inferring phylogenies from dissimilarity matrices". Bulletin of Mathematical Biology. 49 (4): 461–7. doi:10.1007/BF02458863. PMID 3664032.
  6. ^ Zou, Yue; Zhang, Zixuan; Zeng, Yujie; Hu, Hanyue; Hao, Youjin; Huang, Sheng; Li, Bo (May 11, 2024). "Common Methods for Phylogenetic Tree Construction and Their Implementation in R". Bioengineering. 11 (5): 480. doi:10.3390/bioengineering11050480. PMC 11117635. PMID 38790347.
  7. ^ Catanzaro, Daniele; Frohn, Martin; Gascuel, Olivier; Pesenti, Raffaele (July 2022). "A tutorial on the balanced minimum evolution problem". European Journal of Operational Research. 300 (1): 1–19. doi:10.1016/j.ejor.2021.08.004. hdl:10278/3742270.
  8. ^ Gascuel O, Steel M (2006). "Neighbor-joining revealed". Mol Biol Evol. 23 (11): 1997–2000. doi:10.1093/molbev/msl072. PMID 16877499.
  9. ^ Studier, J. A.; Keppler, K. J. (November 1988). "A note on the neighbor-joining algorithm of Saitou and Nei". Molecular Biology and Evolution. 5 (6): 729–31. doi:10.1093/oxfordjournals.molbev.a040527. ISSN 1537-1719. PMID 3221794.
  10. ^ Saitou, Naruya; Nei, Masatoshi (July 1987). "The neighbor-joining method: A new method for reconstructing phylogenetic trees". Molecular Biology and Evolution. 4 (4): 406–425. doi:10.1093/oxfordjournals.molbev.a040454. PMID 3447015.
  11. ^ Felsenstein, Joseph (1973). "Maximum Likelihood and Minimum-Steps Methods for Estimating Evolutionary Trees from Data on Discrete Characters". Systematic Zoology. 22 (3): 240–249. doi:10.2307/2412304. ISSN 0039-7989.
  12. ^ Moulton, Vincent; Spillner, Andreas; Wu, Taoyang (2018-04-18). "UPGMA and the normalized equidistant minimum evolution problem". Theoretical Computer Science. 721: 1–15. doi:10.1016/j.tcs.2018.01.022. ISSN 0304-3975.
  13. ^ Desper R, Gascuel O (2005). The minimum evolution distance-based approach to phylogenetic inference in Mathematics of Evolution and Phylogeny. Oxford University Press, New York.
  14. ^ Catanzaro D, Aringhieri R, Di Summa M, Pesenti R (2015). "A branch-price-and-cut algorithm for the minimum evolution problem". European Journal of Operational Research. 244 (3): 753–765. doi:10.1016/j.ejor.2015.02.019. S2CID 1549028.
  15. ^ Pauplin Y (2000). "Direct calculation of a tree length using a distance matrix". Journal of Molecular Evolution. 51 (1): 41–47. Bibcode:2000JMolE..51...41P. doi:10.1007/s002390010065. PMID 10903371. S2CID 8619412.
  16. ^ Desper R, Gascuel O (March 2004). "Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting". Molecular Biology and Evolution. 21 (3): 587–98. doi:10.1093/molbev/msh049. PMID 14694080.
  17. ^ Vihn LS, von Haeseler A (2005). "Shortest triplet clustering: Reconstructing large phylogenies using representative sets". BMC Bioinformatics. 6: 1–14. doi:10.1186/1471-2105-6-92. PMC 1097715. PMID 15819989.
  18. ^ Catanzaro D, Frohn M, Pesenti R (2020). "An information theory perspective on the Balanced Minimum Evolution Problem". Operations Research Letters. 48 (3): 362–367. doi:10.1016/j.orl.2020.04.010. S2CID 218998400.
  19. ^ Denis F, Gascuel O (2003). "On the consistency of the minimum evolution principle of phylogenetic inference". Discrete Applied Mathematics. 127: 63–77. doi:10.1016/S0166-218X(02)00285-8. ISSN 0166-218X.
  20. ^ a b c d Rzhetsky, A., and Nei, M. (1993). Theoretical foundation of the minimum-evolution method of phylogenetic inference. Molecular Biology and Evolution, 10(5), 1073–1095.
  21. ^ Semple, C., and Steel, M. (2003). Phylogenetics. Oxford University Press.
  22. ^ Hickey, G., Dehne, F., Rau-Chaplin, A., & Blouin, C. (2008). Parallel and memory-efficient algorithms for constructing evolutionary trees from biological sequence data. Journal of Parallel and Distributed Computing, 68(4), 505–515.
  23. ^ Kordi, M., & Bansal, M. S. (2015). On the complexity of the Balanced Minimum Evolution Problem. Theoretical Computer Science, 596, 77–90.
  24. ^ Chor, B., Hendy, M. D., & Snir, S. (2006). Minimum-entropy phylogenetic trees. Journal of Computational Biology, 13(5), 1101–1116.
  25. ^ Gascuel, O., & Steel, M. (2006). Neighbor-Joining revealed. Molecular Biology and Evolution, 23(11), 1997–2000.
  26. ^ Kordi, M., & Bansal, M. S. (2015). On the complexity of the Balanced Minimum Evolution Problem. Theoretical Computer Science, 596, 77–90.
  27. ^ Gascuel, O. (1997). BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution, 14(7), 685–695.
  28. ^ Desper, R., & Gascuel, O. (2002). Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. Journal of Computational Biology, 9(5), 687–705.
  29. ^ Desper, R., & Gascuel, O. (2005). The Minimum Evolution Distance-Based Approach to Phylogenetic Inference. In Gascuel, O. (Ed.), Mathematics of Evolution and Phylogeny. Oxford University Press, pp. 1–28.
  30. ^ Ali, R. A., Dehne, F., Rau-Chaplin, A., & Sack, J. R. (2009). A novel metaheuristic for constructing large phylogenetic trees. Proceedings of the 2009 ACM Symposium on Applied Computing, 1080–1084.

Further reading

[edit]