TY - GEN
T1 - Untying Rates of Gene Gain and Loss Leads to a New Phylogenetic Approach
AU - Dvir, Yoav
AU - Snir, Sagi
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - The advent of the genomic era has produced an incredible wealth and resolution of molecular data, posing an unprecedented challenge for molecular systematics and necessitating novel techniques and paradigms. Consequently, whole-genome approaches were developed to extract the evolutionary signal by taking advantage of a larger amount of data. In parallel, and in light of the understanding that in prokaryotes genome dynamics (GD) events, primarily gene gain and loss, provide a significantly richer signal than point mutations in ubiquitous housekeeping genes, GD-based approaches were suggested. However, proper modeling of these data and the processes generating them has lagged behind the pace of their accumulation, both because of a lack of deep understanding and because of technical difficulties. Among the central hurdles to accurate modeling of real data is the relaxation of rate constancy, particularly the untying of gain and loss rates. This relaxation violates key assumptions such as constant genome size and gene set, and model reversibility, and has vast implications for implementation. This work presents a generic stochastic model, the two-ratio process (TRP), which encompasses and deals with these complications. As a special case, it contains the Poissonian process with different gene gain and loss rates as a form of the Birth-Death process with varying population sizes. The lack of reversibility invalidates traditional phylogenetic approaches, yielding a novel two-stage phylogenetic approach in which accurate, bidirectional parameters are first inferred for triplets and later combined by a special cherry-picking method into a complete tree. We show by algebraic techniques that this method is statistically consistent. The method, implemented in the software TDDR (Triplets Directed Distances Reconstruction), was applied to synthetic data, showing an advantage over other approaches that handle similar data without the same model assumption. We also applied it to the Alignable Tight Genomic Clusters (ATGC) Database, where it showed high adequacy to the observed data. The full text of this article appears on bioRxiv.org at https://www.biorxiv.org/content/10.1101/2025.01.27.634999v1. The TDDR code is available on GitHub: https://github.com/YoavDvir/TDDR.
KW - Birth-Death Processes
KW - Phylogenetics
KW - Prokaryotic Genome Dynamics
KW - Statistical Consistency
UR - http://www.scopus.com/inward/record.url?scp=105004255738&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-90252-9_51
DO - 10.1007/978-3-031-90252-9_51
M3 - Conference contribution
SN - 9783031902512
T3 - Lecture Notes in Computer Science
SP - 414
EP - 419
BT - Research in Computational Molecular Biology - 29th International Conference, RECOMB 2025, Proceedings
A2 - Sankararaman, Sriram
PB - Springer Science and Business Media Deutschland GmbH
T2 - 29th International Conference on Research in Computational Molecular Biology, RECOMB 2025
Y2 - 26 April 2025 through 29 April 2025
ER -