Abstract
We build statistical models to describe the substitution process in SARS-CoV-2 as a function of explanatory factors describing the sequence, its function, and more. These models serve two different purposes: first, to gain knowledge about the evolutionary biology of the virus; and second, to predict future mutations in the virus, in particular, non-synonymous amino acid substitutions creating new variants. We use tens of thousands of publicly available SARS-CoV-2 sequences and consider tens of thousands of candidate models. Through a careful validation process, we confirm that our chosen models are indeed able to predict new amino acid substitutions: candidates ranked high by our model are eight times more likely to occur than random amino acid changes. We also show that named variants were highly ranked by our models before their appearance, emphasizing the value of our models for identifying likely variants and potentially utilizing this knowledge in vaccine design and other aspects of the ongoing battle against COVID-19.
Original language | English |
---|---|
Article number | 285 |
Journal | Communications Biology |
Volume | 5 |
Issue number | 1 |
State | Published - Dec 2022 |
All Science Journal Classification (ASJC) codes
- General Biochemistry, Genetics and Molecular Biology
- General Agricultural and Biological Sciences
- Medicine (miscellaneous)