Skip to main navigation Skip to search Skip to main content

Evolution of N-Gram Frequencies under Duplication and Substitution Mutations

Hao Lou, Moshe Schwartz, Farzad Farnoud Hassanzadeh

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

The driving force behind the generation of biological sequences are genomic mutations that shape these sequences throughout their evolutionary history. An understanding of the statistical properties that result from mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two types of mutations, tandem duplication and substitution. These play a critical role in forming tandem repeat regions, which are common features of the genome of many organisms. We provide a stochastic model and, via stochastic approximation, study the behavior of the frequencies of N- grams in resulting sequences. Specifically, we show that N-gram frequencies converge almost surely to a set which we identify as a function of model parameters. From these frequencies, other statistics can be derived. In particular, we present a method for finding upper bounds on entropy.

Original languageAmerican English
Title of host publication2018 IEEE International Symposium on Information Theory, ISIT 2018
Pages2246-2250
Number of pages5
DOIs
StatePublished - 15 Aug 2018
Event2018 IEEE International Symposium on Information Theory, ISIT 2018 - Vail, United States
Duration: 17 Jun 201822 Jun 2018

Publication series

NameIEEE International Symposium on Information Theory - Proceedings
Volume2018-June

Conference

Conference2018 IEEE International Symposium on Information Theory, ISIT 2018
Country/TerritoryUnited States
CityVail
Period17/06/1822/06/18

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Information Systems
  • Modelling and Simulation
  • Applied Mathematics

Cite this