TY - GEN
T1 - Greedy Partition Distance Under Stochastic Models - Analytic Results
AU - Snir, Sagi
N1 - Publisher Copyright: © 2019, Springer Nature Switzerland AG.
PY - 2019
Y1 - 2019
N2 - Gene partitioning is a very common task in genomics, based on criteria such as gene function, homology, interactions, and more. Given two such partitions, a metric to compare them is called for. One such metric is based on a multi symmetric difference, in which elements are removed from both partitions until identity is reached. Although this task can be solved exactly by maximum weight bipartite matching, the standard algorithm for that problem may become impractical in settings common in comparative genomics. In previous work we studied the universal pacemaker (UPM), in which genes are clustered according to mutation rate correlation, and suggested a very fast greedy procedure for comparing partitions. Under arbitrary inputs, this procedure is guaranteed only a poor approximation ratio of 1/2. In this work we give a probabilistic analysis of the procedure under a common and natural stochastic environment. We show that under mild size requirements and a sound model assumption, the procedure returns the correct result with high probability. Furthermore, we show that in the context of the UPM this natural requirement holds automatically, which makes this fast greedy procedure statistically consistent. We also discuss the application of this procedure to the rudimentary comparative genomics task of gene orthology, where such a solution is imperative.
UR - http://www.scopus.com/inward/record.url?scp=85066856166&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-20242-2_22
DO - 10.1007/978-3-030-20242-2_22
M3 - Conference contribution
SN - 9783030202415
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 257
EP - 269
BT - Bioinformatics Research and Applications - 15th International Symposium, ISBRA 2019, Proceedings
A2 - Li, Min
A2 - Cai, Zhipeng
A2 - Skums, Pavel
PB - Springer Verlag
T2 - 15th International Symposium on Bioinformatics Research and Applications, ISBRA 2019
Y2 - 3 June 2019 through 6 June 2019
ER -