TY - GEN
T1 - Data-Driven Bee Identification for DNA Strands
AU - Singhvi, Shubhransh
AU - Boruchovsky, Avital
AU - Kiah, Han Mao
AU - Yaakobi, Eitan
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - We study a data-driven approach to the bee-identification problem for DNA strands. The bee-identification problem, introduced by Tandon et al. (2019), requires one to identify M bees, each tagged by a unique barcode, via a set of M noisy measurements. Later, Chrisnata et al. (2022) extended the model to the case where one observes N noisy measurements of each bee, and applied the model to address the unordered nature of DNA storage systems. In such systems, a unique address is typically prepended to each DNA data block to form a DNA strand, but the address may be corrupted. While clustering is usually used to identify the address of a DNA strand, this requires M^2 data comparisons (where M is the number of reads). In contrast, the approach of Chrisnata et al. (2022) avoids data comparisons completely. In this work, we study an intermediate, data-driven approach to this identification task. For the binary erasure channel, we first show that we can almost surely correctly identify all DNA strands under certain mild assumptions. Then we propose a data-driven pruning procedure and demonstrate that on average the procedure uses only a fraction of the M^2 data comparisons. Specifically, for M = 2^n and erasure probability p, the expected number of data comparisons performed by the procedure is κM^2, where ((1 + 2p - p^2)/2)^n ≤ κ ≤ ((1 + p)/2)^n.
UR - http://www.scopus.com/inward/record.url?scp=85171442810&partnerID=8YFLogxK
U2 - 10.1109/ISIT54713.2023.10206637
DO - 10.1109/ISIT54713.2023.10206637
M3 - Conference contribution
T3 - IEEE International Symposium on Information Theory - Proceedings
SP - 797
EP - 802
BT - 2023 IEEE International Symposium on Information Theory, ISIT 2023
T2 - 2023 IEEE International Symposium on Information Theory, ISIT 2023
Y2 - 25 June 2023 through 30 June 2023
ER -