TY - GEN
T1 - Small-space and streaming pattern matching with k edits
AU - Kociumaka, Tomasz
AU - Porat, Ely
AU - Starikovskaya, Tatiana
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer k, a pattern p of length m, and a text T of length n ≥ m, the task is to find substrings of T that are within edit distance k from p. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$ amortized time per character of the text, providing answers correct with high probability. (Hereafter, $\tilde{O}(\cdot)$ hides a poly(log n) factor.) This answers a decade-old question: since the discovery of a poly(k log n)-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no poly(k log n)-space algorithm was known even in the simpler semi-streaming model, where T comes as a stream but p is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. Our central technical contribution is a new space-efficient deterministic encoding of two strings, called the greedy encoding, which encodes a set of all alignments of cost at most k with a certain property (we call such alignments greedy). On strings of length at most n, the encoding occupies $\tilde{O}(k^2)$ space. We use the encoding to compress substrings of the text that are close to the pattern. In order to do so, we compute the encoding for substrings of the text and of the pattern, which requires read-only access to the latter. In order to develop the fully streaming algorithm, we further introduce a new edit distance sketch parameterized by integers n > k. For any string of length at most n, the sketch is of size $\tilde{O}(k^2)$, and it can be computed with an $\tilde{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\tilde{O}(k^3)$ time we can compute their edit distance or certify that it is larger than k. This result improves upon $\tilde{O}(k^8)$-size sketches of Belazzougui and Zhang [FOCS 2016] and very recent $\tilde{O}(k^3)$-size sketches of Jin, Nelson, and Wu [STACS 2021].
AB - In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer k, a pattern p of length m, and a text T of length n ≥ m, the task is to find substrings of T that are within edit distance k from p. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$ amortized time per character of the text, providing answers correct with high probability. (Hereafter, $\tilde{O}(\cdot)$ hides a poly(log n) factor.) This answers a decade-old question: since the discovery of a poly(k log n)-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no poly(k log n)-space algorithm was known even in the simpler semi-streaming model, where T comes as a stream but p is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. Our central technical contribution is a new space-efficient deterministic encoding of two strings, called the greedy encoding, which encodes a set of all alignments of cost at most k with a certain property (we call such alignments greedy). On strings of length at most n, the encoding occupies $\tilde{O}(k^2)$ space. We use the encoding to compress substrings of the text that are close to the pattern. In order to do so, we compute the encoding for substrings of the text and of the pattern, which requires read-only access to the latter. In order to develop the fully streaming algorithm, we further introduce a new edit distance sketch parameterized by integers n > k. For any string of length at most n, the sketch is of size $\tilde{O}(k^2)$, and it can be computed with an $\tilde{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\tilde{O}(k^3)$ time we can compute their edit distance or certify that it is larger than k. This result improves upon $\tilde{O}(k^8)$-size sketches of Belazzougui and Zhang [FOCS 2016] and very recent $\tilde{O}(k^3)$-size sketches of Jin, Nelson, and Wu [STACS 2021].
KW - edit distance
KW - pattern matching
KW - streaming
UR - http://www.scopus.com/inward/record.url?scp=85127184660&partnerID=8YFLogxK
U2 - 10.1109/FOCS52979.2021.00090
DO - 10.1109/FOCS52979.2021.00090
M3 - Conference contribution
T3 - Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS
SP - 885
EP - 896
BT - Proceedings - 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science, FOCS 2021
PB - IEEE Computer Society
T2 - 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021
Y2 - 7 February 2022 through 10 February 2022
ER -