TY - GEN
T1 - An Algorithmic Bridge Between Hamming and Levenshtein Distances
AU - Goldenberg, Elazar
AU - Kociumaka, Tomasz
AU - Krauthgamer, Robert
AU - Saha, Barna
N1 - Publisher Copyright: © Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, and Barna Saha; licensed under Creative Commons License CC-BY 4.0.
PY - 2023/1/1
Y1 - 2023/1/1
N2 - The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost 1/a for a parameter a ≥ 1. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1 + ϵ)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ϵ > 0; then, our algorithm takes Õ(n/a + ak^3) time. Unless a = Õ(1), in which case ED_a resembles the standard edit distance, and for the most interesting regime of small enough k, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I,k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ(n·k_I/k_S + k_S·k_I^3)-time (1,1+ϵ)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S/k_I), and its running time is again sublinear in n whenever k_I ≪ k_S and the overall distance is small enough. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1 + ϵ)-approximation in sublinear time, even for a favorable choice of k.
AB - The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost 1/a for a parameter a ≥ 1. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1 + ϵ)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ϵ > 0; then, our algorithm takes Õ(n/a + ak^3) time. Unless a = Õ(1), in which case ED_a resembles the standard edit distance, and for the most interesting regime of small enough k, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I,k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ(n·k_I/k_S + k_S·k_I^3)-time (1,1+ϵ)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S/k_I), and its running time is again sublinear in n whenever k_I ≪ k_S and the overall distance is small enough. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1 + ϵ)-approximation in sublinear time, even for a favorable choice of k.
UR - http://www.scopus.com/inward/record.url?scp=85147538705&partnerID=8YFLogxK
U2 - https://doi.org/10.4230/LIPIcs.ITCS.2023.58
DO - https://doi.org/10.4230/LIPIcs.ITCS.2023.58
M3 - Conference contribution
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 14th Innovations in Theoretical Computer Science Conference, ITCS 2023
A2 - Kalai, Yael Tauman
PB - Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
T2 - 14th Innovations in Theoretical Computer Science Conference, ITCS 2023
Y2 - 10 January 2023 through 13 January 2023
ER -