TY - GEN
T1 - Homomorphic fingerprints under misalignments
T2 - 45th Annual ACM Symposium on Theory of Computing, STOC 2013
AU - Andoni, Alexandr
AU - Goldberger, Assaf
AU - McGregor, Andrew
AU - Porat, Ely
PY - 2013
Y1 - 2013
N2 - Fingerprinting is a widely used technique for efficiently verifying that two files are identical. More generally, linear sketching is a form of lossy compression (based on random projections) that also enables the "dissimilarity" of non-identical files to be estimated. Many sketches have been proposed for dissimilarity measures that decompose coordinatewise, such as the Hamming distance between alphanumeric strings or the Euclidean distance between vectors. However, virtually nothing is known about sketches that would accommodate alignment errors. With such errors, Hamming or Euclidean distances are rendered useless: a small misalignment may result in a file that looks very dissimilar to the original file according to such measures. In this paper, we present the first linear sketch that is robust to a small number of alignment errors. Specifically, the sketch can be used to determine whether two files are within a small Hamming distance of being a cyclic shift of each other. Furthermore, the sketch is homomorphic with respect to rotations: it is possible to construct the sketch of a cyclic shift of a file given only the sketch of the original file. The relevant dissimilarity measure, known as the shift distance, arises in the context of embedding edit distance, and our result addresses an open problem [26, Question 13] with a rather surprising outcome. Our sketch projects a length-n file into D(n) · polylog n dimensions, where D(n) ≪ n is the number of divisors of n. The striking fact is that this is near-optimal, i.e., the D(n) dependence is inherent to a problem that is ostensibly about lossy compression. In contrast, we then show that any sketch for estimating the edit distance between two files, even when small, requires sketches whose size is nearly linear in n. This lower bound addresses a long-standing open problem on the low-distortion embeddings of edit distance [36, Question 2.15], [24], for the case of linear embeddings.
AB - Fingerprinting is a widely used technique for efficiently verifying that two files are identical. More generally, linear sketching is a form of lossy compression (based on random projections) that also enables the "dissimilarity" of non-identical files to be estimated. Many sketches have been proposed for dissimilarity measures that decompose coordinatewise, such as the Hamming distance between alphanumeric strings or the Euclidean distance between vectors. However, virtually nothing is known about sketches that would accommodate alignment errors. With such errors, Hamming or Euclidean distances are rendered useless: a small misalignment may result in a file that looks very dissimilar to the original file according to such measures. In this paper, we present the first linear sketch that is robust to a small number of alignment errors. Specifically, the sketch can be used to determine whether two files are within a small Hamming distance of being a cyclic shift of each other. Furthermore, the sketch is homomorphic with respect to rotations: it is possible to construct the sketch of a cyclic shift of a file given only the sketch of the original file. The relevant dissimilarity measure, known as the shift distance, arises in the context of embedding edit distance, and our result addresses an open problem [26, Question 13] with a rather surprising outcome. Our sketch projects a length-n file into D(n) · polylog n dimensions, where D(n) ≪ n is the number of divisors of n. The striking fact is that this is near-optimal, i.e., the D(n) dependence is inherent to a problem that is ostensibly about lossy compression. In contrast, we then show that any sketch for estimating the edit distance between two files, even when small, requires sketches whose size is nearly linear in n. This lower bound addresses a long-standing open problem on the low-distortion embeddings of edit distance [36, Question 2.15], [24], for the case of linear embeddings.
KW - Edit distance
KW - Fingerprinting
KW - Lower bounds
KW - Sketching
UR - http://www.scopus.com/inward/record.url?scp=84879823474&partnerID=8YFLogxK
U2 - https://doi.org/10.1145/2488608.2488726
DO - https://doi.org/10.1145/2488608.2488726
M3 - Conference contribution
SN - 9781450320290
T3 - Proceedings of the Annual ACM Symposium on Theory of Computing
SP - 931
EP - 940
BT - STOC 2013 - Proceedings of the 2013 ACM Symposium on Theory of Computing
Y2 - 1 June 2013 through 4 June 2013
ER -