TY - GEN
T1 - Repairing Databases over Metric Spaces with Coincidence Constraints
AU - Kaminsky, Youri
AU - Kimelfeld, Benny
AU - Livshits, Ester
AU - Naumann, Felix
AU - Wajc, David
N1 - Publisher Copyright: © Youri Kaminsky, Benny Kimelfeld, Ester Livshits, Felix Naumann, and David Wajc.
PY - 2025/3/21
Y1 - 2025/3/21
AB - Datasets often contain values that naturally reside in a metric space: numbers, strings, geographical locations, machine-learned embeddings in a vector space, and so on. We study the computational complexity of repairing inconsistent databases that violate integrity constraints, where the database values belong to an underlying metric space. The goal is to update the database values to retain consistency while minimizing the total distance between the original values and the repaired ones. We consider what we refer to as coincidence constraints, which include unary key constraints, inclusion constraints, foreign keys, and generally any restriction on the relationship between the numbers of cells of different labels (attributes) coinciding in a single value, for a fixed attribute set. We begin by showing that the problem is APX-hard for general metric spaces. We then present an algorithm solving the problem optimally for tree metrics, which generalize both the line metric (i.e., where repaired values are numbers) and the discrete metric (i.e., where we simply count the number of changed values). Combining our algorithm for tree metrics and a classic result on probabilistic tree embeddings, we design a (high probability) logarithmic-ratio approximation for general metrics. We also study the variant of the problem where we limit the allowed change of each individual value. In this variant, it is already NP-complete to decide the existence of any legal repair for a general metric, and we present a polynomial-time repairing algorithm for the case of a line metric.
KW - coincidence constraints
KW - Database repairs
KW - foreign-key constraints
KW - inclusion constraints
KW - metric spaces
UR - http://www.scopus.com/inward/record.url?scp=105001547573&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.ICDT.2025.14
DO - 10.4230/LIPIcs.ICDT.2025.14
M3 - Conference contribution
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 28th International Conference on Database Theory, ICDT 2025
A2 - Roy, Sudeepa
A2 - Kara, Ahmet
T2 - 28th International Conference on Database Theory, ICDT 2025
Y2 - 25 March 2025 through 28 March 2025
ER -