TY - CONF
T1 - Discovering Functional Dependencies from Mixed-Type Data
AU - Mandros, Panagiotis
AU - Kaltenpoth, David
AU - Boley, Mario
AU - Vreeken, Jilles
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020
Y1 - 2020/08/23
AB - Given complex data collections, practitioners can perform non-parametric functional dependency discovery (FDD) to uncover previously unknown relationships between variables. However, existing FDD methods are only applicable to nominal data, so in practice non-nominal variables are discretized, e.g., in a pre-processing step. This is problematic because, as soon as a mix of discrete and continuous variables is involved, the interaction of discretization with the various dependency measures from the literature is poorly understood. In particular, it is unclear whether a given discretization method even leads to a consistent dependency estimate. In this paper, we analyze these fundamental questions and derive formal criteria for when a discretization process applied to a mixed set of random variables leads to consistent estimates of mutual information. With these insights, we derive an estimator framework applicable to any task that involves estimating mutual information from multivariate and mixed-type data. Lastly, we use this framework to extend a previously proposed FDD approach for reliable dependencies. Experimental evaluation shows that the derived reliable estimator is both computationally and statistically efficient, and leads to effective FDD algorithms for mixed-type data.
KW - functional dependency discovery
KW - mixed data
KW - mutual information
UR - http://www.scopus.com/inward/record.url?scp=85090413505&partnerID=8YFLogxK
DO - 10.1145/3394486.3403193
M3 - Conference contribution
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 1404
EP - 1414
BT - KDD 2020 - Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020
Y2 - 23 August 2020 through 27 August 2020
ER -