Discovering reliable correlations in categorical data

Panagiotis Mandros, Mario Boley, Jilles Vreeken

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In many scientific tasks we are interested in finding correlations in our data. This raises many questions, such as how to reliably and interpretably measure correlation between a multivariate set of attributes, how to do so without having to make assumptions on data distribution or the type of correlation, and how to search efficiently for the most correlated attribute sets. We answer these questions for discovery tasks with categorical data. In particular, we propose a corrected-for-chance, consistent, and efficient estimator for normalized total correlation, in order to obtain a reliable, interpretable, and non-parametric measure for correlation over multivariate sets. For the discovery of the top-k correlated sets, we derive an effective algorithmic framework based on a tight bounding function. This framework offers exact, approximate, and heuristic search. Empirical evaluation shows that already for small sample sizes the estimator leads to low-regret optimization outcomes, while the algorithms are shown to be highly effective for both large and high-dimensional data. Through a case study we confirm that our discovery framework identifies interesting and meaningful correlations.
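
To make the central quantity concrete, here is a minimal Python sketch of a chance-corrected, normalized total correlation score for categorical columns. The abstract does not spell out the estimator, so everything specific here is an assumption for illustration: plug-in (empirical) entropies, normalization by the upper bound sum(H_i) - max(H_i), and a Monte Carlo permutation baseline standing in for whatever correction the paper actually derives. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Plug-in (empirical) Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def normalized_total_correlation(columns):
    """Total correlation, sum(H_i) - H(joint), divided by its upper bound.

    Assumption: normalize by sum(H_i) - max(H_i), which bounds total
    correlation from above; the paper may normalize differently.
    """
    marginals = [entropy(col) for col in columns]
    bound = sum(marginals) - max(marginals)
    if bound <= 0:
        return 0.0  # at most one non-constant column: nothing to correlate
    joint = entropy(list(zip(*columns)))
    return (sum(marginals) - joint) / bound


def chance_corrected_score(columns, n_permutations=200, seed=0):
    """Observed score minus its average over independently shuffled columns.

    Assumption: a Monte Carlo permutation baseline is used here as a
    stand-in for the paper's correction-for-chance term.
    """
    rng = random.Random(seed)
    observed = normalized_total_correlation(columns)
    baseline = 0.0
    for _ in range(n_permutations):
        # Shuffling each column separately keeps marginals, breaks dependence.
        shuffled = [rng.sample(col, len(col)) for col in columns]
        baseline += normalized_total_correlation(shuffled)
    return observed - baseline / n_permutations


# Usage: two identical binary columns versus two independent ones.
x = [0, 1] * 50
dependent = chance_corrected_score([x, list(x)])
a = [0, 0, 1, 1] * 25
b = [0, 1, 0, 1] * 25
independent = chance_corrected_score([a, b])
```

In this sketch the dependent pair scores close to 1, while the independent pair scores at or slightly below 0, because the uncorrected plug-in estimate is biased upward on finite samples and the baseline subtracts that bias.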

Original language: American English
Title of host publication: Proceedings - 19th IEEE International Conference on Data Mining, ICDM 2019
Editors: Jianyong Wang, Kyuseok Shim, Xindong Wu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1252-1257
Number of pages: 6
ISBN (Electronic): 9781728146034
State: Published - Nov 2019
Externally published: Yes
Event: 19th IEEE International Conference on Data Mining, ICDM 2019 - Beijing, China
Duration: 8 Nov 2019 to 11 Nov 2019

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
Volume: 2019-November

Conference

Conference: 19th IEEE International Conference on Data Mining, ICDM 2019
Country/Territory: China
City: Beijing
Period: 8/11/19 to 11/11/19

Keywords

  • Branch-and-bound
  • Information theory
  • Knowledge discovery
  • Optimization
  • Total correlation

All Science Journal Classification (ASJC) codes

  • General Engineering
