Leakage in data mining: Formulation, detection, and avoidance

Shachar Kaufman, Saharon Rosset, Claudia Perlich

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competi-tions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical i.i.d. assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by expli-citly defining modeling goals and analyzing the broader frame-work of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected.

Original languageEnglish
Title of host publicationProceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11
Pages556-563
Number of pages8
DOIs
StatePublished - 2011
Event17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2011 - San Diego, United States
Duration: 21 Aug 201124 Aug 2011

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2011
Country/TerritoryUnited States
CitySan Diego
Period21/08/1124/08/11

Keywords

  • Data mining
  • Leakage
  • Predictive modeling
  • Statistical inference

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Fingerprint

Dive into the research topics of 'Leakage in data mining: Formulation, detection, and avoidance'. Together they form a unique fingerprint.

Cite this