Abstract
This paper studies the construction of p-values for nonparametric outlier detection, from a multiple-testing perspective. The goal is to test whether new independent samples belong to the same distribution as a reference data set or are outliers. We propose a solution based on conformal inference, a general framework yielding p-values that are marginally valid but mutually dependent for different test points. We prove these p-values are positively dependent and enable exact false discovery rate control, although in a relatively weak marginal sense. We then introduce a new method to compute p-values that are valid conditionally on the training data and independent of each other for different test points; this paves the way to stronger type-I error guarantees. Our results depart from classical conformal inference as we leverage concentration inequalities rather than combinatorial arguments to establish our finite-sample guarantees. Further, our techniques also yield a uniform confidence bound for the false positive rate of any outlier detection algorithm, as a function of the threshold applied to its raw statistics. Finally, the relevance of our results is demonstrated by experiments on real and simulated data.
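The marginal conformal p-values described above can be sketched in a few lines. The following is a minimal illustration, not the paper's own code: it assumes a one-dimensional nonconformity score (larger = more outlier-like) and uses the standard conformal rank formula p = (1 + #{calibration scores ≥ test score}) / (n + 1), followed by the Benjamini-Hochberg procedure for FDR control; all function names and the Gaussian toy data are invented for the example.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Marginal conformal p-values for outlier testing.

    For each test point, p = (1 + #{j : cal_scores[j] >= test_score}) / (n + 1),
    which is marginally super-uniform when the test point is exchangeable
    with the calibration (reference) data.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    n = cal_scores.size
    # Count, for each test score, how many calibration scores are at least as large.
    counts = (cal_scores[None, :] >= test_scores[:, None]).sum(axis=1)
    return (1 + counts) / (n + 1)

def benjamini_hochberg(pvals, alpha=0.1):
    """Boolean rejection mask from the Benjamini-Hochberg step-up procedure."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.sort(pvals)
    below = np.nonzero(order <= alpha * np.arange(1, m + 1) / m)[0]
    if below.size == 0:
        return np.zeros(m, dtype=bool)
    cutoff = order[below[-1]]
    return pvals <= cutoff

# Toy data (illustrative only): reference scores from |N(0,1)|,
# test set mixing inliers with a few shifted outliers.
rng = np.random.default_rng(0)
cal = np.abs(rng.normal(size=1000))
test = np.concatenate([np.abs(rng.normal(size=50)),
                       np.abs(rng.normal(loc=4.0, size=10))])
pv = conformal_pvalues(cal, test)
rejected = benjamini_hochberg(pv, alpha=0.1)
```

Because the p-values for different test points share the same calibration set, they are mutually dependent; the paper's point is that this dependence is positive, so applying BH to them still controls the false discovery rate (in the marginal sense discussed in the abstract).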
| Original language | English |
|---|---|
| Pages (from-to) | 149-178 |
| Number of pages | 30 |
| Journal | Annals of Statistics |
| Volume | 51 |
| Issue number | 1 |
| DOIs | |
| State | Published - Feb 2023 |
Keywords
- Conformal inference
- false discovery rate
- out-of-distribution
- positive dependence
All Science Journal Classification (ASJC) codes
- Statistics and Probability
- Statistics, Probability and Uncertainty