Semi-Supervised Linear Regression

David Azriel, Lawrence D. Brown, Michael Sklar, Richard Berk, Andreas Buja, Linda Zhao

Research output: Contribution to journal › Article › peer-review

Abstract

We study a regression problem where for some part of the data we observe both the label variable (Y) and the predictors (X), while for the other part of the data only the predictors are given. Such a problem arises, for example, when observations of the label variable are costly and may require a skilled human agent. When the conditional expectation E[Y|X] is not exactly linear, one can consider the best linear approximation to the conditional expectation, which can be estimated consistently by the least-squares estimates (LSE). The latter depend only on the labeled data. We suggest improved alternative estimates to the LSE that also use the unlabeled data. Our estimation method can be easily implemented and has simply described asymptotic properties. The new estimates asymptotically dominate the usual standard procedures under a certain non-linearity condition on E[Y|X]; otherwise, they are asymptotically equivalent. The performance of the new estimator for small sample sizes is investigated in an extensive simulation study. A real data example of inferring homeless population is used to illustrate the new methodology.
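The setting in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical data-generating model (a single standard normal predictor and a conditional expectation of Y given the predictor equal to x + x**2), which is not taken from the paper. The function names `simulate` and `least_squares` are likewise illustrative. The sketch only shows the baseline LSE fitted to labeled data and the best linear approximation it targets; it does not implement the authors' semi-supervised estimates.

```python
import numpy as np


def simulate(n, rng):
    """Draw n (X, Y) pairs from a hypothetical model with nonlinear E[Y | X]."""
    x = rng.normal(size=n)
    # Assumed conditional mean E[Y | X = x] = x + x**2 (not from the paper).
    y = x + x**2 + rng.normal(size=n)
    return x, y


def least_squares(x, y):
    """Ordinary least squares with an intercept; returns [intercept, slope]."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


rng = np.random.default_rng(0)

# Labeled sample: both X and Y are observed.
x_labeled, y_labeled = simulate(200, rng)
# Unlabeled sample: only X is observed. The LSE below never uses it.
x_unlabeled = rng.normal(size=5000)

beta_labeled = least_squares(x_labeled, y_labeled)

# For X ~ N(0, 1) and E[Y | X] = X + X**2, the best linear approximation has
# slope Cov(X, Y) / Var(X) = 1 and intercept E[Y] - slope * E[X] = 1.
beta_target = np.array([1.0, 1.0])

# A very large labeled sample shows the LSE approaching that target.
beta_large = least_squares(*simulate(200_000, rng))

print("LSE, n = 200:    ", beta_labeled)
print("LSE, n = 200000: ", beta_large)
print("target:          ", beta_target)
```

In this sketch the labeled-only fit discards `x_unlabeled` entirely, so whatever those predictors reveal about the distribution of X goes unused. That is the information the estimates proposed in the paper are meant to exploit.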

Original language: English
Pages (from-to): 2238-2251
Number of pages: 14
Journal: Journal of the American Statistical Association
Volume: 117
Issue number: 540
State: Published - 2022

Keywords

  • Linear regression
  • Misspecified models
  • Semi-supervised learning

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Statistics, Probability and Uncertainty
