Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate

Mor Shpigel Nacson, Nathan Srebro, Daniel Soudry

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do not hold for the monotone loss functions used for classification, such as the logistic loss. We prove our result for a fixed dataset, for sampling both with and without replacement. Furthermore, for the logistic loss (and similar exponentially-tailed losses), we prove that with SGD the weight vector converges in direction to the L2 max-margin vector at a rate of O(1/log(t)) for almost all separable datasets, and the loss converges as O(1/t), similarly to gradient descent. Lastly, we examine the case of a fixed learning rate proportional to the minibatch size. We prove that in this case, the asymptotic convergence rate of SGD (with replacement), measured in epochs, does not depend on the minibatch size if the support vectors span the data. These results may suggest an explanation for similar behaviors observed in deep networks trained with SGD.
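The following is a minimal sketch, not the authors' code, illustrating the setting the abstract describes: SGD with a fixed (non-vanishing) learning rate on a homogeneous linear classifier with the logistic loss, run on linearly separable data with sampling with replacement. The dataset, learning rate, and iteration counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable 2-D data: labels given by the sign of a fixed direction,
# with points too close to the decision boundary removed to enforce a margin.
n, d = 200, 2
X = rng.normal(size=(n, d))
w_ref = np.array([2.0, 1.0])          # illustrative separating direction
y = np.sign(X @ w_ref)
mask = np.abs(X @ w_ref) > 0.5
X, y = X[mask], y[mask]

def logistic_loss(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

w = np.zeros(d)
eta = 0.5                              # fixed (non-vanishing) learning rate
for t in range(1, 200001):
    i = rng.integers(len(X))           # sampling with replacement
    margin = y[i] * (X[i] @ w)
    grad = -y[i] * X[i] / (1.0 + np.exp(margin))   # gradient of the logistic loss
    w -= eta * grad
    if t in (10, 100, 1000, 10000, 100000, 200000):
        direction = w / np.linalg.norm(w)
        print(f"t={t:6d}  loss={logistic_loss(w):.3e}  direction={direction}")

# Expected behavior, per the paper: the loss decays roughly as O(1/t) despite the
# fixed step size, while the normalized weight vector drifts slowly (at rate
# O(1/log t)) toward the L2 max-margin direction of the data.
```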

Original language: English
Title of host publication: Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate
State: Published - 2019
Event: 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019 - Naha, Japan
Duration: 16 Apr 2019 → 18 Apr 2019

Conference

Conference: 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019
Country/Territory: Japan
City: Naha
Period: 16/04/19 → 18/04/19

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Statistics and Probability
