Abstract
We analyze the sample complexity of full-batch Gradient Descent (GD) in the setup of non-smooth Stochastic Convex Optimization. We show that the generalization error of GD, with a common choice of hyper-parameters, can be Θ̃(d/m + 1/√m), where d is the dimension and m is the sample size. This matches the sample complexity of worst-case empirical risk minimizers, meaning that, in contrast with other algorithms, GD has no advantage over naive ERMs. Our bound follows from a new generalization bound that depends on the dimension as well as on the learning rate and number of iterations. Our bound also shows that, for general hyper-parameters, when the dimension is strictly larger than the number of samples, T = Ω(1/ε⁴) iterations are necessary to avoid overfitting. This resolves an open problem posed by Amir, Koren, and Livni [3] and by Schliserman, Sherman, and Koren [20], and improves over previous lower bounds, which showed that the sample size must be at least the square root of the dimension.
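For reference, the two bounds stated above can be collected in display form; the symbol ε_gen below is an illustrative name (not used in the abstract) for the generalization error of GD run for T iterations on m samples in dimension d:

\[
\epsilon_{\mathrm{gen}}(\mathrm{GD}) \;=\; \tilde{\Theta}\!\left(\frac{d}{m} + \frac{1}{\sqrt{m}}\right),
\qquad
T \;=\; \Omega\!\left(\frac{1}{\epsilon^{4}}\right) \ \text{when } d > m,
\]

where the first statement holds under the common choice of hyper-parameters, and the second gives the number of iterations needed to avoid overfitting for general hyper-parameters.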
| Original language | English |
| --- | --- |
| Journal | Advances in Neural Information Processing Systems |
| Volume | 37 |
| State | Published - 2024 |
| Event | 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, Canada. Duration: 9 Dec 2024 → 15 Dec 2024 |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Information Systems
- Signal Processing