TY - GEN
T1 - Curating Datasets for Better Performance with Example Training Dynamics
AU - Sar-Shalom, Aviad
AU - Schwartz, Roy
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - The landscape of NLP research is dominated by large-scale models training on colossal datasets, relying on data quantity rather than quality. As an alternative to this landscape, we propose a method for weighing the relative importance of examples in a dataset based on their Example Training dynamics (ETD; Swayamdipta et al., 2020), a set of metrics computed during training. We propose a new way of computing the ETD of a dataset, and show that they can be used to improve performance in both in-distribution and out-of-distribution testing. We show that ETD can be transferable, i.e., they can be computed once and used for training different models, effectively reducing their computation cost. Finally, we suggest an active learning approach for computing ETD during training rather than as a preprocessing step-an approach that is not as effective, but dramatically reduces the extra computational costs.
AB - The landscape of NLP research is dominated by large-scale models training on colossal datasets, relying on data quantity rather than quality. As an alternative to this landscape, we propose a method for weighing the relative importance of examples in a dataset based on their Example Training dynamics (ETD; Swayamdipta et al., 2020), a set of metrics computed during training. We propose a new way of computing the ETD of a dataset, and show that they can be used to improve performance in both in-distribution and out-of-distribution testing. We show that ETD can be transferable, i.e., they can be computed once and used for training different models, effectively reducing their computation cost. Finally, we suggest an active learning approach for computing ETD during training rather than as a preprocessing step-an approach that is not as effective, but dramatically reduces the extra computational costs.
UR - http://www.scopus.com/inward/record.url?scp=85174887366&partnerID=8YFLogxK
M3 - منشور من مؤتمر
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 10597
EP - 10608
BT - Findings of the Association for Computational Linguistics, ACL 2023
PB - Association for Computational Linguistics (ACL)
T2 - 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
Y2 - 9 July 2023 through 14 July 2023
ER -