TY - GEN
T1 - Navigating the Modern Evaluation Landscape
T2 - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Tutorial Summaries
AU - Choshen, Leshem
AU - Gera, Ariel
AU - Perlitz, Yotam
AU - Shmueli-Scheuer, Michal
AU - Stanovsky, Gabriel
N1 - Publisher Copyright: © 2024 ELRA Language Resource Association.
PY - 2024
Y1 - 2024
N2 - General-purpose language models have changed the world of natural language processing, if not the world itself. The evaluation of such versatile models, while supposedly similar to the evaluation of generation models before them, in fact presents a host of new challenges and opportunities. This tutorial welcomes people from diverse backgrounds and assumes little familiarity with metrics, datasets, prompts and benchmarks. It will lay the foundations and explain the basics and their importance, while touching on the major points and breakthroughs of the recent era of evaluation. We will contrast new approaches with old ones, from evaluating on multi-task benchmarks rather than on dedicated datasets to operating under efficiency constraints, and from testing the stability of prompts in in-context learning to using the models themselves as evaluation metrics. Finally, we will present a host of open research questions in the field of robust, efficient, and reliable evaluation.
AB - General-purpose language models have changed the world of natural language processing, if not the world itself. The evaluation of such versatile models, while supposedly similar to the evaluation of generation models before them, in fact presents a host of new challenges and opportunities. This tutorial welcomes people from diverse backgrounds and assumes little familiarity with metrics, datasets, prompts and benchmarks. It will lay the foundations and explain the basics and their importance, while touching on the major points and breakthroughs of the recent era of evaluation. We will contrast new approaches with old ones, from evaluating on multi-task benchmarks rather than on dedicated datasets to operating under efficiency constraints, and from testing the stability of prompts in in-context learning to using the models themselves as evaluation metrics. Finally, we will present a host of open research questions in the field of robust, efficient, and reliable evaluation.
KW - Benchmarks
KW - Language models
KW - efficient evaluation
KW - language models as metrics
UR - http://www.scopus.com/inward/record.url?scp=85195146523&partnerID=8YFLogxK
M3 - Conference contribution
T3 - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Tutorial Summaries
SP - 19
EP - 25
BT - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Tutorial Summaries
A2 - Klinger, Roman
A2 - Okazaki, Naoaki
Y2 - 20 May 2024 through 25 May 2024
ER -