CLID: Controlled-Length Image Descriptions with Limited Data

Elad Hirsch, Ayellet Tal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Controllable image captioning models generate human-like image descriptions while allowing control over properties of the generated captions. This paper focuses on controlling caption length, i.e., producing either a short and concise description or a long and detailed one. Since existing image captioning datasets contain mostly short captions, generating long captions is challenging. To address the shortage of long training examples, we propose to enrich the dataset with self-generated captions of varying lengths. These captions, however, may vary in quality and are thus unsuitable for conventional training. We introduce a novel training strategy that selects which data points to use at different stages of training. Our method dramatically improves length-control abilities, while exhibiting SoTA performance in terms of caption quality. Our approach is general and is shown to be applicable to paragraph generation as well. Our code is publicly available.
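The training strategy sketched in the abstract — always training on ground-truth captions while admitting self-generated captions gradually, better ones first — can be illustrated with a minimal curriculum-style selection routine. This is a hypothetical sketch, not the authors' implementation: the `quality` field, the linear admission schedule, and all function names are assumptions for illustration only.

```python
def select_training_pool(gt_samples, synth_samples, epoch, total_epochs):
    """Curriculum-style data selection (illustrative sketch, not the
    paper's actual method).

    Ground-truth captions are always used; self-generated (synthetic)
    captions are admitted gradually as training progresses, highest
    'quality' score first. 'quality' is a hypothetical per-sample
    score in [0, 1]; the paper's real selection criterion may differ.
    """
    # Fraction of synthetic data admitted grows linearly with epoch.
    admit_frac = (epoch + 1) / total_epochs
    # Rank synthetic captions by quality, best first.
    ranked = sorted(synth_samples, key=lambda s: s["quality"], reverse=True)
    k = int(len(ranked) * admit_frac)
    return gt_samples + ranked[:k]


# Toy example: one ground-truth caption, two self-generated ones.
gt = [{"caption": "a dog", "quality": 1.0}]
synth = [
    {"caption": "a long, detailed description of a dog", "quality": 0.9},
    {"caption": "noisy self-generated caption", "quality": 0.2},
]

early_pool = select_training_pool(gt, synth, epoch=0, total_epochs=2)
late_pool = select_training_pool(gt, synth, epoch=1, total_epochs=2)
```

Early in training only the highest-quality synthetic caption joins the pool; by the final epoch all synthetic captions are admitted, matching the idea of deferring lower-quality self-generated data to later stages.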

Original language: English
Title of host publication: Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
Pages: 5519-5529
Number of pages: 11
ISBN (Electronic): 9798350318920
DOIs
State: Published - 2024
Event: 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024 - Waikoloa, United States
Duration: 4 Jan 2024 - 8 Jan 2024

Publication series

Name: Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024

Conference

Conference: 2024 IEEE Winter Conference on Applications of Computer Vision, WACV 2024
Country/Territory: United States
City: Waikoloa
Period: 4/01/24 - 8/01/24

Keywords

  • Algorithms
  • Vision + language and/or other modalities

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Computer Science Applications