Fully-attentive iterative networks for region-based controllable image and video captioning

Marcella Cornia, Lorenzo Baraldi, Ayellet Tal, Rita Cucchiara

Research output: Contribution to journal › Article › peer-review


Controllable image captioning has recently gained attention as a way to increase both the diversity of image captioning algorithms and their applicability to real-world scenarios. In this task, a captioner is conditioned on an external control signal, which it must follow while generating the caption. We aim to overcome the limitations of current controllable captioning methods by proposing a fully-attentive and iterative network that can generate grounded and controllable captions from a control signal given as a sequence of visual regions from the image. Our architecture is based on a set of novel attention operators, which take into account the hierarchical nature of the control signal, and is endowed with a decoder that explicitly focuses on each part of the control signal. We demonstrate the effectiveness of the proposed approach by conducting experiments on three datasets, where our model surpasses the performance of previous methods and achieves a new state of the art on both image and video controllable captioning.

Original language: English
Article number: 103857
Journal: Computer Vision and Image Understanding
State: Published - Dec 2023


  • Controllable captioning
  • Image captioning
  • Video captioning
  • Vision-and-language

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition

