Abstract
Controllable image captioning has recently gained attention as a way to make image captioning algorithms more diverse and more applicable to real-world scenarios. In this task, a captioner is conditioned on an external control signal that must be followed while the caption is generated. We aim to overcome the limitations of current controllable captioning methods by proposing a fully-attentive, iterative network that generates grounded and controllable captions from a control signal given as a sequence of visual regions from the image. Our architecture is based on a set of novel attention operators that take into account the hierarchical nature of the control signal, and it is endowed with a decoder that explicitly focuses on each part of the control signal. We demonstrate the effectiveness of the proposed approach through experiments on three datasets, where our model surpasses the performance of previous methods and achieves a new state of the art on both image and video controllable captioning.
Original language | English |
---|---|
Article number | 103857 |
Journal | Computer Vision and Image Understanding |
Volume | 237 |
State | Published - Dec 2023 |
Keywords
- Controllable captioning
- Image captioning
- Video captioning
- Vision-and-language
All Science Journal Classification (ASJC) codes
- Software
- Signal Processing
- Computer Vision and Pattern Recognition