TY - GEN
T1 - Joint optimization for cooperative image captioning
AU - Vered, Gilad
AU - Oren, Gal
AU - Atzmon, Yuval
AU - Chechik, Gal
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/10
Y1 - 2019/10
N2 - When describing images with natural language, descriptions can be made more informative if tuned for downstream tasks. This can be achieved by training two networks: a 'speaker' that generates sentences given an image and a 'listener' that uses them to perform a task. Unfortunately, training multiple networks jointly to communicate faces two major challenges. First, the descriptions generated by a speaker network are discrete and stochastic, making optimization very hard and inefficient. Second, joint training usually causes the vocabulary used during communication to drift and diverge from natural language. To address these challenges, we present an effective optimization technique based on partial sampling from a multinomial distribution combined with straight-through gradient updates, which we name PSST for Partial-Sampling Straight-Through. We then show that the generated descriptions can be kept close to natural language by constraining them to be similar to human descriptions. Together, this approach creates descriptions that are both more discriminative and more natural than those of previous approaches. Evaluations on the COCO benchmark show that PSST improves recall@10 from 60% to 86% while maintaining comparable language naturalness. Human evaluations show that it also increases naturalness while keeping the discriminative power of generated captions.
AB - When describing images with natural language, descriptions can be made more informative if tuned for downstream tasks. This can be achieved by training two networks: a 'speaker' that generates sentences given an image and a 'listener' that uses them to perform a task. Unfortunately, training multiple networks jointly to communicate faces two major challenges. First, the descriptions generated by a speaker network are discrete and stochastic, making optimization very hard and inefficient. Second, joint training usually causes the vocabulary used during communication to drift and diverge from natural language. To address these challenges, we present an effective optimization technique based on partial sampling from a multinomial distribution combined with straight-through gradient updates, which we name PSST for Partial-Sampling Straight-Through. We then show that the generated descriptions can be kept close to natural language by constraining them to be similar to human descriptions. Together, this approach creates descriptions that are both more discriminative and more natural than those of previous approaches. Evaluations on the COCO benchmark show that PSST improves recall@10 from 60% to 86% while maintaining comparable language naturalness. Human evaluations show that it also increases naturalness while keeping the discriminative power of generated captions.
UR - http://www.scopus.com/inward/record.url?scp=85081914971&partnerID=8YFLogxK
U2 - https://doi.org/10.1109/iccv.2019.00899
DO - 10.1109/iccv.2019.00899
M3 - Conference contribution
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 8897
EP - 8906
BT - Proceedings - 2019 International Conference on Computer Vision, ICCV 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE/CVF International Conference on Computer Vision, ICCV 2019
Y2 - 27 October 2019 through 2 November 2019
ER -