TY - GEN
T1 - Is a picture worth a thousand words? A deep multi-modal architecture for product classification in e-commerce
AU - Zahavy, Tom
AU - Krishnan, Abhinandan
AU - Magnani, Alessandro
AU - Mannor, Shie
N1 - Publisher Copyright: © 2018 Proceedings of the 30th Innovative Applications of Artificial Intelligence Conference, IAAI 2018. All rights reserved.
PY - 2018
Y1 - 2018
N2 - Classifying products precisely and efficiently is a major challenge in modern e-commerce. The high traffic of new products uploaded daily and the dynamic nature of the categories raise the need for machine learning models that can reduce the cost and time of human editors. In this paper, we propose a decision-level fusion approach for multi-modal product classification based on text and image neural network classifiers. We train input-specific state-of-the-art deep neural networks for each input source, show the potential of forging them together into a multi-modal architecture, and train a novel policy network that learns to choose between them. Finally, we demonstrate that our multi-modal network improves classification accuracy over both networks on a real-world large-scale product classification dataset that we collected from Walmart.com. While we focus on image-text fusion that characterizes e-commerce businesses, our algorithms can be easily applied to other modalities such as audio, video, physical sensors, etc.
UR - http://www.scopus.com/inward/record.url?scp=85091996402&partnerID=8YFLogxK
M3 - Conference contribution
T3 - Proceedings of the 30th Innovative Applications of Artificial Intelligence Conference, IAAI 2018
SP - 7873
EP - 7880
BT - Proceedings of the 30th Innovative Applications of Artificial Intelligence Conference, IAAI 2018
A2 - Youngblood, G. Michael
A2 - Myers, Karen
PB - The AAAI Press
T2 - 30th Innovative Applications of Artificial Intelligence Conference, IAAI 2018
Y2 - 2 February 2018 through 7 February 2018
ER -