@article{ZNSL_2023_529_a11,
author = {A. Alekseev and A. Savchenko and E. Tutubalina and E. Myasnikov and S. Nikolenko},
title = {Blending of predictions boosts understanding for multimodal advertisements},
journal = {Zapiski Nauchnykh Seminarov POMI},
pages = {176--196},
year = {2023},
volume = {529},
language = {en},
url = {http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a11/}
}
TY - JOUR
AU - A. Alekseev
AU - A. Savchenko
AU - E. Tutubalina
AU - E. Myasnikov
AU - S. Nikolenko
TI - Blending of predictions boosts understanding for multimodal advertisements
JO - Zapiski Nauchnykh Seminarov POMI
PY - 2023
SP - 176
EP - 196
VL - 529
UR - http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a11/
LA - en
ID - ZNSL_2023_529_a11
ER -
%0 Journal Article
%A A. Alekseev
%A A. Savchenko
%A E. Tutubalina
%A E. Myasnikov
%A S. Nikolenko
%T Blending of predictions boosts understanding for multimodal advertisements
%J Zapiski Nauchnykh Seminarov POMI
%D 2023
%P 176-196
%V 529
%U http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a11/
%G en
%F ZNSL_2023_529_a11
A. Alekseev; A. Savchenko; E. Tutubalina; E. Myasnikov; S. Nikolenko. Blending of predictions boosts understanding for multimodal advertisements. Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–1, Vol. 529 (2023), pp. 176-196. http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a11/
[1] K. Ahuja, K. Sikka, A. Roy, A. Divakaran, Understanding visual ads by aligning symbols and objects using co-attention, 2018, arXiv: 1807.01448
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, “VQA: Visual question answering”, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, 2425–2433
[3] T. Baltrušaitis, C. Ahuja, L.-P. Morency, “Multimodal machine learning: A survey and taxonomy”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2 (2018), 423–443
[4] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, “Enriching word vectors with subword information”, Transactions of the Association for Computational Linguistics, 5 (2017), 135–146
[5] F. de Saussure, Course in general linguistics, trans. Roy Harris, Duckworth, London, 1983
[6] P. Demochkina, A. V. Savchenko, “MobileEmotiFace: Efficient facial image representations in video-based emotion recognition on mobile devices”, Pattern Recognition. ICPR International Workshops and Challenges (Virtual Event, January 10–15, 2021), Proceedings, v. V, Springer, 2021, 266–274
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018
[8] A. U. Dey, S. K. Ghosh, E. Valveny, Don't only feel read: Using scene text to understand advertisements, 2018, arXiv: 1806.08279
[9] A. U. Dey, S. K. Ghosh, E. Valveny, G. Harit, Beyond visual semantics: Exploring the role of scene text in image understanding, 2019
[10] K. He, X. Zhang, S. Ren, J. Sun, “Deep residual learning for image recognition”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, 770–778
[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017, arXiv: 1704.04861
[12] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, A. Kovashka, “Automatic understanding of image and video advertisements”, Conference on Computer Vision and Pattern Recognition (CVPR), 2017, 1705–1715
[13] V. V. Ivanov, E. V. Tutubalina, N. R. Mingazov, I. S. Alimova, “Extracting aspects, sentiment and categories of aspects in user reviews about restaurants and cars”, Komp'juternaja Lingvistika i Intellektual'nye Tehnologii, 2015, 22–33
[14] S. Jabeen, X. Li, M. S. Amin, O. Bourahla, S. Li, A. Jabbar, “A review on methods and applications in multimodal deep learning”, ACM Transactions on Multimedia Computing, Communications and Applications, 19:2s (2023), 1–41
[15] JaidedAI, EasyOCR: Ready-to-use OCR with 70+ languages supported including Chinese, Japanese, Korean and Thai, 2020, https://github.com/JaidedAI/EasyOCR
[16] K. Kalra, B. Kurma, S. V. Sreelatha, M. Patwardhan, S. Karande, “Understanding advertisements with BERT”, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, 7542–7547
[17] A. Karpov, I. Makarov, “Exploring efficiency of vision transformers for self-supervised monocular depth estimation”, Proceedings of International Symposium on Mixed and Augmented Reality (ISMAR), IEEE, 2022, 711–719
[18] Ya. I. Khokhlova, A. V. Savchenko, “About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems”, Optical Memory and Neural Networks, 23 (2014), 34–42
[19] D. Kiela, S. Bhooshan, H. Firooz, D. Testuggine, Supervised multimodal bitransformers for classifying images and text, 2019, arXiv: 1909.02950
[20] D. P. Kingma, J. Ba, “Adam: A method for stochastic optimization”, International Conference on Learning Representations (ICLR), 2015
[21] L. Kopeykina, A. V. Savchenko, “Automatic privacy detection in scanned document images based on deep neural networks”, Proceedings of the International Russian Automation Conference (RusAutoCon), IEEE, 2019, 1–6
[22] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, VisualBERT: A simple and performant baseline for vision and language, 2019, arXiv: 1908.03557
[23] P. P. Liang, Z. Liu, Y.-H. H. Tsai, Q. Zhao, R. Salakhutdinov, L.-P. Morency, “Learning representations from imperfect time series data via tensor rank regularization”, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1569–1576
[24] P. P. Liang, Z. Liu, A. B. Zadeh, L.-P. Morency, “Multimodal language analysis with recurrent multistage fusion”, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, 150–161
[25] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, J. Fan, Z. He, “A survey of visual transformers”, IEEE Transactions on Neural Networks and Learning Systems, 2023
[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019
[27] D. McDuff, R. El Kaliouby, J. F. Cohn, R. W. Picard, “Predicting ad liking and purchase intent: Large-scale analysis of facial responses to ads”, IEEE Transactions on Affective Computing, 6:3 (2014), 223–235
[28] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013, arXiv: 1301.3781
[29] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, “Distributed representations of words and phrases and their compositionality”, Advances in Neural Information Processing Systems, 2013, 3111–3119
[30] S. Mishra, M. Verma, Y. Zhou, K. Thadani, W. Wang, “Learning to create better ads: Generation and ranking approaches for ad creative refinement”, CIKM '20: Proceedings of the 29th ACM International Conference on Information Knowledge Management, 2020, 2653–2660
[31] L. C. Olson, C. A. Finnegan, D. S. Hope, Visual rhetoric: A reader in communication and American culture, Sage, 2008
[32] OpenAI, GPT-4 technical report, 2023, arXiv: 2303.08774
[33] M. Otani, Y. Iwazaki, K. Yamaguchi, Unreasonable effectiveness of OCR in visual advertisement understanding, 2018
[34] R. Panda, J. Zhang, H. Li, J.-Y. Lee, X. Lu, A. K. Roy-Chowdhury, “Contemplating visual emotions: Understanding and overcoming dataset bias”, Proceedings of European Conference on Computer Vision, ECCV, eds. V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss, Springer International Publishing, Cham, 2018, 594–612
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, “Scikit-learn: Machine learning in Python”, J. Machine Learning Research, 12 (2011), 2825–2830
[36] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models”, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, 2641–2649
[37] K. Poels, S. Dewitte, “How to capture the heart? Reviewing 20 years of emotion measurement in advertising”, J. Advertising Research, 46:1 (2006), 18–37
[38] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision”, Proceedings of International Conference on Machine Learning (ICML), PMLR, 2021, 8748–8763
[39] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, “Robust speech recognition via large-scale weak supervision”, Proceedings of International Conference on Machine Learning (ICML), PMLR, 2023, 28492–28518
[40] T. Rajapakse, Simple transformers, 2020
[41] N. Rusnachenko, N. Loukachevitch, E. Tutubalina, “Distant supervision for sentiment attitude extraction”, Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP 2019 (Varna, Bulgaria, September 2019), eds. R. Mitkov, G. Angelova, INCOMA Ltd., 2019, 1022–1030
[42] A. Sakhovskiy, Z. Miftahutdinov, E. Tutubalina, “KFU NLP team at SMM4H 2021 tasks: Cross-lingual and cross-modal BERT-based models for adverse drug effects”, Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task, 2021, 39–43
[43] A. Sakhovskiy, E. Tutubalina, “Multimodal model with text and drug embeddings for adverse drug reaction classification”, J. Biomedical Informatics, 135 (2022), 104182
[44] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, 4510–4520
[45] A. Savchenko, “Facial expression recognition with adaptive frame rate based on multiple testing correction”, International Conference on Machine Learning, PMLR, 2023, 30119–30129
[46] A. Savchenko, A. Alekseev, S. Kwon, E. Tutubalina, E. Myasnikov, S. Nikolenko, “Ad lingua: Text classification improves symbolism prediction in image advertisements”, Proceedings of the 28th International Conference on Computational Linguistics, 2020, 1886–1892
[47] A. V. Savchenko, “MT-EmotiEffNet for multi-task human affective behavior analysis and learning from synthetic data”, Proceedings of European Conference on Computer Vision Workshops, ECCVW, Springer, 2022, 45–59
[48] A. V. Savchenko, K. V. Demochkin, I. S. Grechikhin, “Preference prediction based on a photo gallery analysis with scene recognition and object detection”, Pattern Recognition, 121 (2022), 108248
[49] V. V. Savchenko, A. V. Savchenko, “Criterion of significance level for selection of order of spectral estimation of entropy maximum”, Radioelectronics and Communications Systems, 62:5 (2019), 223–231
[50] A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra, D. Parikh, MMF: A multimodal framework for vision and language research, 2020, https://github.com/facebookresearch/mmf
[51] R. Smith, “An overview of the Tesseract OCR engine”, Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, v. 2, IEEE, 2007, 629–633
[52] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, Y. Artzi, “A corpus for reasoning about natural language grounded in photographs”, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 6418–6428
[53] M. Tan, Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks”, Proceedings of the 36th International Conference on Machine Learning, 2019, 6105–6114
[54] Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency, R. Salakhutdinov, “Learning factorized multimodal representations”, International Conference on Learning Representations (ICLR), 2018
[55] E. Tutubalina, S. Nikolenko, “Inferring sentiment-based priors in topic models”, Mexican International Conference on Artificial Intelligence, Springer, 2015, 92–104
[56] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, S. Shao, Shape robust text detection with progressive scale expansion network, 2019
[57] J. Williamson, Decoding advertisements, 1978
[58] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art natural language processing, 2019, arXiv: 1910.03771
[59] L. Xiao, X. Li, Y. Zhang, “Exploring the factors influencing consumer engagement behavior regarding short-form video advertising: A big data perspective”, J. Retailing and Consumer Services, 70 (2023), 103170
[60] L. Xing, Z. Tian, W. Huang, M. R. Scott, “Convolutional character networks”, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, 9126–9136
[61] J. Yang, Q. Huang, T. Ding, D. Lischinski, D. Cohen-Or, H. Huang, “EmoSet: A large-scale visual emotion dataset with rich attributes”, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, 20383–20394
[62] K. Ye, N. Honarvar Nazari, J. Hahn, Z. Hussain, M. Zhang, A. Kovashka, “Interpreting the rhetoric of visual advertisements”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 1–1
[63] K. Ye, K. Buettner, A. Kovashka, Story understanding in video advertisements, 2018, arXiv: 1807.11122
[64] K. Ye, A. Kovashka, “Advise: Symbolism and external knowledge for decoding advertisements”, Proceedings of the European Conference on Computer Vision (ECCV), 2018, 837–855
[65] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, L.-P. Morency, “Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph”, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, v. 1, Long Papers, 2018, 2236–2246
[66] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, “From recognition to cognition: Visual commonsense reasoning”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, 6720–6731
[67] M. Zhang, R. Hwa, A. Kovashka, Equal but not the same: Understanding the implicit relationship between persuasive images and text, 2018, arXiv: 1807.08205
[68] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, “EAST: an efficient and accurate scene text detector”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, 5551–5560
[69] Y. Zhou, S. Mishra, M. Verma, N. Bhamidipati, W. Wang, “Recommending themes for ad creative design via visual-linguistic representations”, Proceedings of The Web Conference (WWW), 2020, 2521–2527