What do text-to-image models know about the languages of the world?
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–1, Volume 529 (2023), pp. 157-175
This article was harvested from the source Math-Net.Ru

Text-to-image models produce images from user-written prompts. Models such as DALL-E 2, Imagen, Stable Diffusion, and Midjourney can generate images that are photorealistic or resemble human-drawn art. Beyond imitating human art, large text-to-image models have learned to produce pixel patterns reminiscent of captions in natural languages. For example, a generated image might contain the figure of an animal together with a string of symbols that resembles human-readable words, as if naming the species. Although the words that occasionally appear on generated images can be human-readable, they are not rooted in the vocabulary of any natural language and make no sense to non-linguists. At the same time, we find that a semiotic and linguistic analysis of this so-called hidden vocabulary of text-to-image models can contribute to explainable AI and prompt engineering. The results of such an analysis can help reduce the risks of applying these models to real-life problem solving and support deepfake detection. The proposed study is one of the first attempts to analyze text-to-image models from the point of view of semiotics and linguistics. Our approach combines prompt engineering, image generation, and comparative analysis. The source code, generated images, and prompts are available at https://github.com/vifirsanova/text-to-image-explainable.
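To make the comparative setup concrete, the sketch below shows how the same prompt, written in several languages and writing systems, could be fed to an open text-to-image pipeline and the outputs saved for side-by-side inspection. It is a minimal illustration only, assuming the Hugging Face diffusers StableDiffusionPipeline API; the model identifier and the prompt wording are placeholders and not taken from the paper, whose actual experiment code is in the repository linked above.

# Minimal sketch: generate images from the same prompt written in several
# languages, then inspect any text-like artifacts that appear in the outputs.
# Assumes the Hugging Face `diffusers` StableDiffusionPipeline API; the model
# ID and the prompts are illustrative placeholders, not the paper's exact setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # placeholder checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompts = {
    "en": "a bird with its species name written below it",
    "ru": "птица с подписью её биологического названия",
    "ja": "鳥とその種名を書いた画像",
}

for lang, prompt in prompts.items():
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"bird_{lang}.png")      # saved for manual comparative analysis

Image sets produced this way, covering Latin, Cyrillic, CJK, and other scripts, can then be examined manually for pseudo-text, which is the kind of comparison the abstract describes.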
@article{ZNSL_2023_529_a10,
     author = {V. Firsanova},
     title = {What do text-to-image models know about the languages of the world?},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {157--175},
     year = {2023},
     volume = {529},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a10/}
}
V. Firsanova. What do text-to-image models know about the languages of the world? Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–1, Volume 529 (2023), pp. 157-175. http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a10/
