What do text-to-image models know about the languages of the world?
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–1, Tome 529 (2023), pp. 157-175

See the article record from the source Math-Net.Ru

Text-to-image models generate images from user-written prompts. Models such as DALL-E 2, Imagen, Stable Diffusion, and Midjourney can produce photorealistic images or images that resemble human drawings. Beyond imitating human art, large text-to-image models have learned to produce pixel patterns reminiscent of captions in natural languages. For example, a generated image might contain a figure of an animal together with a sequence of symbols resembling human-readable words that give the biological name of the species in a natural language. Although the words that occasionally appear in generated images can be human-readable, they are not rooted in natural language vocabularies and make no sense to non-linguists. At the same time, we find that semiotic and linguistic analysis of the so-called hidden vocabulary of text-to-image models can contribute to the fields of explainable AI and prompt engineering. The results of this analysis can be used to reduce the risks of applying such models to real-life problem solving and to detect deepfakes. This study is one of the first attempts to analyze text-to-image models from the point of view of semiotics and linguistics. Our approach involves prompt engineering, image generation, and comparative analysis. The source code, generated images, and prompts are available at https://github.com/vifirsanova/text-to-image-explainable.
@article{ZNSL_2023_529_a10,
     author = {V. Firsanova},
     title = {What do text-to-image models know about the languages of the world?},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {157--175},
     publisher = {mathdoc},
     volume = {529},
     year = {2023},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a10/}
}
V. Firsanova. What do text-to-image models know about the languages of the world? Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–1, Tome 529 (2023), pp. 157-175. http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a10/