What does learning to model relationships between strings teach Large Language Models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
We test LLMs' abilities to generate visual concepts of increasing complexity via a textual prompt → code → image procedure, and find that LLMs can visualize real-world concepts from across the visual hierarchy. LLMs are capable of generating non-trivial visual compositions: the model composes two unrelated concepts (a “car-shaped cake”), generates visual phenomena (a “blurred image”), and correctly interprets spatial relations (e.g., “a row of bicycles” arranged horizontally).
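For concreteness, the sketch below shows one way such a prompt → code → image pipeline can be wired up. The `query_llm` helper is a hypothetical stand-in for any chat-completion API (here it returns a canned matplotlib snippet so the script runs end-to-end); the prompt wording and the matplotlib rendering backend are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the prompt -> code -> image pipeline (illustrative only).
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; replace with your provider's API."""
    # Canned response illustrating the kind of drawing code an LLM might return.
    return (
        "fig, ax = plt.subplots(figsize=(4, 4))\n"
        "ax.add_patch(plt.Circle((0.5, 0.6), 0.2, color='orange'))  # sun\n"
        "ax.add_patch(plt.Rectangle((0.0, 0.0), 1.0, 0.3, color='green'))  # ground\n"
        "ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.axis('off')\n"
    )


def concept_to_image(concept: str, out_path: str) -> None:
    # 1) Ask the model for drawing code that depicts the concept.
    code = query_llm(f"Write matplotlib code that draws: {concept}")
    # 2) Execute the returned code to render the figure.
    scope = {"plt": plt}
    exec(code, scope)  # assumes trusted output; real use needs sandboxing
    # 3) Save the rendered figure as a raster image.
    scope["fig"].savefig(out_path, dpi=128)
    plt.close(scope["fig"])


concept_to_image("a sunset over a field", "sunset.png")
```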
We demonstrate that the visual generation competence of a language model can be improved using text-based corrections. We do this by closing the feedback loop between the LLM and itself: the language model first generates code illustrating a concept, and is then repeatedly queried, conditioned on its previously generated code, with a prompt to “improve its generated code.” We find that making such iterative calls to the model results in improved visual depictions.
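A hedged sketch of this feedback loop is shown below, reusing the hypothetical `query_llm` stand-in from the previous sketch; the prompt wording and the number of refinement rounds are assumptions, not the paper's exact protocol.

```python
# Iterative text-only refinement: feed the previous code draft back to the model
# and ask for an improved version. (Assumes the `query_llm` stand-in defined above.)
def iterative_refinement(concept: str, rounds: int = 3) -> list[str]:
    """Return the sequence of code drafts, one per feedback round."""
    code = query_llm(f"Write matplotlib code that draws: {concept}")
    drafts = [code]
    for _ in range(rounds):
        # Condition the next generation on the previous draft and request an improvement.
        prompt = (
            f"The following code is meant to draw: {concept}\n\n{code}\n\n"
            "Improve the generated code so the drawing better depicts the concept. "
            "Return only the updated code."
        )
        code = query_llm(prompt)
        drafts.append(code)
    return drafts
```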
We study whether LLM-generated images can serve as a data source for pre-training vision models, comparing them to synthetically generated and natural images. We show that models trained entirely on procedurally generated data from LLMs can make semantic judgments about natural images despite never having seen one.
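The sketch below shows one plausible setup for such pretraining: a SimCLR-style contrastive objective over two augmented views of each LLM-generated image. The directory layout, encoder choice, loss, and hyperparameters are assumptions for illustration; the paper's actual pretraining recipe may differ.

```python
# Contrastive self-supervised pretraining on a folder of LLM-generated images (sketch).
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(128, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

class TwoViews:
    """Produce two independent augmentations of the same generated image."""
    def __call__(self, img):
        return augment(img), augment(img)

# Assumed layout: llm_generated_images/<any_subdir>/*.png (labels are ignored).
dataset = datasets.ImageFolder("llm_generated_images/", transform=TwoViews())
loader = DataLoader(dataset, batch_size=256, shuffle=True)

encoder = models.resnet18(num_classes=128)  # final fc acts as the projection head

def nt_xent(z1, z2, tau=0.2):
    """Normalized-temperature cross entropy over the 2N views in a batch."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.T / tau
    sim.fill_diagonal_(float("-inf"))  # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for (v1, v2), _ in loader:  # ImageFolder labels are ignored
    loss = nt_xent(encoder(v1), encoder(v2))
    opt.zero_grad(); loss.backward(); opt.step()
```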
@inproceedings{sharma2024vision,
title={A Vision Check-up for Language Models},
author={Sharma, Pratyusha and Rott Shaham, Tamar and Baradad, Manel and Fu, Stephanie and Rodriguez-Munoz, Adrian and Duggal, Shivam and Isola, Phillip and Torralba, Antonio},
booktitle={arXiv preprint},
year={2024}
}