What does learning to model relationships between strings teach Large Language Models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.

Vision check-up for LLMs. I. Testing the visual knowledge of Language Models. We propose a set of tests to probe the vision abilities of language models: (a) writing code that renders complex visual concepts, (b) recognizing visual concepts from code, and (c) correcting rendering code with text-only self-feedback. II. We test whether LLMs can generate data to train a high-performance vision system that can be used to make semantic judgments on natural images.

Generation: Drawing with Text

We test LLMs' abilities to generate visual concepts of increasing complexity via a textual prompt → code → image procedure, and find that LLMs can visualize real-world concepts from across the visual hierarchy. LLMs are capable of generating non-trivial visual compositions; the model composes two unrelated concepts ("car shaped cake"), generates visual phenomena ("blurred image"), and manages to correctly interpret spatial relations (e.g. "a row of bicycles" arranged horizontally).
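To make the prompt → code → image procedure concrete, here is a minimal, hypothetical sketch of the kind of drawing program an LLM might emit for a simple concept. The scene, function name, and numpy-only rasterization are illustrative assumptions, not the paper's actual prompts or rendering stack:

```python
import numpy as np

def render_sun_over_horizon(size=64):
    """Rasterize a toy 'sun over a horizon' scene into an RGB array,
    illustrating how code can stand in for pixels."""
    img = np.zeros((size, size, 3), dtype=np.uint8)
    img[:] = (135, 206, 235)                       # sky-blue background
    img[size // 2:, :] = (34, 139, 34)             # green ground on the lower half
    yy, xx = np.mgrid[0:size, 0:size]
    sun = (yy - size // 4) ** 2 + (xx - 3 * size // 4) ** 2 <= (size // 8) ** 2
    img[sun] = (255, 215, 0)                       # golden sun disk
    return img

img = render_sun_over_horizon()
print(img.shape)  # (64, 64, 3)
```

Executing such code yields an image array that can then be scored for how well it depicts the prompted concept.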

The Visual Aptitude Dataset. Captions for scenes (left to right, top to bottom): Chef standing next to a counter with jars; Office with leather couch, surrounded by books; Row of bicycles; Birthday boy with car-shaped cake & candles; Black & white cat sitting on side of a computer monitor; Couple of men herding

Textual feedback: Correcting with Text

Improved visual generation with text feedback. The improvement in the visual generation of models due to feedback is often gradual, with a few features added at a time over the course of the feedback process.

We demonstrate that the visual generation competence of a language model can be improved using text-based corrections. We do this by closing the feedback loop between the LLM and itself. First, we use the language model to generate code illustrating a concept. Then, the model is repeatedly called, conditioning each generation on its previously generated code with the prompt to "improve its generated code". We find that making such iterative calls to the model results in improved visual depictions.
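The self-feedback loop can be sketched as follows. Here `call_llm` is a hypothetical stand-in for a real model API; this stub simply appends one drawing call per round, mirroring the gradual, few-features-at-a-time improvement described above:

```python
def call_llm(prompt: str, code: str) -> str:
    """Hypothetical stand-in for an LLM call. A real system would send the
    prompt plus the current code and return improved code; this stub appends
    one refinement per round purely to illustrate the loop's structure."""
    refinements = ["draw_wheels()\n", "draw_frame()\n", "draw_handlebars()\n"]
    step = code.count("\n")  # crude progress measure for the stub
    return code + refinements[step] if step < len(refinements) else code

def improve_with_feedback(concept: str, rounds: int = 3) -> str:
    # Initial generation: code illustrating the concept.
    code = call_llm(f"Write code that draws: {concept}", "")
    # Iteratively condition on the previous code and ask for improvement.
    for _ in range(rounds):
        code = call_llm("Improve your generated code.", code)
    return code

final = improve_with_feedback("a bicycle")
print(final.count("\n"))  # 3
```

In the paper's actual setup both roles are played by the same LLM; the loop terminates when further calls stop changing the code or a round budget is exhausted.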

Learning a Vision System from Text

We study whether LLM-generated images can serve as a data source for pre-training vision models, and compare them to synthetically generated and natural images. We show that models trained entirely on procedurally generated data from LLMs can make semantic judgments on natural images despite never having seen one before.
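One way such a training corpus can be assembled is to execute many LLM-written drawing programs and take two random crops of each rendering as a positive pair for contrastive self-supervised pretraining. The snippets below are illustrative stand-ins for LLM output, and the two-crop recipe is a standard contrastive setup, not necessarily the paper's exact one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for LLM-generated drawing programs: each snippet
# draws into a numpy canvas named `img` when executed.
snippets = [
    "img[8:24, 8:24] = 255",              # a bright square
    "img[np.eye(32, dtype=bool)] = 255",  # a diagonal line
    "img[:, 12:20] = 128",                # a vertical gray bar
]

def render(snippet, size=32):
    """Execute one drawing program into a fresh grayscale canvas."""
    namespace = {"np": np, "img": np.zeros((size, size), dtype=np.uint8)}
    exec(snippet, namespace)
    return namespace["img"]

def two_views(img, crop=24):
    """Two random crops of the same rendering form a positive pair
    for contrastive pretraining; no human labels are needed."""
    views = []
    for _ in range(2):
        y, x = rng.integers(0, img.shape[0] - crop, size=2)
        views.append(img[y:y + crop, x:x + crop])
    return views

dataset = [two_views(render(s)) for s in snippets]
print(len(dataset), dataset[0][0].shape)  # 3 (24, 24)
```

A vision encoder pretrained on such pairs can then be evaluated on natural images to test whether code-derived renderings transfer semantically.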


@misc{vision_checkup,
      title={A Vision Check-up for Language Models},
      author={Sharma, Pratyusha and Rott Shaham, Tamar and Baradad, Manel and Fu, Stephanie and Rodriguez-Munoz, Adrian and Duggal, Shivam and Isola, Phillip and Torralba, Antonio},
      booktitle={arXiv preprint},
      year={2024}
}