A vision-language model (VLM) is a type of artificial intelligence system that can jointly interpret images and text and generate output informed by both, extending the capabilities of large language models (LLMs), which are limited to text. VLMs are an example of multimodal learning.

Many widely used commercial applications now rely on this ability. OpenAI introduced computer vision