Multimodal capabilities of LLMs
Multimodal capabilities of LLMs refer to their ability to understand and generate content in multiple modalities beyond text.
A recent update to ChatGPT lets us send an image as part of a prompt, so we can now ask questions about the image we provide.
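As a rough sketch of what this looks like programmatically (assuming the OpenAI Python client v1+ and the vision-capable model name available at the time of writing; the image URL is a placeholder):

```python
# Minimal sketch: attaching an image to a chat request.
# Model name and image URL are assumptions/placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```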
GPT-4V is not the only model on the market; there are others, such as Fuyu-8B, LLaVA, BakLLaVA, Qwen-VL, PaLM-E, PaLI-X, Flamingo, and CogVLM. These models add all or some of the following capabilities:
Visual question answering
Image captioning
Visual instruction following
Multimodal translation
Chart understanding
Document understanding
Diagram understanding
OCR
Fine-grained localization of text and UI elements within images
Answering questions about images of UIs
The vital question for such models is how to get an image into the LLM. There are two approaches: use a dedicated visual encoder, or connect image patches directly to the first transformer layer of the LLM. For instance, LLaVA uses a pre-trained CLIP ViT-L/14 as its encoder and Vicuna as its LLM, whereas Fuyu-8B has no visual encoder at all.
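To make the encoder-free approach concrete, here is a minimal PyTorch sketch of projecting raw image patches straight into the language model's embedding space, Fuyu-style; the patch size and hidden dimension are illustrative assumptions, not the model's real configuration:

```python
# Sketch of the "no visual encoder" approach: flatten image patches and
# linearly project them into the same space as text token embeddings.
import torch
import torch.nn as nn

patch_size = 30      # illustrative patch size
hidden_dim = 4096    # illustrative LLM embedding size

# One linear layer maps a flattened RGB patch to a "token" embedding.
patch_proj = nn.Linear(3 * patch_size * patch_size, hidden_dim)

image = torch.rand(1, 3, 300, 300)                      # dummy image, batch of 1
patches = image.unfold(2, patch_size, patch_size) \
               .unfold(3, patch_size, patch_size)       # (1, 3, 10, 10, 30, 30)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

image_tokens = patch_proj(patches)                      # (1, 100, hidden_dim)

# The image "tokens" are simply concatenated with embedded text tokens
# and fed to the first transformer layer of the decoder.
text_tokens = torch.rand(1, 12, hidden_dim)             # stand-in for embedded text
decoder_input = torch.cat([image_tokens, text_tokens], dim=1)
print(decoder_input.shape)                              # torch.Size([1, 112, 4096])
```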
CLIP is a neural network that maps text and images into a shared embedding space. It is trained contrastively on image-text pairs with separate image and text encoders, so matching images and captions land close together; LLaVA then projects CLIP's visual features into the LLM's token space, so from the LLM's perspective image tokens look much like text tokens. Because CLIP's vision encoder expects a fixed input resolution, larger images have to be down-sampled and then padded or distorted depending on the aspect ratio.
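As an illustration of that shared space, here is a short sketch using the Hugging Face transformers implementation of CLIP (the image path is a placeholder): the same model embeds an image and candidate captions, and the embeddings can be compared directly.

```python
# Sketch: compare an image against two captions in CLIP's shared space.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")                      # placeholder image path
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability = better match between the image and that caption.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```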
It's interesting to see how multiple models can be combined to solve complex problems. Right now, ChatGPT cannot draw a box or a mask around an object in a source image, but the task can be solved with an additional model. The multimodal-maestro library uses GroundingDINO together with an LLM for this: first it adds numeric labels to the objects in the image, then it asks the LLM to find the requested object, and finally it draws a mask around it.
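That three-step flow can be summarized in pseudocode; the helper functions below are hypothetical stand-ins for illustration only, not the actual multimodal-maestro API:

```python
# Hypothetical sketch of the mark-then-ask flow described above.
# detect_objects, draw_numeric_marks, ask_vision_llm and draw_mask are
# illustrative placeholders, not real library functions.

def locate_object(image, query: str):
    # 1. Detect candidate objects (e.g. with a detector such as GroundingDINO)
    #    and draw numbered marks on a copy of the image.
    detections = detect_objects(image)
    marked_image = draw_numeric_marks(image, detections)

    # 2. Ask a vision-capable LLM which numbered mark matches the query.
    answer = ask_vision_llm(
        marked_image,
        f"Which numbered object is: {query}? Reply with the number only.",
    )
    index = int(answer.strip())

    # 3. Draw a mask/box around the chosen object on the original image.
    return draw_mask(image, detections[index])
```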
Adding visual capabilities to LLMs is a huge step forward. Image processing allows us to digitize the real world; in other words, it converts information about the real world into a structured form.
References:
https://openai.com/research/gpt-4v-system-card - GPT-4V(ision) system card
https://www.adept.ai/blog/fuyu-8b - Fuyu-8B: A Multimodal Architecture for AI Agents
https://blog.roboflow.com/gpt-4-vision-alternatives/ - GPT-4 Vision Alternatives
https://microsoft.github.io/autogen/blog/2023/11/06/LMM-Agent - Multimodal with GPT-4V and LLaVA
https://github.com/roboflow/multimodal-maestro - Effective prompting for Large Multimodal Models like GPT-4 Vision or LLaVA.
https://openai.com/research/clip - CLIP: Connecting text and images