Modern multimodal AI models such as CLIP and BLIP use attention mechanisms — the core idea behind Transformers — to learn relationships between words and visual elements. In both the vision and text ...
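To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product cross-attention, where text tokens attend over image patches. This is an illustrative toy, not CLIP's or BLIP's actual implementation: the embedding dimensions, the random projection matrices, and the function names are all assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, image_patches, d_k=64, seed=0):
    """Text tokens (queries) attend over image patches (keys/values).

    text_tokens:   (n_text, d_text) array of token embeddings
    image_patches: (n_img, d_img) array of patch embeddings
    Returns (attended, weights): attended values per text token and
    the (n_text, n_img) attention weight matrix (rows sum to 1).
    """
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned weight matrices.
    W_q = rng.standard_normal((text_tokens.shape[-1], d_k))
    W_k = rng.standard_normal((image_patches.shape[-1], d_k))
    W_v = rng.standard_normal((image_patches.shape[-1], d_k))
    Q = text_tokens @ W_q        # queries from the text side
    K = image_patches @ W_k      # keys from the vision side
    V = image_patches @ W_v      # values from the vision side
    scores = Q @ K.T / np.sqrt(d_k)   # scaled dot-product scores
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy inputs: 5 text tokens (512-dim), 49 image patches (768-dim).
text = np.random.default_rng(1).standard_normal((5, 512))
patches = np.random.default_rng(2).standard_normal((49, 768))
attended, weights = cross_attention(text, patches)
```

Each row of `weights` is a distribution over image patches, showing which visual regions a given word attends to; in a trained model these alignments emerge from the learned projection matrices rather than random ones.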