Vision Language Models Explained — Overview of Architectures and Models
AI Impact Summary
This document provides a high-level overview of Vision Language Models (VLMs), covering their architectures, training methods, and openly available models. It highlights the growing diversity of VLMs, including open-source options such as LLaVA, KOSMOS-2, and Fuyu-8B, and covers techniques such as generating instruction-tuning data with GPT-4 and fine-tuning with TRL's SFTTrainer. It also describes key benchmarks, such as MMMU and MMBench, used to evaluate VLM capabilities in areas like visual question answering and reasoning.
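Since the summary names TRL's SFTTrainer as the fine-tuning entry point, a minimal sketch follows. It assumes a LLaVA-style checkpoint (llava-hf/llava-1.5-7b-hf) and an image-text instruction dataset (HuggingFaceH4/llava-instruct-mix-vsft); both names, the hyperparameters, and the simplified label masking are illustrative assumptions rather than details taken from this document.

```python
# Sketch: supervised fine-tuning of a VLM with TRL's SFTTrainer.
# Checkpoint, dataset, and hyperparameters below are assumptions for
# illustration, not values specified by the source document.
import torch
from datasets import load_dataset
from transformers import AutoProcessor, LlavaForConditionalGeneration
from trl import SFTConfig, SFTTrainer

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed chat-style VLM checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
)

# Instruction data with "messages" (chat turns) and "images" columns.
dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train")

def collate_fn(examples):
    # Render chat turns to text, then batch-encode text + images together.
    texts = [
        processor.apply_chat_template(ex["messages"], tokenize=False)
        for ex in examples
    ]
    images = [ex["images"] for ex in examples]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    # Causal-LM labels; masking padding only is a simplification here.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch

args = SFTConfig(
    output_dir="llava-sft",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    remove_unused_columns=False,                    # keep the "images" column
    dataset_kwargs={"skip_prepare_dataset": True},  # collate_fn does the work
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collate_fn,
)
trainer.train()
```

In practice a 7B model usually also needs PEFT/QLoRA or gradient checkpointing to fit on a single GPU; those knobs are omitted here for brevity.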
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info