Falcon Perception: unified early-fusion Transformer for open-vocabulary grounding and segmentation
AI Impact Summary
Falcon Perception processes image and text tokens in a single 0.6B-parameter Transformer with a hybrid attention mask, enabling open-vocabulary grounding and dense segmentation through a compact Chain-of-Perception interface (<coord> → <size> → <seg>). The design drops the traditional separate vision backbone and late-fusion stage, which may reduce latency and make it easier to attribute improvements to specific changes. The release also introduces PBench, a benchmark that diagnoses capabilities across OCR, spatial reasoning, and relations. Falcon Perception is paired with Falcon OCR, and the release reports ensemble validation against SAM 3 and other models (Qwen3-VL-30B, Moondream3), underscoring a move toward unified backbones and capability-aware benchmarking for open-vocabulary perception tasks.
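The summary does not spell out how the hybrid attention mask is laid out. As a rough illustration only, the sketch below assumes one common early-fusion arrangement: image tokens come first and attend to each other bidirectionally, while text tokens attend to all image tokens and causally to earlier text. The function name `hybrid_attention_mask` and the token ordering are hypothetical and are not taken from the Falcon Perception release.

```python
def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[bool]]:
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (not confirmed by the source): image tokens occupy
    positions [0, num_image_tokens), text tokens follow them.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image_tokens:
                # Image query: bidirectional over image tokens, blind to text.
                mask[q][k] = k < num_image_tokens
            else:
                # Text query: sees every image token plus text up to and
                # including its own position (causal over the text span).
                mask[q][k] = k <= q
    return mask


if __name__ == "__main__":
    # Two image tokens followed by two text tokens.
    for row in hybrid_attention_mask(2, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, a single Transformer stack can serve both modalities: the image block behaves like an encoder, and the text block behaves like a decoder conditioned on it, with no separate fusion module.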
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info