ICML 2026 · Main Conference
A cognitively-grounded benchmark probing how Omni-MLLMs perceive, understand, reason about, and adapt to the audio-visual world.
Abstract
Current evaluations of Omni-MLLMs rely on isolated tasks without examining the relationships between them, limiting deeper assessment of audio-visual capabilities. We propose AVI-Bench, a cognitively-inspired framework built around three cognitively grounded stages — Perception, Understanding, Reasoning — together with the Primitive Sensation (PriSe) extension for unfamiliar-domain evaluation.
The benchmark contains 5,864 samples across 14 tasks, scored by 9 metrics and summarised through a four-level AVI taxonomy. Our findings reveal that reasoning abilities are fundamentally constrained by perception and understanding, and that current Omni-MLLMs generalise poorly across unfamiliar audio-visual domains.
The Four-Level Taxonomy
Each level is a strictly harder test of cognitive integration — from single-task competence to robust adaptation across modalities, cognitive stages, and unfamiliar domains.
Average competence across all audio-visual tasks — the baseline.
Balance between audio-dominant and visual-dominant tasks.
Reasoning grounded in its perceptual and conceptual prerequisites.
Robust performance across both familiar and unfamiliar domains.
Benchmark Design
AVI-Bench mirrors the human cognitive ladder — perception, understanding, reasoning — and adds Primitive Sensation (PriSe): low-semantic, out-of-distribution stimuli that test whether models generalise beyond their training prior.
Results
| Model | Params | Perc. | Underst. | Reason. | Sens. | Overall |
|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | — | 54.58 | 68.97 | 69.06 | 36.22 | 57.21 |
| Gemini-2.5-Flash | — | 45.97 | 43.79 | 63.70 | 30.63 | 46.02 |
| Gemini-2.0-Flash | — | 44.27 | 42.11 | 64.03 | 29.48 | 44.97 |
| Qwen2.5-Omni | 7B | 42.81 | 39.68 | 58.26 | 24.59 | 41.33 |
| GPT-4o | — | 40.45 | 48.60 | 56.87 | 16.81 | 40.68 |
| Human (subset) | — | — | — | — | 90+ | 92.6 |
| Model | Params | L1 Task | L2 Modality | L3 Stage | L4 Domain |
|---|---|---|---|---|---|
| Gemini-2.5-Pro | — | 64.20 | 62.80 | 57.08 | 32.97 |
| Gemini-2.5-Flash | — | 51.15 | 48.58 | 40.47 | 27.72 |
| Gemini-2.0-Flash | — | 50.14 | 49.21 | 39.79 | 27.12 |
| Qwen-Omni-Turbo | 7B | 46.50 | 45.15 | 37.70 | 26.13 |
| Qwen2.5-Omni | 7B | 46.92 | 45.93 | 37.61 | 25.89 |
| Baichuan-Omni | 7B | 37.35 | 35.80 | 30.18 | 24.10 |
| GPT-4o | — | 48.64 | 47.19 | 41.93 | 00.55 |
Task Samples
One representative example per cognitive stage. Click play to listen and watch.
Audio–Visual Matching
QAre the contexts of audio and visual content matching?
Ano
Audio–Visual Localization
QGenerate bounding boxes of sound-emitting instances in the image, conditioned on the audio.
A{"tuba_1": [153, 4, 107, 210], "tuba_2": [323, 57, 73, 155]}
Audio–Visual Captioning
QDescribe what you see and hear in a single sentence.
AA baby is laughing and smiling while her mother is lying down next to her, talking to her.
Audio-referenced Visual Retrieval
QIdentify which images contain objects corresponding to the sound.
A[] (no match in the candidate set)
Audio-referenced Visual Hallucination
QIs the car visible in the video?
Ayes
Audio–Visual Question Answering
QIs there a voiceover?
Ayes
Audio–Visual Sensation QA
QWhich object produces sound first?
Acube
Visual Sensation QA (video)
QWhich object's area in the video has not changed?
Asquare
Quickstart
# 1. Setup conda create -n avibench python=3.11 -y && conda activate avibench pip install -r requirements.txt # 2. Configure your OpenAI-compatible gateway export OPENAI_API_KEY=... OPENAI_BASE_URL=... export DATA_ROOT=/path/to/AVIBench_data_release/levels # 3. Inference -> Refine -> Evaluate bash run_all.sh cd auto_format && python run.py && cd .. cd eval && python eval.py --models gemini-2.5-pro
Citation
@inproceedings{wang2026avibench,
title = {AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs},
author = {Wang, Yaoting and Zhang, Ziyi and Tu, Wenming and Xu, Shaoxuan and
Du, Wenjie and Liang, Cheng and Wang, Weijun and Li, Yuanchao and
Li, Guangyao and Fei, Hao and Li, Yuanchun and Ding, Henghui and
Liu, Yunxin},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year = {2026}
}