AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Wang, Yaoting

ICML 2026 · Main Conference

A cognitively-grounded benchmark probing how Omni-MLLMs perceive, understand, reason about, and adapt to the audio-visual world.

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang,
Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

Abstract

Current evaluations of Omni-MLLMs rely on isolated tasks without examining the relationships between them, which limits deeper assessment of audio-visual capabilities. We propose AVI-Bench, a cognitively inspired benchmark built around three stages (Perception, Understanding, and Reasoning), together with the Primitive Sensation (PriSe) extension for unfamiliar-domain evaluation.

The benchmark contains 5,864 samples across 14 tasks, scored by 9 metrics and summarised through a four-level AVI taxonomy. Our findings reveal that reasoning abilities are fundamentally constrained by perception and understanding, and that current Omni-MLLMs generalise poorly across unfamiliar audio-visual domains.

The Four-Level Taxonomy

Four orders of audio-visual intelligence

Each level is a strictly harder test of cognitive integration, moving from average task competence to robust adaptation across modalities, cognitive stages, and unfamiliar domains.

I

Task-Adaptive

Average competence across all audio-visual tasks; this is the baseline.

II

Modality-Adaptive

Balance between audio-dominant and visual-dominant tasks.

III

Stage-Adaptive

Reasoning grounded in its perceptual and conceptual prerequisites.

IV

Domain-Adaptive

Robust performance across both familiar and unfamiliar domains.

Benchmark Design

Three cognitive stages, plus a stress test

AVI-Bench framework: Perception, Understanding, Reasoning, and PriSe stages

Cognitive scaffolding

AVI-Bench mirrors the human cognitive ladder — perception, understanding, reasoning — and adds Primitive Sensation (PriSe): low-semantic, out-of-distribution stimuli that test whether models generalise beyond their training prior.

5,864 samples

14 tasks

9 metrics

Usage Policy

For evaluation only — not for training

AVI-Bench is released under a dual licence: code under MIT, and the dataset under the AVI-Bench Data Use Policy v1.0 (CC BY-ND 4.0 with an Anti-Training Addendum). Using the dataset — in whole or in part — to train, fine-tune, distil, align, or otherwise update any machine-learning model is expressly prohibited. This includes LLMs, VLMs, audio-language models, omni-modal foundation models, diffusion models, and any subsequent model class. Bulk redistribution and automated scraping are also prohibited. Commercial evaluation and benchmarking are permitted.

Conventional and AI-search retrieval are welcome: this page is indexable by Google, Bing, and AI chat retrieval agents so that researchers can discover and cite the work. Training crawlers are blocked via robots.txt and HTML meta directives.
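The split between search indexing and training crawlers described above can be checked locally with Python's standard `urllib.robotparser`. The sketch below is illustrative only: the rules, the user-agent names `ExampleTrainingBot` and `ExampleSearchBot`, and the URL are hypothetical placeholders, not the contents of this site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; the site's real robots.txt is not reproduced here.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

# Placeholder URL, not the project's real address.
PAGE = "https://example.org/avi-bench/"


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler named in a Disallow group versus one covered by the wildcard group.
print("training bot allowed:", is_allowed(HYPOTHETICAL_ROBOTS_TXT, "ExampleTrainingBot", PAGE))
print("search bot allowed:  ", is_allowed(HYPOTHETICAL_ROBOTS_TXT, "ExampleSearchBot", PAGE))
```

robots.txt is advisory, which is presumably why the policy pairs it with HTML meta directives and a licence addendum.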

Results

A wide human–model gap remains

Per-stage performance

28 Omni-MLLMs · mean accuracy

Model	Params	Perception	Understanding	Reasoning	Sensation (PriSe)	Overall
Gemini-2.5-Pro	—	54.58	68.97	69.06	36.22	57.21
Gemini-2.5-Flash	—	45.97	43.79	63.70	30.63	46.02
Gemini-2.0-Flash	—	44.27	42.11	64.03	29.48	44.97
Qwen2.5-Omni	7B	42.81	39.68	58.26	24.59	41.33
GPT-4o	—	40.45	48.60	56.87	16.81	40.68
Human (subset)	—	—	—	—	90+	92.6
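For the model rows shown, the Overall column is numerically consistent with an unweighted mean of the four stage scores. This is an observation from the displayed numbers, not a documented definition of the metric. A minimal check, with values copied from the table:

```python
# (Perception, Understanding, Reasoning, Sensation) scores and the reported Overall,
# copied from the per-stage table above.
rows = {
    "Gemini-2.5-Pro": ((54.58, 68.97, 69.06, 36.22), 57.21),
    "Gemini-2.5-Flash": ((45.97, 43.79, 63.70, 30.63), 46.02),
    "Gemini-2.0-Flash": ((44.27, 42.11, 64.03, 29.48), 44.97),
    "Qwen2.5-Omni": ((42.81, 39.68, 58.26, 24.59), 41.33),
    "GPT-4o": ((40.45, 48.60, 56.87, 16.81), 40.68),
}


def stage_mean(scores):
    """Unweighted mean of the stage scores (assumed aggregation, see lead-in)."""
    return sum(scores) / len(scores)


def max_abs_gap(table):
    """Largest absolute difference between the stage mean and the reported Overall."""
    return max(abs(stage_mean(scores) - overall) for scores, overall in table.values())


for name, (scores, overall) in rows.items():
    print(f"{name:18s} mean={stage_mean(scores):.3f} reported={overall:.2f}")

print(f"largest gap: {max_abs_gap(rows):.4f}")
```

Every gap stays within rounding of two decimal places, so the simple-mean reading holds for these five rows.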

Four-level AVI taxonomy

L1 → L4 · bottleneck-aware

Model	Params	L1 Task	L2 Modality	L3 Stage	L4 Domain
Gemini-2.5-Pro	—	64.20	62.80	57.08	32.97
Gemini-2.5-Flash	—	51.15	48.58	40.47	27.72
Gemini-2.0-Flash	—	50.14	49.21	39.79	27.12
Qwen-Omni-Turbo	7B	46.50	45.15	37.70	26.13
Qwen2.5-Omni	7B	46.92	45.93	37.61	25.89
Baichuan-Omni	7B	37.35	35.80	30.18	24.10
GPT-4o	—	48.64	47.19	41.93	0.55
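The table can be read as a monotone ladder: for each listed model the score does not increase from L1 to L4. The sketch below checks that property and computes the L1-to-L4 drop from the table values. It does not reproduce how L2 to L4 are derived, which this page does not specify.

```python
# (L1 Task, L2 Modality, L3 Stage, L4 Domain) copied from the taxonomy table above.
# GPT-4o's L4 is entered as 0.55, the value quoted in the findings.
taxonomy = {
    "Gemini-2.5-Pro": (64.20, 62.80, 57.08, 32.97),
    "Gemini-2.5-Flash": (51.15, 48.58, 40.47, 27.72),
    "Gemini-2.0-Flash": (50.14, 49.21, 39.79, 27.12),
    "Qwen-Omni-Turbo": (46.50, 45.15, 37.70, 26.13),
    "Qwen2.5-Omni": (46.92, 45.93, 37.61, 25.89),
    "Baichuan-Omni": (37.35, 35.80, 30.18, 24.10),
    "GPT-4o": (48.64, 47.19, 41.93, 0.55),
}


def is_non_increasing(levels):
    """True if each level scores no higher than the level before it."""
    return all(a >= b for a, b in zip(levels, levels[1:]))


def l1_to_l4_drop(levels):
    """Points lost between the task-level score and the domain-level score."""
    return levels[0] - levels[-1]


for model, levels in taxonomy.items():
    print(f"{model:18s} non-increasing={is_non_increasing(levels)} "
          f"drop={l1_to_l4_drop(levels):.2f}")

largest = max(taxonomy, key=lambda m: l1_to_l4_drop(taxonomy[m]))
print("largest L1-to-L4 drop:", largest)
```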

The top model, Gemini-2.5-Pro, reaches only 57.2 overall — far from the human baseline (~92.6). AVI-Bench remains an unsolved frontier.

Visual-dominant tasks consistently outperform audio-dominant ones, confirming a persistent modality imbalance.

Primitive Sensation is the weakest stage across the board, and the taxonomy makes the gap explicit: GPT-4o ranks 7th on L1 yet collapses to L4 = 0.55 because of near-zero performance on audio-only sensation tasks.

Task Samples

A look inside the benchmark

One representative example per cognitive stage.

AVM

Audio–Visual Matching

Q: Are the contexts of audio and visual content matching?

A: no

AVL

Audio–Visual Localization

Q: Generate bounding boxes of sound-emitting instances in the image, conditioned on the audio.

A: {"tuba_1": [153, 4, 107, 210], "tuba_2": [323, 57, 73, 155]}

AVC

Audio–Visual Captioning

Q: Describe what you see and hear in a single sentence.

A: A baby is laughing and smiling while her mother is lying down next to her, talking to her.

AVR

Audio-referenced Visual Retrieval

Q: Identify which images contain objects corresponding to the sound.

A: [] (no match in the candidate set)

AVH

Audio-referenced Visual Hallucination

Q: Is the car visible in the video?

A: yes

AVQA

Audio–Visual Question Answering

Q: Is there a voiceover?

A: yes

AVSQA

Audio–Visual Sensation QA

Q: Which object produces sound first?

A: cube

VSQA

Visual Sensation QA (video)

Q: Which object's area in the video has not changed?

A: square
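The AVL sample above returns a JSON object mapping instance names to four-number boxes. The sketch below parses that answer and compares boxes by intersection-over-union. Two assumptions are made here and are not stated on this page: that boxes are `[x, y, width, height]` (the sample cannot be `[x1, y1, x2, y2]`, since 107 < 153), and that IoU is a suitable overlap measure. The metric AVI-Bench actually uses for localization may differ.

```python
import json

# Answer string copied from the AVL sample above.
AVL_ANSWER = '{"tuba_1": [153, 4, 107, 210], "tuba_2": [323, 57, 73, 155]}'


def xywh_to_xyxy(box):
    """Convert [x, y, width, height] (assumed layout) to corner coordinates."""
    x, y, w, h = box
    return x, y, x + w, y + h


def iou(box_a, box_b):
    """Intersection-over-union of two [x, y, width, height] boxes."""
    ax1, ay1, ax2, ay2 = xywh_to_xyxy(box_a)
    bx1, by1, bx2, by2 = xywh_to_xyxy(box_b)
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


boxes = json.loads(AVL_ANSWER)
for name, box in boxes.items():
    print(name, "corners:", xywh_to_xyxy(box))

# The two tubas occupy disjoint horizontal ranges, so they do not overlap.
print("IoU(tuba_1, tuba_2) =", iou(boxes["tuba_1"], boxes["tuba_2"]))
```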

Quickstart

Evaluate a model in three steps

# 1. Setup
conda create -n avibench python=3.11 -y && conda activate avibench
pip install -r requirements.txt

# 2. Configure your OpenAI-compatible gateway
export OPENAI_API_KEY=... OPENAI_BASE_URL=...
export DATA_ROOT=/path/to/AVIBench_data_release/levels

# 3. Inference -> Refine -> Evaluate
bash run_all.sh
cd auto_format && python run.py && cd ..
cd eval && python eval.py --models gemini-2.5-pro
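Before step 3, it can help to confirm that the three variables exported in step 2 are actually set. The helper below is a hypothetical pre-flight check and is not part of the AVI-Bench repository; only the variable names come from the quickstart above.

```python
import os

# Variable names taken from step 2 of the quickstart.
REQUIRED_SETTINGS = ("OPENAI_API_KEY", "OPENAI_BASE_URL", "DATA_ROOT")


def missing_settings(env):
    """Return the required names that are absent or empty in `env`."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


def data_root_exists(env):
    """True if DATA_ROOT is set and points at an existing directory."""
    root = env.get("DATA_ROOT", "")
    return bool(root) and os.path.isdir(root)


missing = missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
elif not data_root_exists(os.environ):
    print("DATA_ROOT is set but is not a directory:", os.environ["DATA_ROOT"])
else:
    print("Environment looks ready for run_all.sh")
```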

Citation

If AVI-Bench helps your work

@inproceedings{wang2026avibench,
  title     = {AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs},
  author    = {Wang, Yaoting and Zhang, Ziyi and Tu, Wenming and Xu, Shaoxuan and
               Du, Wenjie and Liang, Cheng and Wang, Weijun and Li, Yuanchao and
               Li, Guangyao and Fei, Hao and Li, Yuanchun and Ding, Henghui and
               Liu, Yunxin},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}
