Unison: Benchmarking Unified Multimodal Models via Synergistic Understanding and Generation

ICML 2026

Jinyu Liu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang

Fudan University

2,169 unified task samples
14 unified tasks
88.7% human alignment
15 models evaluated

Overview

We introduce Unison, a comprehensive benchmark of 2,169 high-quality unified task samples designed to evaluate joint understanding and generation in unified multimodal models. Unison offers three key strengths: 1) Comprehensive Dimensions: it covers internal consistency, understanding-guided generation, generation-guided understanding, and mutual enhancement, enabling holistic evaluation. 2) Diagnostic Evaluation: it provides both unified and decoupled tracks for understanding and generation, allowing fine-grained attribution of failure modes and quantitative analysis of the gains from unified modeling. 3) Human Alignment: we train Unison-Judge, an evaluation model well aligned with human judgments, for reliable assessment. Systematic evaluation of state-of-the-art models on Unison uncovers critical limitations in current unified multimodal systems and highlights promising directions for future research.

Figure: Overview of the four Unison evaluation dimensions: internal consistency, understanding-guided generation, generation-guided understanding, and mutual enhancement.
Figures: Unison benchmark statistics; benchmark comparison.

Leaderboard

Scores across the four dimensions, each split into Und. (understanding), Gen. (generation) and Uni. (unified) tracks. Bold = best, underline = second best within open-source models.

Model  Params  Internal Consistency (Und. Gen. Uni.)  Und.-Guided Gen. (Und. Gen. Uni.)  Gen.-Guided Und. (Und. Gen. Uni.)  Mutual Enhancement (Und. Gen. Uni.)  Overall
Show-o  1.3B  88.3 64.7 58.5  8.90  12.0
Janus-Pro  1.5B  94.4 47.1 45.0  0.3  19.2
Show-o2  1.5B  96.0 67.9 65.8  26.7  9.4
D-DiT  2B  86.5 65.0 58.1  0.2  6.8
ILLUME+  3B  43.4 19.9 10.5  10.3 7.7 9.0  11.3 30.1 15.1  1.0 5.5 3.2  9.4
Janus-Pro  7B  95.7 71.7 69.8  3.2  15.1
Show-o2  7B  97.2 73.8 72.5  9.9  9.2
ILLUME+  7B  80.2 20.4 16.7  12.4 10.4 11.4  11.3 27.7 13.9  2.7 6.8 4.8  11.7
🥈 OmniGen2  7B  92.3 79.0 74.5  61.3 42.6 52.0  19.7 41.9 30.9  45.0 50.3 47.7  51.3
TokenFlow  14B  93.0 47.1 44.5  20.1  17.0
🥇 BAGEL  14B  96.0 82.5 80.3  57.6 78.1 67.9  28.2 41.6 32.0  7.2 57.7 32.5  53.2
SEED-X  17B  82.8 38.9 34.2  18.6 13.7 16.1  13.5 27.4 20.8  0.2 16.8 8.5  19.9
🥉 UniWorld-V1  19B  92.6 68.5 65.1  63.4 26.4 44.9  22.8 32.0 26.9  46.4 16.2 31.3  42.1

Model  Params  Internal Consistency (Und. Gen. Uni.)  Und.-Guided Gen. (Und. Gen. Uni.)  Gen.-Guided Und. (Und. Gen. Uni.)  Mutual Enhancement (Und. Gen. Uni.)  Overall
Gemini 3 Pro  98.3 88.1 86.9  71.0 82.8 76.9  42.2 46.5 43.9  65.3 77.4 71.4  69.8
GPT-5.2  98.6 86.3 84.7  69.7 85.7 77.7  44.8 58.2 52.7  69.1 71.2 70.2  71.3

Higher is better. Dashes mark tracks a model does not support. For the closed-source models, bold marks the higher of the two systems.
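For readers who want to sanity-check the leaderboard, the sketch below copies the Uni. scores and Overall values from the rows that report all four dimensions. It tests one observation: in those rows, Overall appears to match the unweighted mean of the four Uni. scores to within rounding. This is an inference from the published numbers, not a definition stated on this page. The names `ROWS`, `mean_uni`, `max_overall_gap`, and `rank_by_overall` are illustrative only. Rows with unsupported tracks are left out.

```python
# Uni. scores per dimension, in leaderboard order (Internal Consistency,
# Und.-Guided Gen., Gen.-Guided Und., Mutual Enhancement), followed by the
# reported Overall. Only rows listing all four dimensions are included.
ROWS = {
    "ILLUME+ 3B": ((10.5, 9.0, 15.1, 3.2), 9.4),
    "ILLUME+ 7B": ((16.7, 11.4, 13.9, 4.8), 11.7),
    "OmniGen2 7B": ((74.5, 52.0, 30.9, 47.7), 51.3),
    "BAGEL 14B": ((80.3, 67.9, 32.0, 32.5), 53.2),
    "SEED-X 17B": ((34.2, 16.1, 20.8, 8.5), 19.9),
    "UniWorld-V1 19B": ((65.1, 44.9, 26.9, 31.3), 42.1),
    "Gemini 3 Pro": ((86.9, 76.9, 43.9, 71.4), 69.8),
    "GPT-5.2": ((84.7, 77.7, 52.7, 70.2), 71.3),
}


def mean_uni(uni_scores):
    """Unweighted mean of the Uni. scores (assumed aggregation, see lead-in)."""
    return sum(uni_scores) / len(uni_scores)


def max_overall_gap(rows):
    """Largest absolute gap between the mean Uni. score and the reported Overall."""
    return max(abs(mean_uni(uni) - overall) for uni, overall in rows.values())


def rank_by_overall(rows):
    """Model names sorted by reported Overall, best first."""
    return sorted(rows, key=lambda name: rows[name][1], reverse=True)


if __name__ == "__main__":
    for name in rank_by_overall(ROWS):
        uni, overall = ROWS[name]
        print(f"{name:16s} mean(Uni.)={mean_uni(uni):6.2f}  Overall={overall:5.1f}")
    print(f"largest gap: {max_overall_gap(ROWS):.3f}")
```

If the assumption holds, every gap stays within the rounding error of one decimal place. Sorting by Overall also puts the three medal-marked open-source models directly behind the two closed-source systems.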

Citation

If you find this work useful, please consider citing:

@inproceedings{liu2026unison,
  title     = {Unison: Benchmarking Unified Multimodal Models via
               Synergistic Understanding and Generation},
  author    = {Liu, Jinyu and Shuai, Xincheng and Ding, Henghui and Jiang, Yu-Gang},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}