MENTOR: Efficient Multimodal‑Conditioned Tuning
for Autoregressive Vision Generation Models

University of Illinois Urbana‑Champaign · University of Wisconsin‑Madison · Tsinghua University · Peking University · Adobe Research · Microsoft

🔥 News

  • 🚀 2025‑06‑02 — Initial release of MENTOR, a lightweight yet state‑of‑the‑art multimodal‑conditioned image generator.

🧠 Overview

CP·PF score comparison
MENTOR vs baselines on DreamBench++. Circle size = CP·PF score.

MENTOR is a lightweight autoregressive (AR) framework for controllable multimodal image generation. Unlike diffusion models that rely on stochastic sampling and costly training, MENTOR uses a unified transformer to directly align multimodal inputs with output image tokens, enabling precise, token-level control with significantly fewer resources.

Our two-stage training ensures balanced fusion across modalities and avoids over-reliance on either text or visual cues.

  • 🚀 10× less data & fewer GPU hours than diffusion-based models
  • 📈 SOTA on DreamBench++ with a CP·PF score of 0.47
  • 🎯 Effective for diverse downstream tasks: image reconstruction, subject-driven generation, multimodal ICL

Code, data, and models available at: github.com/HaozheZhao/MENTOR
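
To make the token-level control concrete, below is a minimal, illustrative sketch (not MENTOR's released code) of multimodal-conditioned AR generation: condition features form a prefix to the discrete image-token sequence, and a causal transformer is trained with next-token prediction over the image tokens. The tiny model, shapes, and codebook size are assumptions; positional embeddings and the image tokenizer are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyARGenerator(nn.Module):
    """Toy causal transformer over [multimodal prefix; image tokens]."""
    def __init__(self, vocab_size=16384, dim=256, n_layers=2, n_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, cond_emb, image_tokens):
        # cond_emb: (B, Lc, D) projected text + reference-image features
        # image_tokens: (B, Li) discrete codes of the target image (e.g. VQ indices)
        x = torch.cat([cond_emb, self.tok_emb(image_tokens)], dim=1)
        L = x.size(1)
        causal = torch.full((L, L), float("-inf")).triu(1)  # block attention to future positions
        h = self.backbone(x, mask=causal)
        return self.head(h)  # next-token logits over the image codebook

model = TinyARGenerator()
cond = torch.randn(2, 8, 256)                 # dummy multimodal prefix features
tokens = torch.randint(0, 16384, (2, 16))     # dummy target image tokens
logits = model(cond, tokens)
# Position Lc-1+i predicts image token i, so shift the logits by the prefix length.
pred = logits[:, cond.size(1) - 1 : -1, :]
loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), tokens.reshape(-1))
print(f"toy next-token loss: {loss.item():.3f}")
```

Because decoding is deterministic given the prefix (greedy or fixed-seed sampling), the condition tokens steer every generated image token directly, which is the source of the layout and identity control described below.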

🌟 Highlights

| Metric | Diffusion SOTA | MENTOR |
| --- | --- | --- |
| CP·PF ↑ (DreamBench++) | 0.36 (Emu2) | 0.47 |
| Training data | 16‑200 M pairs | 3 M |
| GPU budget | 256 GPUs × 3 days (Kosmos-G) | 8 GPUs × 1.5 days |
| Image reconstruction ℓ2 ↓ | 0.206 (DreamEngine) | 0.101 |

Smaller is better for ℓ2; higher is better for CP·PF.

✨ Why MENTOR?

Token‑level control — deterministic AR decoding avoids stochastic diffusion noise, enabling precise layout & identity preservation.

Balanced multimodal fusion — two‑stage tuning prevents over‑reliance on either text or image, yielding the lowest CP/PF imbalance among baselines.

Training‑friendly — 10× less data and fewer GPU hours than diffusion counterparts.

🔍 Method — Two‑Stage Multimodal‑Conditioned Tuning

MENTOR method diagram

Two-Stage Training Paradigm

| Stage | Tasks | Objective |
| --- | --- | --- |
| Stage 1: Alignment | Image Reconstruction, Object Segmentation, Text-to-Image (T2I) | Enhance pixel and semantic alignment |
| Stage 2: Instruction Tuning | Image Recovery, Subject-Driven Generation | Achieve robust and balanced multimodal integration |
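
As a rough illustration of how these two stages can be driven, the snippet below mixes the listed tasks over a shared next-token objective. The loader and train-step names are placeholders, and uniform task sampling is an assumption, not the released training recipe.

```python
import random

STAGE1_TASKS = ["image_reconstruction", "object_segmentation", "text_to_image"]  # alignment
STAGE2_TASKS = ["image_recovery", "subject_driven_generation"]                   # instruction tuning

def run_stage(tasks, loaders, train_step, steps):
    """Each step: pick a task, fetch a (condition, target-tokens) batch, apply the shared AR loss."""
    for _ in range(steps):
        task = random.choice(tasks)   # uniform mixing here; the real sampling weights may differ
        batch = next(loaders[task])   # loaders: dict mapping task name -> batch iterator
        train_step(batch)             # same next-token prediction objective for every task

# Usage (loaders and train_step come from your own pipeline):
#   run_stage(STAGE1_TASKS, stage1_loaders, train_step, steps=...)
#   run_stage(STAGE2_TASKS, stage2_loaders, train_step, steps=...)
```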

Data Construction

MENTOR data_construct

We construct a large-scale multimodal dataset of approximately 3 million samples covering all training tasks, using an automated pipeline that pairs state-of-the-art vision-language models with segmentation models (a simplified sketch of this step follows the list below). The dataset integrates:

  • Open-source resources (CC12M, Midjourney-Niji)
  • Synthetic data generated using T2I models (Flux.1, Stable Diffusion v3.5)
  • Automated annotations for fine-grained object-level tasks
  • Subject-driven data from OminiControl dataset with re-captioning
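
A hypothetical outline of the automated annotation step is sketched below; the `captioner`/`segmenter` wrappers and record fields are illustrative placeholders, not the released pipeline.

```python
def build_sample(image, captioner, segmenter):
    """Return one training record: caption, subject masks, and the target image."""
    caption = captioner(image)     # e.g. a VLM that (re-)captions the image in detail
    regions = segmenter(image)     # e.g. masks + labels for the salient objects
    return {
        "caption": caption,
        "subjects": [{"label": r["label"], "mask": r["mask"]} for r in regions],
        "target_image": image,
    }
```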

📈 Performance Results

On DreamBench++, MENTOR outperforms diffusion‑based baselines such as Emu2 and DreamEngine by ≈30% in CP·PF while using roughly a tenth of their training data.

Main Results on DreamBench++

All methods listed are test-time tuning-free.

| Method | T2I Model | Train Data | CP Animal | CP Human | CP Object | CP Style | CP Overall | PF Photo. | PF Style. | PF Imag. | PF Overall | CP·PF ↑ | CP/PF ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unified-IO2 | Unified-IO2 | 8.5B | 0.77 | 0.80 | 0.64 | 0.82 | 0.72 | 0.24 | 0.18 | 0.11 | 0.19 | 0.14 | 3.79 |
| Lumina-mGPT | Chameleon | 10M | 0.95 | 0.97 | 0.89 | 0.85 | 0.91 | 0.31 | 0.25 | 0.15 | 0.25 | 0.23 | 3.64 |
| DreamEngine | SD3.5 | 21M | 0.76 | 0.72 | 0.61 | 0.73 | 0.68 | 0.44 | 0.37 | 0.25 | 0.37 | 0.26 | 1.84 |
| BLIP-Diffusion | SD v1.5 | 130M | 0.67 | 0.56 | 0.47 | 0.51 | 0.55 | 0.58 | 0.51 | 0.30 | 0.50 | 0.27 | 1.10 |
| Emu2 | SDXL v1.0 | 16M | 0.67 | 0.55 | 0.45 | 0.45 | 0.53 | 0.73 | 0.72 | 0.56 | 0.69 | 0.36 | 0.77 |
| IP-Adapter | SDXL v1.0 | 10M | 0.67 | 0.56 | 0.50 | 0.75 | 0.59 | 0.74 | 0.63 | 0.45 | 0.64 | 0.38 | 0.92 |
| MENTOR | LlamaGen | 3M | 0.65 | 0.36 | 0.57 | 0.47 | 0.55 | 0.86 | 0.85 | 0.80 | 0.84 | 0.47 | 0.65 |
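
The two composite columns follow from the overall CP and PF entries: CP·PF is their product (higher is better) and CP/PF is their ratio, which the benchmark reads as a balance indicator (lower is better here). A quick check with the rounded table values:

```python
def composite(cp_overall, pf_overall):
    """CP·PF (product, higher is better) and CP/PF (ratio, lower is read as more balanced)."""
    return round(cp_overall * pf_overall, 2), round(cp_overall / pf_overall, 2)

print(composite(0.72, 0.19))  # Unified-IO2 -> (0.14, 3.79)
print(composite(0.55, 0.84))  # MENTOR      -> (0.46, 0.65); the reported 0.47 uses unrounded scores
```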

Key Findings

  • Highest CP·PF Score: MENTOR achieves 0.47, significantly outperforming all test-time tuning-free baselines
  • Balanced Performance: Lowest CP/PF ratio (0.65) indicates excellent balance between concept preservation and prompt following
  • Data Efficiency: Trained on only 3M pairs vs. 16-200M for competitors
  • Outperforms Larger Models: Surpasses Emu2 (37B params) and DreamEngine (10.5B params) despite smaller size

Image Reconstruction Performance

| Method | COCO L2 ↓ | JourneyDB L2 ↓ |
| --- | --- | --- |
| SeedTokenizer | 0.5102 | 0.5291 |
| SEED-X | 0.4317 | 0.4352 |
| EMU2-Gen | 0.3828 | 0.2869 |
| DreamEngine | 0.2065 | 0.2052 |
| MENTOR | 0.1008 | 0.0867 |

Lower is better. L2 distance measures pixel-level reconstruction error.
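
For reference, a minimal way to compute a pixel-level L2 reconstruction error of this kind is sketched below; the exact resolution and normalization behind the reported numbers are not specified here, so treat the convention (mean squared difference over pixel values in [0, 1]) as an assumption.

```python
import numpy as np

def l2_error(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Mean squared pixel difference; both arrays (H, W, C) with values in [0, 1]."""
    return float(np.mean((original - reconstructed) ** 2))

rng = np.random.default_rng(0)
a = rng.random((256, 256, 3))   # stand-in for the original image
b = rng.random((256, 256, 3))   # stand-in for the reconstruction
print(l2_error(a, a))           # 0.0 for a perfect reconstruction
print(l2_error(a, b))           # larger values mean worse reconstruction
```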

Ablation Study Results

| Configuration | CP | PF | CP·PF |
| --- | --- | --- | --- |
| w/o Stage 1 Alignment | 0.179 | 0.673 | 0.120 |
| w/o Object Segmentation (Stage 1) | 0.252 | 0.479 | 0.121 |
| w/o Image Recovery | 0.661 | 0.284 | 0.188 |
| w/o Object Segmentation (Stage 2) | 0.412 | 0.918 | 0.378 |
| w/o Multimodal T2I Task | 0.407 | 0.910 | 0.370 |
| MENTOR (Full Model) | 0.555 | 0.839 | 0.466 |

🎯 Versatile Applications

MENTOR demonstrates broad adaptability across diverse multimodal generation tasks with minimal fine-tuning. Our framework's versatility stems from its unified autoregressive approach and robust two-stage training paradigm.

MENTOR versatile applications showcase
Versatile applications built on MENTOR after simple fine-tuning on corresponding datasets, including image segmentation, multi-image generation, subject-driven image generation, and multimodal in-context learning image generation.

Supported Applications

🎭 Image Segmentation

Segmentation Example

End-to-end object segmentation with high spatial precision, enabled by Stage 1 training.

🖼️ Multi-Image Generation

Multi-Image Generation

Generate coherent outputs from multiple reference images, demonstrating multimodal alignment.

🧠 In-Context Learning

In-Context Learning Example

Adapt to unseen generation tasks from few multimodal examples, showing strong generalization.

📝 Key Insight: While achieving state-of-the-art performance in each specialized domain would require more targeted training and potentially more powerful components, these initial results underscore MENTOR's versatility and potential as an effective foundation for diverse multimodal conditional image generation applications.

📚 Citation

If you find MENTOR useful, please cite:

@misc{zhao2025mentorefficientmultimodalconditionedtuning,
      title={MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models}, 
      author={Haozhe Zhao and Zefan Cai and Shuzheng Si and Liang Chen and Jiuxiang Gu and Wen Xiao and Junjie Hu},
      year={2025},
      eprint={2507.09574},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.09574}, 
}