🔥 MM-IFInstruct/DPO Dataset & MM-IFEval Benchmark
Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users ask and whether they carry it out correctly. Existing multimodal instruction-following training data is scarce, existing benchmarks use simple, atomic instructions, and their evaluation strategies are imprecise for tasks that demand exact output constraints. To address this, we present MM-IFEngine, an effective pipeline for generating high-quality image-instruction pairs. MM-IFEngine yields MM-IFInstruct-23k, a large-scale, diverse, and high-quality training set suitable for Supervised Fine-Tuning (SFT), which we further extend into MM-IFDPO-23k for Direct Preference Optimization (DPO). We also introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that features (1) both textual constraints on output responses and visual constraints tied to the input images, and (2) a comprehensive evaluation pipeline combining rule-based assessment with LLM-as-a-Judge evaluation. Our SFT and DPO experiments demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+11.8%), MIA (+7.7%), and IFEval (+10.5%).
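To make the hybrid evaluation idea concrete, here is a minimal sketch of how a constraint can be checked by a deterministic rule when one exists and routed to an LLM judge otherwise. All function names, the constraint schema, and the `llm` callable are illustrative assumptions, not the repository's actual evaluation code.

```python
# Hybrid constraint checking: rule-based verification where a constraint is
# programmatically checkable, LLM-as-a-Judge as the fallback.
# Names and the constraint dict schema are illustrative, not the released API.

def check_word_limit(response: str, max_words: int) -> bool:
    """Rule-based check: response must not exceed max_words."""
    return len(response.split()) <= max_words

def check_keyword_present(response: str, keyword: str) -> bool:
    """Rule-based check: response must mention the required keyword."""
    return keyword.lower() in response.lower()

RULE_CHECKS = {
    "max_words": check_word_limit,
    "must_include": check_keyword_present,
}

def judge_with_llm(instruction: str, constraint: str, response: str, llm) -> bool:
    """Fallback for constraints with no deterministic rule (LLM-as-a-Judge).
    `llm` is any callable that maps a prompt string to a text completion."""
    prompt = (
        "Instruction: {i}\nConstraint: {c}\nResponse: {r}\n"
        "Does the response satisfy the constraint? Answer YES or NO."
    ).format(i=instruction, c=constraint, r=response)
    return llm(prompt).strip().upper().startswith("YES")

def evaluate(instruction: str, constraints: list[dict], response: str, llm) -> float:
    """Return the fraction of constraints the response satisfies."""
    passed = 0
    for c in constraints:
        checker = RULE_CHECKS.get(c["type"])
        if checker is not None:
            ok = checker(response, c["value"])
        else:
            ok = judge_with_llm(instruction, c["description"], response, llm)
        passed += int(ok)
    return passed / max(len(constraints), 1)
```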
Overall pipeline of MM-IFEngine. Part (a) illustrates the three-stage workflow of our engine: (1) Image Filtering, where irrelevant or low-quality images are removed; (2) Task Generation and Annotation Refinement, where GPT-4o generates tasks for images without QA pairs and refines existing annotations into instructional prompts; and (3) Constraint Integration, which draws on six main constraint categories and 32 subcategories while ensuring compatibility between constraints and tasks. MM-IFEngine is employed to construct both the MM-IF-Dataset and MM-IFEval, as depicted in parts (b) and (c), respectively. MM-IFEval further implements three evaluation metrics that combine rule-based verification functions with a judge model to ensure accurate and reliable assessment.
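The skeleton below mirrors the three stages named in the caption. The stage names follow the figure, but the data schema, helper functions (`is_high_quality`, `is_compatible`), and the `vlm` callable are placeholders assumed for illustration, not the released implementation.

```python
# Illustrative three-stage flow: image filtering -> task generation/refinement
# -> constraint integration. Helpers marked as placeholders are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Sample:
    image_path: str
    task: str                                   # base instruction for the image
    constraints: list[str] = field(default_factory=list)  # attached output constraints

def is_high_quality(path: str) -> bool:
    """Placeholder quality check; a real filter might use resolution or CLIP scores."""
    return path.lower().endswith((".jpg", ".png"))

def is_compatible(constraint: str, task: str) -> bool:
    """Placeholder compatibility check; the real pipeline verifies that a constraint
    (e.g. 'answer in exactly three sentences') makes sense for the generated task."""
    return True

def filter_images(image_paths: list[str]) -> list[str]:
    """Stage 1: drop irrelevant or low-quality images."""
    return [p for p in image_paths if is_high_quality(p)]

def generate_or_refine_task(image_path: str, vlm) -> str:
    """Stage 2: ask a strong VLM (e.g. GPT-4o) to propose a task for the image,
    or refine an existing annotation into an instruction-style prompt."""
    return vlm(f"Propose a concrete instruction-following task for image {image_path}.")

def build_pair(image_path: str, vlm, constraint_pool: list[str], k: int = 3) -> Sample:
    """Stages 2-3 for a single filtered image: generate the task, then attach up to
    k compatible constraints drawn from the constraint taxonomy."""
    task = generate_or_refine_task(image_path, vlm)
    compatible = [c for c in constraint_pool if is_compatible(c, task)]
    return Sample(image_path=image_path, task=task, constraints=compatible[:k])
```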
Main results on instruction-following benchmarks, including our proposed MM-IFEval, MIA-Bench, and IFEval. The symbol M refers to multi-modal benchmarks, and T denotes text-only benchmarks. We report compose-level ("C") and perception-level ("P") accuracy for MM-IFEval, prompt-level accuracy ("Prompt.") and instruction-level accuracy ("Inst.") for IFEval, and the average over all three benchmarks in the rightmost column.
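For readers unfamiliar with the two IFEval-style aggregates, the sketch below shows how they differ. It assumes each prompt comes with a list of per-constraint pass/fail booleans; this helper format is an illustrative assumption, not the benchmark's scoring code.

```python
# Prompt-level vs. instruction-level accuracy (IFEval-style aggregates).

def prompt_level_accuracy(results: list[list[bool]]) -> float:
    """A prompt counts as correct only if every one of its constraints is satisfied."""
    return sum(all(r) for r in results) / len(results)

def instruction_level_accuracy(results: list[list[bool]]) -> float:
    """Each constraint is scored independently, then averaged over all constraints."""
    flat = [ok for r in results for ok in r]
    return sum(flat) / len(flat)

# Example: two prompts, the first passes 2 of 3 constraints, the second passes all.
results = [[True, True, False], [True, True]]
print(prompt_level_accuracy(results))       # 0.5
print(instruction_level_accuracy(results))  # 0.8
```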
Performance of existing MLLMs on MM-IFEval. We report accuracy on easy and difficult problems separately, as well as the average accuracy across all problems. C-Level and P-Level refer to compose-level and perception-level problems, respectively. The best performance in each section is highlighted in bold.