🔥 MM-IFInstruct/DPO Dataset & MM-IFEval Benchmark
Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users ask and whether they carry it out correctly. Existing multimodal instruction-following training data is scarce, existing benchmarks use simple, atomic instructions, and their evaluation strategies are imprecise for tasks that demand exact output constraints. To address this, we present MM-IFEngine, an effective pipeline for generating high-quality image-instruction pairs. MM-IFEngine yields MM-IFInstruct-23k, a large-scale, diverse, and high-quality training set suitable for Supervised Fine-Tuning (SFT), which we further extend into MM-IFDPO-23k for Direct Preference Optimization (DPO). We also introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that features (1) both textual constraints on output responses and visual constraints tied to the input images, and (2) a comprehensive evaluation pipeline combining rule-based assessment with LLM-as-a-Judge evaluation. Our SFT and DPO experiments demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+11.8%), MIA (+7.7%), and IFEval (+10.5%).
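To make the hybrid evaluation idea concrete, here is a minimal sketch of how a constraint can be checked by a deterministic rule when one exists and routed to an LLM judge otherwise. All function names, the constraint schema, and the `llm` callable are illustrative assumptions, not the repository's actual evaluation code.

```python
# Hybrid constraint checking: rule-based verification where a constraint is
# programmatically checkable, LLM-as-a-Judge as the fallback.
# Names and the constraint dict schema are illustrative, not the released API.

def check_word_limit(response: str, max_words: int) -> bool:
    """Rule-based check: response must not exceed max_words."""
    return len(response.split()) <= max_words

def check_keyword_present(response: str, keyword: str) -> bool:
    """Rule-based check: response must mention the required keyword."""
    return keyword.lower() in response.lower()

RULE_CHECKS = {
    "max_words": check_word_limit,
    "must_include": check_keyword_present,
}

def judge_with_llm(instruction: str, constraint: str, response: str, llm) -> bool:
    """Fallback for constraints with no deterministic rule (LLM-as-a-Judge).
    `llm` is any callable that maps a prompt string to a text completion."""
    prompt = (
        "Instruction: {i}\nConstraint: {c}\nResponse: {r}\n"
        "Does the response satisfy the constraint? Answer YES or NO."
    ).format(i=instruction, c=constraint, r=response)
    return llm(prompt).strip().upper().startswith("YES")

def evaluate(instruction: str, constraints: list[dict], response: str, llm) -> float:
    """Return the fraction of constraints the response satisfies."""
    passed = 0
    for c in constraints:
        checker = RULE_CHECKS.get(c["type"])
        if checker is not None:
            ok = checker(response, c["value"])
        else:
            ok = judge_with_llm(instruction, c["description"], response, llm)
        passed += int(ok)
    return passed / max(len(constraints), 1)
```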
Overall pipeline of MM-IFEngine. Part (a) illustrates the three-stage workflow of our engine: (1) Image Filtering, where irrelevant or low-quality images are removed; (2) Task Generation and Annotation Refinement, where GPT-4o generates tasks for images without QA pairs and refines existing annotations into instructional prompts; and (3) Constraint Integration, which draws on six main constraint categories and 32 subcategories while ensuring compatibility between constraints and tasks. MM-IFEngine is employed to construct both the MM-IF-Dataset and MM-IFEval, as depicted in parts (b) and (c), respectively. MM-IFEval further implements three evaluation metrics that combine rule-based verification functions with a judge model to ensure accurate and reliable assessment.
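The skeleton below mirrors the three stages named in the caption. The stage names follow the figure, but the data schema, helper functions (`is_high_quality`, `is_compatible`), and the `vlm` callable are placeholders assumed for illustration, not the released implementation.

```python
# Illustrative three-stage flow: image filtering -> task generation/refinement
# -> constraint integration. Helpers marked as placeholders are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Sample:
    image_path: str
    task: str                                   # base instruction for the image
    constraints: list[str] = field(default_factory=list)  # attached output constraints

def is_high_quality(path: str) -> bool:
    """Placeholder quality check; a real filter might use resolution or CLIP scores."""
    return path.lower().endswith((".jpg", ".png"))

def is_compatible(constraint: str, task: str) -> bool:
    """Placeholder compatibility check; the real pipeline verifies that a constraint
    (e.g. 'answer in exactly three sentences') makes sense for the generated task."""
    return True

def filter_images(image_paths: list[str]) -> list[str]:
    """Stage 1: drop irrelevant or low-quality images."""
    return [p for p in image_paths if is_high_quality(p)]

def generate_or_refine_task(image_path: str, vlm) -> str:
    """Stage 2: ask a strong VLM (e.g. GPT-4o) to propose a task for the image,
    or refine an existing annotation into an instruction-style prompt."""
    return vlm(f"Propose a concrete instruction-following task for image {image_path}.")

def build_pair(image_path: str, vlm, constraint_pool: list[str], k: int = 3) -> Sample:
    """Stages 2-3 for a single filtered image: generate the task, then attach up to
    k compatible constraints drawn from the constraint taxonomy."""
    task = generate_or_refine_task(image_path, vlm)
    compatible = [c for c in constraint_pool if is_compatible(c, task)]
    return Sample(image_path=image_path, task=task, constraints=compatible[:k])
```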
Main results on instruction-following benchmarks, including our proposed MM-IFEval, MIA-Bench, and IFEval. The symbol M refers to multi-modal benchmarks, and T denotes text-only benchmarks. We report compose-level ("C") and perception-level ("P") accuracy for MM-IFEval, prompt-level accuracy ("Prompt.") and instruction-level accuracy ("Inst.") for IFEval, and the average over all three benchmarks in the rightmost column.
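For readers unfamiliar with the two IFEval-style aggregates, the sketch below shows how they differ. It assumes each prompt comes with a list of per-constraint pass/fail booleans; this helper format is an illustrative assumption, not the benchmark's scoring code.

```python
# Prompt-level vs. instruction-level accuracy (IFEval-style aggregates).

def prompt_level_accuracy(results: list[list[bool]]) -> float:
    """A prompt counts as correct only if every one of its constraints is satisfied."""
    return sum(all(r) for r in results) / len(results)

def instruction_level_accuracy(results: list[list[bool]]) -> float:
    """Each constraint is scored independently, then averaged over all constraints."""
    flat = [ok for r in results for ok in r]
    return sum(flat) / len(flat)

# Example: two prompts, the first passes 2 of 3 constraints, the second passes all.
results = [[True, True, False], [True, True]]
print(prompt_level_accuracy(results))       # 0.5
print(instruction_level_accuracy(results))  # 0.8
```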
Performance of existing MLLMs on MM-IFEval. We report accuracy on easy and difficult problems separately, as well as the average accuracy across all problems. C-Level and P-Level refer to compose-level and perception-level problems, respectively. The best performance in each section is highlighted in bold.