Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less textual data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge systems.
📖 Refer to our paper for more examples across different modalities and evaluation tasks, including image understanding, generation, and editing, and even molecular understanding.
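To give a quick sense of the judge-style inference pattern, below is a minimal, text-only sketch using Hugging Face `transformers`: the model writes out structured reasoning and then a final verdict. The checkpoint name and prompt template are illustrative assumptions, not the repo's exact interface; see the paper for the released checkpoints and the multimodal prompting format.

```python
# Minimal, text-only sketch of judge-style inference. The model ID and prompt
# template below are illustrative assumptions, not the repo's exact interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/Flex-Judge"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "You are an impartial judge. Compare the two responses to the question, "
    "reason step by step, and end with a final verdict of [[A]] or [[B]].\n\n"
    "Question: Describe the weather in the image.\n"
    "Response A: ...\n"
    "Response B: ...\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens and print the generated reasoning + verdict.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```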
We comprehensively evaluate Flex-Judge across diverse modalities, including images, videos, and audio, demonstrating its generalization capability and competitive performance against state-of-the-art judge models. Notably, it matches closed-source commercial APIs on vision tasks and outperforms all training-free evaluators in audio understanding:
Table 1. Comparison of MLLM evaluator performance on the MLLM-as-a-Judge benchmark. Training-free (TF) models have not been trained on multimodal evaluation data.
Table 2. Comparison of MLLM evaluator performance on VL-RewardBench (Left) and MJ-Bench (Right).
Table 3. Audio MOS/SS prediction results on NISQA, BVCC, SOMOS, and VoxSim. System-level results are computed by averaging the utterance-level results within each TTS system.
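As a concrete illustration of the system-level protocol described in Table 3, the sketch below averages utterance-level scores within each TTS system before computing correlations. The DataFrame column names (`system`, `pred_mos`, `true_mos`) are hypothetical placeholders, not the names used in our evaluation scripts.

```python
# Sketch of the system-level evaluation in Table 3: average the utterance-level
# scores within each TTS system, then correlate the per-system averages.
# Column names ("system", "pred_mos", "true_mos") are hypothetical placeholders.
import pandas as pd
from scipy.stats import pearsonr, spearmanr

def system_level_scores(df: pd.DataFrame) -> dict:
    per_system = df.groupby("system")[["pred_mos", "true_mos"]].mean()
    lcc, _ = pearsonr(per_system["pred_mos"], per_system["true_mos"])
    srcc, _ = spearmanr(per_system["pred_mos"], per_system["true_mos"])
    return {"LCC": lcc, "SRCC": srcc}
```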
```bibtex
@article{ko2025flex,
  title={Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators},
  author={Ko, Jongwoo and Kim, Sungnyun and Cho, Sungwoo and Yun, Se-Young},
  journal={arXiv preprint arXiv:2505.18601},
  year={2025}
}
```