StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Xixiang He¹, Baiqi Wu², Xingming Li¹, Ao Cheng¹, Qiyao Sun¹, Xuanyu Ji¹, Qingyong Hu³*

¹National University of Defense Technology ²Zhejiang University ³Intelligent Game and Decision Lab

*Corresponding author

StemBind overview: nine RI/VP operations, shared-stem P/R/F probes, and S1-S4 process stages.

MLLMs often know the rule but pick the wrong answer: a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. StemBind is built to catch exactly this failure on the same visual evidence.

Abstract

Existing abstract visual reasoning (AVR) benchmarks collapse perception, rule induction, and answer selection into a single right-or-wrong signal, so they cannot tell why a model failed. We introduce StemBind, a shared-stem diagnostic benchmark that probes the same visual stem with three aligned questions (Perception, Rule, and Full), so a final-answer error can be attributed to a specific sub-step on the same evidence. StemBind contains 2,298 curated knowledge-light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated with Sternberg's four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply).

Evaluating 24 frontier MLLM configurations yields four findings: (i) rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures happen after the rule is identified; (ii) even when perception and rule are both correct on the same stem, models still answer Full incorrectly 51.2% of the time; (iii) process diagnostics and Stage-wise Stimulus Augmentation localize the dominant failure to S3 rule-to-instance mapping; and (iv) neither scaling up model size nor enabling explicit thinking reliably closes the gap. StemBind reframes AVR evaluation from final-answer ranking to locating where abstract visual reasoning breaks down.

Benchmark at a Glance

2,298 curated stems · 19,533 P/R/F tasks · 9 visual operations · S1–S4 process stages · 24 MLLM configs

Task split: 14,937 Perception · 2,298 Rule · 2,298 Full. Operations span the Rule-Induction family (RI-Pos, RI-Sty, RI-Attr, RI-Qty, RI-Rel) and the Visual-Processing family (VP-Fold, VP-View, VP-Rot, VP-Closure).
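The split can be sanity-checked with a few lines of arithmetic. The sketch below uses only the counts and operation labels reported above; the average number of Perception probes per stem is derived here and is not stated on this page, and the per-stem distribution of probes is not given.

```python
# Task counts as reported in the task split above.
PERCEPTION_TASKS = 14_937
RULE_TASKS = 2_298
FULL_TASKS = 2_298
STEMS = 2_298

total_tasks = PERCEPTION_TASKS + RULE_TASKS + FULL_TASKS

# Derived quantity (not stated in the source): mean Perception probes per stem.
perception_per_stem = PERCEPTION_TASKS / STEMS

# Operation labels as listed above, grouped by family.
OPERATIONS = {
    "RI": ["RI-Pos", "RI-Sty", "RI-Attr", "RI-Qty", "RI-Rel"],
    "VP": ["VP-Fold", "VP-View", "VP-Rot", "VP-Closure"],
}
operation_count = sum(len(ops) for ops in OPERATIONS.values())

print(total_tasks)          # 19533
print(perception_per_stem)  # 6.5
print(operation_count)      # 9
```

Each stem therefore carries exactly one Rule probe and one Full probe, plus 6.5 Perception probes on average.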

Key Findings

Finding 1

The R–F chasm and a persistent binding gap

Rule accuracy exceeds full-item accuracy on 22 of 24 models, often by 20–34 points. Even on the strict subset where perception and rule are both correct on the same stem, models still miss the full item 51.2% of the time. The failure lies in binding the rule to a concrete instance, as observed in model behavior, not in a missing rule.
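The two quantities in this finding can be computed from per-stem outcomes roughly as follows. This is a minimal sketch, not the authors' evaluation harness: the `StemResult` record and both function names are hypothetical, the toy rows are made up for illustration, and `perception_ok` assumes that a stem counts as perception-correct only when all of its Perception probes are answered correctly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StemResult:
    """Hypothetical per-stem record for one model."""
    stem_id: str
    perception_ok: bool  # assumption: all Perception probes on the stem correct
    rule_ok: bool
    full_ok: bool


def rule_full_gap(results: list[StemResult]) -> float:
    """Rule accuracy minus Full accuracy, in percentage points."""
    n = len(results)
    rule_acc = sum(r.rule_ok for r in results) / n
    full_acc = sum(r.full_ok for r in results) / n
    return 100 * (rule_acc - full_acc)


def binding_failure_rate(results: list[StemResult]) -> Optional[float]:
    """Share of Full errors on stems where perception and rule are both correct."""
    strict = [r for r in results if r.perception_ok and r.rule_ok]
    if not strict:
        return None  # no stem qualifies for the strict subset
    return sum(not r.full_ok for r in strict) / len(strict)


# Made-up toy data, not results from the benchmark.
toy = [
    StemResult("s1", True, True, True),
    StemResult("s2", True, True, False),
    StemResult("s3", False, True, False),
    StemResult("s4", True, False, False),
]

print(rule_full_gap(toy))         # 50.0
print(binding_failure_rate(toy))  # 0.5
```

Conditioning on the same stem is what separates a binding failure from an ordinary accuracy gap: stems `s3` and `s4` lower Full accuracy but are excluded from the strict subset.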

Finding 2

The bottleneck is S3 Map

Stage-wise judging and Stage-wise Stimulus Augmentation (SSA) both point to S3 rule-to-instance mapping: S3 is the weakest stage on every full-split row, and the largest SSA gain appears precisely when S3 alignment is injected. This is a behavioral localization, not a mechanistic claim.
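The localization logic can be illustrated with a small sketch: find the stage with the lowest judged score, and separately find the stage whose injected stimulus yields the largest gain over baseline. The function names and all numbers below are hypothetical placeholders rather than values from the paper, and the actual SSA protocol may differ in how stimuli are injected and scored.

```python
STAGES = ("S1 Encode", "S2 Infer", "S3 Map", "S4 Apply")


def weakest_stage(stage_scores: dict[str, float]) -> str:
    """Return the stage with the lowest judged score."""
    return min(STAGES, key=lambda s: stage_scores[s])


def largest_ssa_gain(baseline: float, augmented: dict[str, float]) -> tuple[str, float]:
    """Return the stage whose injected stimulus improves Full accuracy the most."""
    gains = {s: augmented[s] - baseline for s in STAGES}
    best = max(STAGES, key=lambda s: gains[s])
    return best, gains[best]


# Placeholder values for illustration only, not reported results.
toy_stage_scores = {"S1 Encode": 90, "S2 Infer": 80, "S3 Map": 50, "S4 Apply": 70}
toy_baseline = 40
toy_augmented = {"S1 Encode": 42, "S2 Infer": 45, "S3 Map": 60, "S4 Apply": 44}

print(weakest_stage(toy_stage_scores))                # S3 Map
print(largest_ssa_gain(toy_baseline, toy_augmented))  # ('S3 Map', 20)
```

When both diagnostics return the same stage, as in the toy data, the two lines of evidence agree on where the behavior breaks down.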

Finding 3

Scaling does not close the gap

Within-family scaling shifts perception and rule far more than full-item accuracy. In the Qwen3.5 family, the dense models peak before the MoE variants and the largest MoE variant collapses on rule, while Gemma 4 and InternVL3.5 improve on Full yet keep the R–F chasm.

Family scaling across Qwen3.5, Gemma 4, InternVL3.5

Finding 4

Thinking does not repair the gap

Across paired direct/thinking rows, explicit thinking lifts perception on nine of ten rows but lowers both rule and full-item accuracy on every row. Longer traces aid local descriptions but break the slot-level correspondence the full item requires.
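A paired comparison of this kind reduces to counting, per metric, how many rows move up or down when thinking is enabled. The sketch below assumes each row is a pair of hypothetical score dictionaries keyed by `"P"`, `"R"`, and `"F"`; the helper name and the toy values are illustrative and are not the paper's numbers.

```python
Row = dict[str, float]


def direction_counts(pairs: list[tuple[Row, Row]], metric: str) -> tuple[int, int]:
    """Count rows where thinking raises (up) or lowers (down) the given metric."""
    up = sum(1 for direct, thinking in pairs if thinking[metric] > direct[metric])
    down = sum(1 for direct, thinking in pairs if thinking[metric] < direct[metric])
    return up, down


# Made-up (direct, thinking) pairs for illustration only.
toy_pairs = [
    ({"P": 70.0, "R": 60.0, "F": 30.0}, {"P": 74.0, "R": 55.0, "F": 28.0}),
    ({"P": 65.0, "R": 58.0, "F": 35.0}, {"P": 64.0, "R": 50.0, "F": 31.0}),
]

for metric in ("P", "R", "F"):
    print(metric, direction_counts(toy_pairs, metric))
# P (1, 1)
# R (0, 2)
# F (0, 2)
```

Ties are counted in neither direction, so `up + down` can be smaller than the number of rows.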

Leaderboard

Aggregate P/R/F performance across evaluated models

Full per-model results and the evaluation harness are in the code repository.

Citation

@article{he2026stembind,
  title   = {StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning},
  author  = {He, Xixiang and Wu, Baiqi and Li, Xingming and Cheng, Ao and Sun, Qiyao and Ji, Xuanyu and Hu, Qingyong},
  journal = {arXiv preprint arXiv:2606.00148},
  year    = {2026}
}