Vision-Language Models
NLP Research
LoRA Fine-tuning
2026
Northeastern University
VLM Counterfactual Consistency
Research investigating whether large vision-language models truly reason about images or
rely on superficial pattern-matching — using counterfactual question families and a novel Consistency Score
metric.
Khoury College · Northeastern University
LLaVA-1.5 · InstructBLIP · LoRA · GQA
Standard VQA accuracy metrics hide a critical weakness: a model can score 50% on a benchmark while
failing basic logical reasoning. If a model answers "red" to "What color is the car?",
it should logically answer "no" to "Is the car blue?" — but many state-of-the-art VLMs
don't.
The Research Question
Do vision-language models understand the logical relationships between related questions about the
same image, or do they simply pattern-match each question independently?
Approach
Counterfactual question families
For each original question, we generate a "family" of 1–3 logical variants using four intervention types (a generation sketch follows the four types below). The model is evaluated not just on each question individually, but on whether its answers across the whole
family satisfy the expected logical relations.
Type 01
Negation
Yes/no questions are flipped. If the model says "yes" to the original, it must say
"no" to the negated form — and vice versa.
Type 02
Attribute Swaps
Color, size, or material is changed to a different value. The model must answer
based on the new attribute, not just recall the entity.
Type 03
Entailment
Logical implications are tested. If A is true, then B (which follows from A) should
also be true — the model must maintain logical soundness.
Type 04
Spatial Perturbations
Left/right, on/under, and other spatial relations are reversed. Tests whether the
model actually reads spatial language or guesses from priors.
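To make the dispatch on intervention type concrete, here is a minimal sketch of how a question family might be assembled. The function names, the naive rewrite rules, and the `meta` fields are illustrative assumptions rather than the project's actual GQA-specific generator, and only two of the four intervention types are shown.

```python
# Illustrative sketch only: the real GQA rewrites use richer linguistic rules.
NEGATE_ANSWER = {"yes": "no", "no": "yes"}

def negate_yes_no(question: str) -> str:
    """Naive negation: insert 'not' before the final word,
    e.g. 'Is the car blue?' -> 'Is the car not blue?'."""
    words = question.rstrip("?").split()
    return " ".join(words[:-1] + ["not", words[-1]]) + "?"

def swap_attribute(question: str, old_value: str, new_value: str) -> str:
    """Attribute swap: replace the queried attribute with a different value."""
    return question.replace(old_value, new_value)

def make_family(question: str, answer: str, q_type: str, meta: dict) -> list[dict]:
    """Return the original question plus logical variants with expected answers."""
    family = [{"question": question, "expected": answer, "relation": "original"}]
    if q_type == "yes/no":
        # Type 01 - Negation: flipping the question flips the expected answer.
        family.append({"question": negate_yes_no(question),
                       "expected": NEGATE_ANSWER[answer],
                       "relation": "negation"})
    elif q_type == "attribute":
        # Type 02 - Attribute swap: the model must answer for the new attribute.
        family.append({"question": swap_attribute(question, meta["attribute"], meta["new_attribute"]),
                       "expected": meta["new_answer"],
                       "relation": "attribute_swap"})
    return family
```

For instance, `make_family("Is the car blue?", "no", "yes/no", {})` returns the original plus a negated variant "Is the car not blue?" with expected answer "yes", which is exactly the relation the Consistency Score later checks.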
Results
Consistency Score vs. VQA Accuracy
Evaluated on 1,000 GQA questions, which yield 1,961 counterfactuals in total:
| Model | Consistency Score | VQA Accuracy |
|---|---|---|
| LLaVA-1.5-7B (base) | 0.6925 | 0.439 |
| InstructBLIP-FlanT5-XL | 0.6263 | 0.502 |
| LLaVA-1.5 + LoRA | 0.6598 | 0.493 |
Pass rate by intervention type
| Intervention | LLaVA-1.5 | InstructBLIP | LLaVA + LoRA |
|---|---|---|---|
| Entailment | 99.1% | 90.6% | — |
| Spatial | 53.0% | 27.6% | 45.1% |
| Negation | 14.9% | 27.0% | 17.5% |
| Attribute swap | 6.8% | 6.2% | 8.6% |
Key Finding
LLaVA shows higher consistency (0.69 vs 0.63) despite lower raw accuracy — suggesting better compositional
reasoning. Both models fail catastrophically on attribute swaps (<7%): they answer based on the
entity, not the changed attribute. Negation is also hard for both (15–27%).
LoRA Fine-tuning
Pairwise consistency loss
We designed a novel pairwise consistency loss that operates on answer token
distributions across question families, applied during LoRA rank-8 fine-tuning (see the sketch after the loss components below):
Loss 01
Contradiction Loss
Penalises the model for giving the same answer to questions whose answers should be logically opposite
(negation pairs).
Loss 02
Entailment Loss
Penalises the model for failing to answer "yes" to a counterfactual that logically
follows from the original.
Loss 03
Attribute / Spatial Loss
Penalises similar output distributions for questions that differ only in a swapped
attribute or spatial relation.
Total training objective: CE_orig + CE_cf + λ · PC_loss
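As a rough illustration of how such a loss could act on answer-token distributions, the sketch below gives one plausible form of each component. The function names, the exact per-relation formulas, and the λ weight are assumptions for exposition, not the project's actual implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_consistency_loss(orig_logits, cf_logits, relation, yes_id, no_id):
    """One plausible pairwise consistency term over answer-token distributions.
    orig_logits / cf_logits: logits over the vocabulary at the answer position,
    shape (batch, vocab). `relation` names the intervention linking the pair."""
    p_orig = F.softmax(orig_logits, dim=-1)
    p_cf = F.softmax(cf_logits, dim=-1)

    if relation == "negation":
        # Contradiction loss: penalise probability mass placed on the same
        # yes/no polarity for a pair whose answers should be opposite.
        agree = p_orig[:, yes_id] * p_cf[:, yes_id] + p_orig[:, no_id] * p_cf[:, no_id]
        return agree.mean()
    elif relation == "entailment":
        # Entailment loss: the entailed counterfactual should answer "yes".
        return -torch.log(p_cf[:, yes_id] + 1e-8).mean()
    else:  # "attribute_swap" or "spatial"
        # Attribute / spatial loss: penalise near-identical answer distributions
        # for questions that differ only in the swapped attribute or relation.
        return F.cosine_similarity(p_orig, p_cf, dim=-1).mean()

def total_objective(ce_orig, ce_cf, pc_loss, lam=0.5):
    # CE_orig + CE_cf + λ · PC_loss, matching the objective stated above.
    return ce_orig + ce_cf + lam * pc_loss
```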
Tradeoff observed: LoRA fine-tuning improves VQA accuracy (+5.4%) and attribute
swap reasoning (+1.8%), but hurts spatial reasoning (−7.9%) — revealing that different reasoning modes compete
during training. This tension itself is a meaningful finding about how VLMs learn.
Pipeline
Five sequential phases
1
Counterfactual Generation
Converts raw GQA questions into families (original + 1–3 logical variants) by
dispatching on question type to apply the appropriate intervention.
2
VLM Inference
Runs pretrained models on all family members. Auto-detects VRAM and falls back to 4-bit
quantization (QLoRA-style) when memory is tight, enabling inference on 8 GB GPUs (see the loading sketch after this list).
3
Consistency Scoring
Evaluates per-family scores by checking logical relations. Produces per-intervention
and per-question-type breakdowns for fine-grained analysis.
4
Consistency-Aware Training
LoRA rank-8 fine-tuning on LLaVA-1.5 using the pairwise consistency loss alongside
standard cross-entropy.
5
Generalization Evaluation
Validates that GQA-trained counterfactual fine-tuning transfers to VQA v2 without
degrading standard accuracy.
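For phase 2, a plausible loading helper is sketched below: it checks available VRAM and falls back to 4-bit NF4 quantization via bitsandbytes when the GPU is small. The model id, threshold, and function name are illustrative choices; the actual pipeline may wire this differently.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig

def load_vlm(model_id: str = "llava-hf/llava-1.5-7b-hf", vram_threshold_gb: float = 16.0):
    """Load a VLM, quantizing to 4-bit NF4 when the GPU has limited memory."""
    quant_cfg = None
    if torch.cuda.is_available():
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if total_gb < vram_threshold_gb:
            # Low-VRAM path: 4-bit NF4 weights with fp16 compute (as used for QLoRA).
            quant_cfg = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch.float16,
            )
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=quant_cfg,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```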
Tech Stack
Models, frameworks, datasets
Models
LLaVA-1.5-7B
InstructBLIP-FlanT5-XL
Frameworks
PyTorch
HuggingFace Transformers
PEFT (LoRA)
bitsandbytes
Supporting
scikit-learn
pandas
matplotlib
OpenCV
Outcome
Accuracy alone doesn't tell the whole story of visual reasoning.
This work introduces a reproducible framework for probing VLM reasoning beyond accuracy metrics. The
Consistency Score reveals a clear gap: today's best open-source VLMs are strong at entailment but fail
dramatically at attribute-level reasoning and logical negation. The pairwise consistency loss shows that
targeted fine-tuning can close part of this gap — but at the cost of other reasoning modes, pointing to a
fundamental tension worth exploring further.