Vision-Language Models · NLP Research · LoRA Fine-tuning · 2026 · Northeastern University

VLM Counterfactual Consistency

Research investigating whether large vision-language models truly reason about images or rely on superficial pattern-matching — using counterfactual question families and a novel Consistency Score metric.

Research · 2026
Khoury College · Northeastern University
LLaVA-1.5 · InstructBLIP · LoRA · GQA

Standard VQA accuracy metrics hide a critical weakness: a model can score 50% on a benchmark while failing basic logical reasoning. If a model answers "red" to "What color is the car?", it should logically answer "no" to "Is the car blue?" — but many state-of-the-art VLMs don't.

The Research Question

Do vision-language models understand the logical relationships between related questions about the same image, or do they simply pattern-match each question independently?

Approach

Counterfactual question families

For each original question, we generate a "family" of 1–3 logical variants using four intervention types. The model is evaluated not just on each question individually, but on whether its answers across the whole family satisfy the expected logical relations.

Type 01
Negation
Yes/no questions are flipped. If the model says "yes" to the original, it must say "no" to the negated form — and vice versa.
Type 02
Attribute Swaps
Color, size, or material is changed to a different value. The model must answer based on the new attribute, not just recall the entity.
Type 03
Entailment
Logical implications are tested. If A is true, then B (which follows from A) should also be true — the model must maintain logical soundness.
Type 04
Spatial Perturbations
Left/right, on/under, and other spatial relations are reversed. Tests whether the model actually reads spatial language or guesses from priors.
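To make the interventions concrete, here is a minimal sketch of how a question family could be generated by dispatching on question type. The swap vocabularies, function name, and regex templates are illustrative assumptions, not the project's actual generator; entailment variants are omitted because they require scene-graph information.

```python
import re

# Hypothetical swap vocabularies; the real generator presumably draws
# attributes and relations from the GQA annotations.
COLOR_SWAPS = {"red": "blue", "blue": "green", "white": "black"}
SPATIAL_SWAPS = {"left": "right", "right": "left", "on": "under", "under": "on"}


def build_family(question: str, answer: str) -> list[dict]:
    """Original question plus up to three counterfactual variants, each
    paired with the answer that logical consistency requires."""
    family = [{"type": "original", "question": question, "expected": answer}]

    # Negation: flip the expected answer of a yes/no question.
    if answer.lower() in {"yes", "no"}:
        negated = re.sub(r"^(Is|Are|Does|Do) (the \w+)", r"\1 \2 not", question)
        if negated != question:
            family.append({"type": "negation", "question": negated,
                           "expected": "no" if answer.lower() == "yes" else "yes"})

    # Attribute swap / spatial perturbation: if the original attribute or
    # relation holds ("yes"), the swapped version should be answered "no".
    if answer.lower() == "yes":
        for cf_type, table in [("attribute_swap", COLOR_SWAPS),
                               ("spatial", SPATIAL_SWAPS)]:
            for old, new in table.items():
                if re.search(rf"\b{old}\b", question):
                    family.append({"type": cf_type,
                                   "question": re.sub(rf"\b{old}\b", new, question, count=1),
                                   "expected": "no"})
                    break

    return family


# Example: build_family("Is the car red?", "yes") yields
#   negation:       "Is the car not red?"  expected "no"
#   attribute_swap: "Is the car blue?"     expected "no"
```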
Results

Consistency Score vs. VQA Accuracy

Evaluated on 1,000 GQA questions, which generated 1,961 counterfactuals in total:

Model Consistency Score VQA Accuracy
LLaVA-1.5-7B (base) 0.6925 0.439
InstructBLIP-FlanT5-XL 0.6263 0.502
LLaVA-1.5 + LoRA 0.6598 0.493
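The exact scoring rule is not reproduced here; a minimal sketch of one plausible formulation, assuming the Consistency Score is the fraction of satisfied logical checks per family, averaged over all families:

```python
def family_consistency(family: list[dict]) -> float:
    """Fraction of counterfactual checks in one question family whose
    logically required answer matches the model's answer."""
    checks = [m for m in family if m["type"] != "original"]
    if not checks:
        return 1.0
    passed = sum(m["model_answer"].strip().lower() == m["expected"].lower()
                 for m in checks)
    return passed / len(checks)


def consistency_score(families: list[list[dict]]) -> float:
    """Dataset-level Consistency Score: mean per-family consistency."""
    return sum(family_consistency(f) for f in families) / len(families)
```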

Pass rate by intervention type

Intervention LLaVA-1.5 InstructBLIP LLaVA + LoRA
Entailment 99.1% 90.6%
Spatial 53.0% 27.6% 45.1%
Negation 14.9% 27.0% 17.5%
Attribute swap 6.8% 6.2% 8.6%
Key Finding

LLaVA shows higher consistency (0.69 vs 0.63) despite lower raw accuracy — suggesting better compositional reasoning. Both models fail catastrophically on attribute swaps (<7%): they answer based on the entity, not the changed attribute. Negation is also hard for both (15–27%).

LoRA Fine-tuning

Pairwise consistency loss

We designed a novel pairwise consistency loss that operates on answer token distributions across question families, applied during LoRA rank-8 fine-tuning:

Loss 01
Contradiction Loss
Penalises the model when it gives the same answer to a question pair whose answers should be logically opposite (negation pairs).
Loss 02
Entailment Loss
Penalises the model when it fails to answer "yes" to a counterfactual question that is logically entailed by the original.
Loss 03
Attribute / Spatial Loss
Penalises similar output distributions for questions that differ only in a swapped attribute or spatial relation.

Total training objective: CE_orig + CE_cf + λ · PC_loss
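A simplified PyTorch sketch of how such a loss could be assembled from the three components above. The similarity measure (cosine similarity over answer token distributions) and the λ value are assumptions, not the project's exact formulation:

```python
import torch
import torch.nn.functional as F


def pairwise_consistency_loss(p_orig: torch.Tensor,
                              p_cf: torch.Tensor,
                              pair_type: str,
                              yes_token_id: int) -> torch.Tensor:
    """p_orig / p_cf: softmax distributions over the answer token
    (shape [vocab_size]) for the original and counterfactual question.
    The three branches mirror the three loss components described above."""
    if pair_type == "negation":
        # Contradiction loss: agreeing distributions on a negation pair are
        # penalised, so we minimise their similarity (assumed: cosine).
        return F.cosine_similarity(p_orig.unsqueeze(0), p_cf.unsqueeze(0)).mean()
    if pair_type == "entailment":
        # Entailment loss: the counterfactual should put its mass on "yes".
        return -torch.log(p_cf[yes_token_id] + 1e-8)
    # Attribute / spatial loss: questions that differ in a swapped attribute
    # or relation should not produce near-identical answer distributions.
    return F.cosine_similarity(p_orig.unsqueeze(0), p_cf.unsqueeze(0)).mean()


def total_loss(ce_orig, ce_cf, p_orig, p_cf, pair_type, yes_token_id, lam=0.5):
    """CE_orig + CE_cf + lambda * PC_loss, as in the objective above.
    lambda = 0.5 is an illustrative value, not the tuned one."""
    pc = pairwise_consistency_loss(p_orig, p_cf, pair_type, yes_token_id)
    return ce_orig + ce_cf + lam * pc
```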

Tradeoff observed: LoRA fine-tuning improves VQA accuracy (+5.4%) and attribute swap reasoning (+1.8%), but hurts spatial reasoning (−7.9%) — revealing that different reasoning modes compete during training. This tension itself is a meaningful finding about how VLMs learn.

Pipeline

Five sequential phases

1
Counterfactual Generation
Converts raw GQA questions into families (original + 1–3 logical variants) by dispatching on question type to apply the appropriate intervention.
2
VLM Inference
Runs pretrained models on all family members. Auto-detects available VRAM and falls back to 4-bit (QLoRA-style) quantization when memory is tight, enabling inference on 8 GB GPUs (see the loading sketch after this list).
3
Consistency Scoring
Evaluates per-family scores by checking logical relations. Produces per-intervention and per-question-type breakdowns for fine-grained analysis.
4
Consistency-Aware Training
LoRA rank-8 fine-tuning on LLaVA-1.5 using the pairwise consistency loss alongside standard cross-entropy (adapter configuration sketched after this list).
5
Generalization Evaluation
Validates that GQA-trained counterfactual fine-tuning transfers to VQA v2 without degrading standard accuracy.
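A rough sketch of the memory-aware loading from phase 2 and the rank-8 LoRA setup from phase 4, using HuggingFace Transformers, bitsandbytes, and PEFT. The VRAM threshold, checkpoint id, target modules, and all LoRA hyperparameters other than the rank are illustrative assumptions:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # public checkpoint id (assumed)


def load_vlm(model_id: str = MODEL_ID):
    """Load LLaVA-1.5 in fp16 when VRAM is plentiful, otherwise fall back
    to 4-bit NF4 quantization so inference fits on ~8 GB GPUs."""
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb < 16:  # illustrative threshold
        quant = BitsAndBytesConfig(load_in_4bit=True,
                                   bnb_4bit_quant_type="nf4",
                                   bnb_4bit_compute_dtype=torch.float16)
        model = LlavaForConditionalGeneration.from_pretrained(
            model_id, quantization_config=quant, device_map="auto")
    else:
        model = LlavaForConditionalGeneration.from_pretrained(
            model_id, torch_dtype=torch.float16, device_map="auto")
    return model, AutoProcessor.from_pretrained(model_id)


# Rank-8 LoRA adapters for consistency-aware fine-tuning (phase 4); only the
# adapter weights are trained against CE_orig + CE_cf + lambda * PC_loss.
model, processor = load_vlm()
lora_config = LoraConfig(
    r=8,                                   # rank 8, as stated above
    lora_alpha=16,                         # illustrative scaling
    lora_dropout=0.05,                     # illustrative dropout
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```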
Tech Stack

Models, frameworks, datasets

Models
LLaVA-1.5-7B · InstructBLIP-FlanT5-XL
Frameworks
PyTorch · HuggingFace Transformers · PEFT (LoRA) · bitsandbytes
Datasets
GQA · VQA v2
Supporting
scikit-learn · pandas · matplotlib · OpenCV
Outcome
Accuracy alone doesn't tell the whole story of visual reasoning.

This work introduces a reproducible framework for probing VLM reasoning beyond accuracy metrics. The Consistency Score reveals a clear gap: today's best open-source VLMs are strong at entailment but fail dramatically at attribute-level reasoning and logical negation. The pairwise consistency loss shows that targeted fine-tuning can close part of this gap — but at the cost of other reasoning modes, pointing to a fundamental tension worth exploring further.

Hemanth Sai .M
MS AI · Northeastern University