Vision-Language Models
NLP Research
LoRA Fine-tuning
2026
Northeastern University
VLM Counterfactual Consistency
Research investigating whether large vision-language models truly reason about images or
rely on superficial pattern-matching — using counterfactual question families and a novel Consistency Score
metric.
Khoury College · Northeastern University
LLaVA-1.5 · InstructBLIP · LoRA · GQA
Standard VQA accuracy metrics hide a critical weakness: a model can score 50% on a benchmark while
failing basic logical reasoning. If a model answers "red" to "What color is the car?",
it should logically answer "no" to "Is the car blue?" — but many state-of-the-art VLMs
don't.
The Research Question
Do vision-language models understand the logical relationships between related questions about the
same image, or do they simply pattern-match each question independently?
Approach
Counterfactual question families
For each original question, we generate a "family" of 1–3 logical variants using four intervention types (a generation sketch follows the four types below). The model is evaluated not just on each question individually, but on whether its answers across the whole
family satisfy the expected logical relations.
Type 01
Negation
Yes/no questions are flipped. If the model says "yes" to the original, it must say
"no" to the negated form — and vice versa.
Type 02
Attribute Swaps
Color, size, or material is changed to a different value. The model must answer
based on the new attribute, not just recall the entity.
Type 03
Entailment
Logical implications are tested. If A is true, then B (which follows from A) should
also be true — the model must maintain logical soundness.
Type 04
Spatial Perturbations
Left/right, on/under, and other spatial relations are reversed. Tests whether the
model actually reads spatial language or guesses from priors.
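To make the dispatch on intervention type concrete, here is a minimal sketch of how a question family might be assembled. The function names, the naive rewrite rules, and the `meta` fields are illustrative assumptions rather than the project's actual GQA-specific generator, and only two of the four intervention types are shown.

```python
# Illustrative sketch only: the real GQA rewrites use richer linguistic rules.
NEGATE_ANSWER = {"yes": "no", "no": "yes"}

def negate_yes_no(question: str) -> str:
    """Naive negation: insert 'not' before the final word,
    e.g. 'Is the car blue?' -> 'Is the car not blue?'."""
    words = question.rstrip("?").split()
    return " ".join(words[:-1] + ["not", words[-1]]) + "?"

def swap_attribute(question: str, old_value: str, new_value: str) -> str:
    """Attribute swap: replace the queried attribute with a different value."""
    return question.replace(old_value, new_value)

def make_family(question: str, answer: str, q_type: str, meta: dict) -> list[dict]:
    """Return the original question plus logical variants with expected answers."""
    family = [{"question": question, "expected": answer, "relation": "original"}]
    if q_type == "yes/no":
        # Type 01 - Negation: flipping the question flips the expected answer.
        family.append({"question": negate_yes_no(question),
                       "expected": NEGATE_ANSWER[answer],
                       "relation": "negation"})
    elif q_type == "attribute":
        # Type 02 - Attribute swap: the model must answer for the new attribute.
        family.append({"question": swap_attribute(question, meta["attribute"], meta["new_attribute"]),
                       "expected": meta["new_answer"],
                       "relation": "attribute_swap"})
    return family
```

For instance, `make_family("Is the car blue?", "no", "yes/no", {})` returns the original plus a negated variant "Is the car not blue?" with expected answer "yes", which is exactly the relation the Consistency Score later checks.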
Results
Consistency Score vs. VQA Accuracy
Evaluated on 1,000 GQA questions, which yield 1,961 counterfactuals in total:
| Model | Consistency Score | VQA Accuracy |
|---|---|---|
| LLaVA-1.5-7B (base) | 0.6925 | 0.439 |
| InstructBLIP-FlanT5-XL | 0.6263 | 0.502 |
| LLaVA-1.5 + LoRA | 0.6598 | 0.493 |
Pass rate by intervention type
| Intervention | LLaVA-1.5 | InstructBLIP | LLaVA + LoRA |
|---|---|---|---|
| Entailment | 99.1% | 90.6% | — |
| Spatial | 53.0% | 27.6% | 45.1% |
| Negation | 14.9% | 27.0% | 17.5% |
| Attribute swap | 6.8% | 6.2% | 8.6% |
Key Finding
LLaVA shows higher consistency (0.69 vs 0.63) despite lower raw accuracy — suggesting better compositional
reasoning. Both models fail catastrophically on attribute swaps (<7%): they answer based on the
entity, not the changed attribute. Negation is also hard for both (15–27%).
LoRA Fine-tuning
Pairwise consistency loss
We designed a novel pairwise consistency loss that operates on answer token
distributions across question families, applied during LoRA rank-8 fine-tuning (see the sketch after the loss components below):
Loss 01
Contradiction Loss
Penalises the model for giving the same answer to questions whose answers should be logically opposite
(negation pairs).
Loss 02
Entailment Loss
Penalises the model for failing to answer "yes" to a counterfactual that logically
follows from the original.
Loss 03
Attribute / Spatial Loss
Penalises similar output distributions for questions that differ only in a swapped
attribute or spatial relation.
Total training objective: CE_orig + CE_cf + λ · PC_loss
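As a rough illustration of how such a loss could act on answer-token distributions, the sketch below gives one plausible form of each component. The function names, the exact per-relation formulas, and the λ weight are assumptions for exposition, not the project's actual implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_consistency_loss(orig_logits, cf_logits, relation, yes_id, no_id):
    """One plausible pairwise consistency term over answer-token distributions.
    orig_logits / cf_logits: logits over the vocabulary at the answer position,
    shape (batch, vocab). `relation` names the intervention linking the pair."""
    p_orig = F.softmax(orig_logits, dim=-1)
    p_cf = F.softmax(cf_logits, dim=-1)

    if relation == "negation":
        # Contradiction loss: penalise probability mass placed on the same
        # yes/no polarity for a pair whose answers should be opposite.
        agree = p_orig[:, yes_id] * p_cf[:, yes_id] + p_orig[:, no_id] * p_cf[:, no_id]
        return agree.mean()
    elif relation == "entailment":
        # Entailment loss: the entailed counterfactual should answer "yes".
        return -torch.log(p_cf[:, yes_id] + 1e-8).mean()
    else:  # "attribute_swap" or "spatial"
        # Attribute / spatial loss: penalise near-identical answer distributions
        # for questions that differ only in the swapped attribute or relation.
        return F.cosine_similarity(p_orig, p_cf, dim=-1).mean()

def total_objective(ce_orig, ce_cf, pc_loss, lam=0.5):
    # CE_orig + CE_cf + λ · PC_loss, matching the objective stated above.
    return ce_orig + ce_cf + lam * pc_loss
```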
Tradeoff observed: LoRA fine-tuning improves VQA accuracy (+5.4%) and attribute
swap reasoning (+1.8%), but hurts spatial reasoning (−7.9%) — revealing that different reasoning modes compete
during training. This tension itself is a meaningful finding about how VLMs learn.
Pipeline
Five sequential phases
1
Counterfactual Generation
Converts raw GQA questions into families (original + 1–3 logical variants) by
dispatching on question type to apply the appropriate intervention.
2
VLM Inference
Runs pretrained models on all family members. Auto-detects VRAM and falls back to 4-bit
quantization (QLoRA-style) when memory is tight, enabling inference on 8 GB GPUs (see the loading sketch after this list).
3
Consistency Scoring
Evaluates per-family scores by checking logical relations. Produces per-intervention
and per-question-type breakdowns for fine-grained analysis.
4
Consistency-Aware Training
LoRA rank-8 fine-tuning on LLaVA-1.5 using the pairwise consistency loss alongside
standard cross-entropy.
5
Generalization Evaluation
Validates that GQA-trained counterfactual fine-tuning transfers to VQA v2 without
degrading standard accuracy.
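For phase 2, a plausible loading helper is sketched below: it checks available VRAM and falls back to 4-bit NF4 quantization via bitsandbytes when the GPU is small. The model id, threshold, and function name are illustrative choices; the actual pipeline may wire this differently.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig

def load_vlm(model_id: str = "llava-hf/llava-1.5-7b-hf", vram_threshold_gb: float = 16.0):
    """Load a VLM, quantizing to 4-bit NF4 when the GPU has limited memory."""
    quant_cfg = None
    if torch.cuda.is_available():
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if total_gb < vram_threshold_gb:
            # Low-VRAM path: 4-bit NF4 weights with fp16 compute (as used for QLoRA).
            quant_cfg = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch.float16,
            )
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=quant_cfg,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```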
Tech Stack
Models, frameworks, datasets
Models
LLaVA-1.5-7B
InstructBLIP-FlanT5-XL
Frameworks
PyTorch
HuggingFace Transformers
PEFT (LoRA)
bitsandbytes
Supporting
scikit-learn
pandas
matplotlib
OpenCV
Outcome
Accuracy alone doesn't tell the whole story of visual reasoning.
This work introduces a reproducible framework for probing VLM reasoning beyond accuracy metrics. The
Consistency Score reveals a clear gap: today's best open-source VLMs are strong at entailment but fail
dramatically at attribute-level reasoning and logical negation. The pairwise consistency loss shows that
targeted fine-tuning can close part of this gap — but at the cost of other reasoning modes, pointing to a
fundamental tension worth exploring further.