How Low-Rank Adaptation (LoRA) Changes the Geometry of Self-Attention
Introduction
Low-Rank Adaptation (LoRA) is typically presented as a parameter-efficient alternative to full fine-tuning, motivated by reducing memory and compute costs. While the low-rank formulation explains how LoRA reduces the number of trainable parameters, it does not explain why LoRA is particularly effective in attention-based models.
This post analyzes LoRA from a geometric perspective, focusing on how low-rank updates reshape the attention mechanism itself. We combine mathematical intuition with a minimal PyTorch implementation to empirically demonstrate how LoRA alters attention scores and distributions while keeping the backbone frozen.
Self-Attention as a Geometric Operation
Given an input sequence $X \in \mathbb{R}^{n \times d}$, self-attention computes:
\[Q = X W_Q, \quad K = X W_K, \quad V = X W_V\]
The attention output is:
\[\text{Attention}(X) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V\]
Each query vector $q_i$ defines a direction in embedding space, and the attention weights depend on the angular alignment between $q_i$ and the key vectors $k_j$. From this view, self-attention performs a context-dependent projection of values onto a subspace defined by query–key similarity.
A Minimal Self-Attention Implementation
To ground this discussion, we start with a minimal single-head self-attention module. The goal is not efficiency or completeness, but explicit access to intermediate tensors for analysis.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.d = d_model
        # Separate projections so each one can be wrapped or inspected independently
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        Q = self.q_proj(x)
        K = self.k_proj(x)
        V = self.v_proj(x)
        # Scaled dot-product scores determine how information flows between tokens
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d)
        attn = F.softmax(scores, dim=-1)
        out = attn @ V
        # Return intermediates so the geometry can be analyzed directly
        return out, Q, K, scores, attn
This implementation makes the geometry explicit: queries and keys interact via dot products, producing a score matrix that determines how information flows across tokens.
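As a quick sanity check, the snippet below runs the module on a random 10-token sequence; the tensor names and sizes are illustrative rather than taken from the original experiments.

# Quick sanity check on a toy sequence (names and sizes are illustrative).
torch.manual_seed(0)
x = torch.randn(10, 64)                    # 10 tokens, d_model = 64
attn = TinySelfAttention(d_model=64)
out, Q, K, scores, A = attn(x)
print(out.shape, scores.shape, A.shape)    # (10, 64), (10, 10), (10, 10)
print(A.sum(dim=-1))                       # each row of the attention matrix sums to 1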
LoRA in Attention Layers
LoRA modifies a projection matrix $W \in \mathbb{R}^{d \times d}$ as:
\[W' = W + BA\]
where:
- $A \in \mathbb{R}^{r \times d}$
- $B \in \mathbb{R}^{d \times r}$
- $r \ll d$
The update $\Delta W = BA$ has rank at most $r$, constraining learning to a low-dimensional subspace.
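This rank constraint is easy to verify numerically. A minimal sketch, assuming random factors and the sizes used later in this post:

# The update built from two thin factors has rank at most r,
# while using far fewer parameters than a full d x d matrix.
d, r = 64, 4
A = torch.randn(r, d)
B = torch.randn(d, r)
delta_W = B @ A                            # (d, d) low-rank update
print(torch.linalg.matrix_rank(delta_W))   # at most r (here: 4)
print(d * d, 2 * d * r)                    # full update vs. LoRA parameters: 4096 vs. 512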
Minimal LoRA Implementation
We implement LoRA by wrapping an existing linear layer and adding a trainable low-rank residual branch. The base weights remain frozen.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        # Freeze the pretrained weights; only the low-rank factors are trained
        for p in self.base.parameters():
            p.requires_grad = False
        # A: (r, d_in) with small random init; B: (d_out, r) initialized to zero
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        base_out = self.base(x)
        # Residual low-rank branch: the effective weight update has rank at most r
        lora_out = (x @ self.A.t()) @ self.B.t()
        return base_out + self.scale * lora_out
Initializing $B=0$ ensures that the LoRA branch starts as a no-op, making the effect of adaptation easy to isolate.
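We can check this no-op property directly; a small sketch under the same toy setup:

# A freshly wrapped layer behaves exactly like the frozen base, since B = 0.
base = nn.Linear(64, 64, bias=False)
wrapped = LoRALinear(base, r=4, alpha=8.0)
x = torch.randn(10, 64)
print(torch.allclose(wrapped(x), base(x)))   # True: the LoRA branch contributes nothing yet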
Injecting LoRA into Self-Attention
We now apply LoRA only to the query projection:
attn = TinySelfAttention(d_model=64)
attn.q_proj = LoRALinear(attn.q_proj, r=4)
Keys and values remain unchanged. Any change in attention behavior must therefore arise from query reorientation alone.
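To confirm that adaptation is confined to the low-rank query branch, we can freeze the rest of the backbone (as the experiments below assume) and count trainable parameters; a minimal sketch:

# Freeze the remaining backbone so only the LoRA factors receive gradients.
for p in attn.k_proj.parameters():
    p.requires_grad = False
for p in attn.v_proj.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in attn.parameters() if p.requires_grad)
total = sum(p.numel() for p in attn.parameters())
print(f"{trainable} / {total}")   # 512 trainable LoRA parameters out of 12,800 total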
Effect on Attention Scores
Recall that attention scores are computed as:
\[S = \frac{Q K^\top}{\sqrt{d}}\]
With LoRA applied to the query projection:
\[Q' = Q + \Delta Q, \qquad \Delta Q = X \, \Delta W\]
This introduces an additive term in the scores:
\[S' = \frac{(Q + \Delta Q) K^\top}{\sqrt{d}} = S + \frac{\Delta Q \, K^\top}{\sqrt{d}}\]
Importantly, LoRA does not rewrite the score matrix arbitrarily; it introduces a structured, low-rank perturbation: since $\Delta Q$ has rank at most $r$, so does the score update $\Delta S = \Delta Q K^\top / \sqrt{d}$.
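This structure is easy to check numerically. A standalone sketch with random matrices, following the same conventions as the equations above:

# Standalone check: a rank-r query update produces a rank-r score update.
X = torch.randn(10, 64)
W_K = torch.randn(64, 64)
A, B = torch.randn(4, 64), torch.randn(64, 4)
delta_Q = X @ (B @ A)                            # Delta Q = X * Delta W, rank <= 4
delta_S = (delta_Q @ (X @ W_K).T) / math.sqrt(64)
print(torch.linalg.matrix_rank(delta_S))         # at most r = 4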
Empirical Evidence: How LoRA Alters Attention Geometry
We now analyze the effect of LoRA through a sequence of visualizations. All plots are generated by keeping the input sequence fixed and training only the LoRA parameters on the query projection, while freezing the backbone.
The target during training is to make query token 0 attend strongly to key token 5.
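The exact training setup is not spelled out here, so the sketch below is one plausible minimal version: cross-entropy on the score row of query 0, with key 5 as the target class. The optimizer, learning rate, and step count are assumptions, and it builds on the toy module and freezing from the previous snippets.

# Minimal training sketch (assumed setup): push query 0's attention toward key 5.
torch.manual_seed(0)
x = torch.randn(10, 64)                      # fixed input sequence
target = torch.tensor([5])                   # key index that query 0 should attend to

params = [p for p in attn.parameters() if p.requires_grad]   # LoRA A and B only
opt = torch.optim.Adam(params, lr=1e-2)

for step in range(200):
    out, Q, K, scores, A_mat = attn(x)
    # Treat row 0 of the score matrix as logits over keys and use cross-entropy.
    loss = F.cross_entropy(scores[0].unsqueeze(0), target)
    opt.zero_grad()
    loss.backward()
    opt.step()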
Attention Distribution Before LoRA
Figure: Attention map BEFORE LoRA training 
Key observations:
- Attention is relatively diffuse
- For query 0, attention mass is spread across many keys (e.g. A[0,5] ≈ 0.11)
- No single key dominates the attention distribution
This is expected behavior for an unadapted model with random initialization or generic pre-training.
Attention Distribution After LoRA Training
Figure: Attention map AFTER LoRA training
Interpretation: LoRA introduces a targeted reallocation of attention mass. Crucially:
- Query 0 now concentrates most of its attention mass on key 5, the training target
- Only the intended query changes behavior
- The rest of the attention structure is preserved
This demonstrates that LoRA enables localized control over attention routing rather than wholesale retraining.
Score Matrix Before LoRA (Pre-Softmax Geometry)
Figure: Score matrix BEFORE LoRA — $QK^\top / \sqrt{d}$
This matrix represents the raw geometric interaction between queries and keys before softmax.
Observations:
- Score magnitudes are moderate and balanced
- No extreme values dominate any row
- The geometry reflects generic similarity structure
At this stage, no single query–key interaction is privileged.
Score Matrix After LoRA
Figure: Score matrix AFTER LoRA — $Q' K^\top / \sqrt{d}$
Key changes:
- A strong, localized increase appears at (query=0, key=5)
- Most other entries remain similar to the pre-LoRA matrix
- The change is additive and structured, not noisy
Interpretation:
LoRA modifies the score matrix via the additive term
\[\Delta S = \frac{\Delta Q \, K^\top}{\sqrt{d}}\]
Since the keys are frozen, this confirms that query rotation alone is sufficient to reshape attention geometry.
Query Rotation: Directional Change
Figure: Per-token cosine similarity $\cos(Q_{\text{before}}, Q_{\text{after}})$
This plot measures directional change in query vectors.
Observations:
- Token 0 exhibits a large drop in cosine similarity
- Most other tokens remain close to 1.0
- Directional change is selective, not global
Interpretation:
- LoRA performs controlled rotations of specific query vectors.
- Rather than rewriting embeddings, it reorients queries in representation space.
This supports the geometric view of LoRA as subspace reparameterization.
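A diagnostic along these lines can be computed directly from the queries captured before and after training; a sketch, assuming Q_before and Q_after were saved from the corresponding forward passes:

# Assumes Q_before and Q_after were captured before / after LoRA training.
cos = F.cosine_similarity(Q_before, Q_after, dim=-1)   # per-token cosine similarity, shape (n,)
for i, c in enumerate(cos.tolist()):
    print(f"token {i}: cos = {c:.3f}")                 # token 0 drops noticeably; others stay near 1.0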
Query Update Magnitude
Figure: Per-token $|\Delta Q|$ magnitude after LoRA training 
Observations:
- Token 0 shows a significantly larger $\Delta Q$
- Other tokens have near-zero updates
- Update sparsity emerges naturally from training
Interpretation:
Although LoRA parameters are shared across tokens, the model learns to apply uneven geometric influence, concentrating adaptation where it matters. This explains why LoRA is both:
- Expressive
- Stable under limited parameterization
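The magnitude plot follows from the same captured tensors; a short sketch:

# Assumes the same Q_before / Q_after tensors as in the previous sketch.
delta_norm = (Q_after - Q_before).norm(dim=-1)   # per-token ||Delta Q||, shape (n,)
print(delta_norm)                                # token 0 dominates; other entries are near zero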
Why LoRA Works Well on Queries
Modifying queries affects:
- Which keys are selected
- How attention mass is distributed
Keys serve as shared anchors that every query is compared against; modifying them excessively can destabilize the similarity structure for all tokens at once. LoRA’s low-rank constraint keeps adaptation stable and localized.
Conclusion
LoRA is effective not merely because it is parameter efficient, but because it aligns naturally with the geometry of self-attention. By restricting updates to a low-rank subspace, LoRA performs controlled rotations of query directions, leading to targeted changes in attention without disrupting the backbone.
Understanding LoRA at this geometric level provides practical guidance on:
- where to apply it
- how to choose the rank
- why it can outperform naive fine-tuning in large attention-based models
References
- Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, ICLR 2022
- Vaswani et al., Attention Is All You Need, NeurIPS 2017
- He et al., Towards a Unified View of Parameter-Efficient Transfer Learning, ICLR 2022
- Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs, NeurIPS 2023