Making LLMs Faster Without Retraining

The question that started this was simple: can you make an LLM run faster at inference time without touching its training, without swapping hardware, and without changing what it knows? Not quantization — that trades precision for speed. Not distillation — that requires a smaller model. Something more surgical: replace the expensive matrix multiplications inside the model with equivalent operations that are provably cheaper, and trust the math to hold.

Six experiments later, I had an answer: a GPT-2 Large model with a theoretical 1.6× inference speedup (counted in multiply-adds, not yet in wall-clock time) and 11% better perplexity than the original on WikiText-2. The whole thing cost 55 minutes on a rented GPU and $0.55 in cloud compute.

This is the story of how I got there — including three dead ends, one near-miss, and the moment the zones finally overlapped.

The Core Idea

A single transformer MLP layer computes:

Y = σ(W · X)    W: (out × in),  X: (in × N)

That matrix multiplication costs out × in multiply-adds per token. If we can approximate W as a product of two smaller matrices — a factorization — the cost drops to (out + in) × k, where k is the rank of the approximation:

speedup = (out × in) / ((out + in) × k)

For a speedup greater than 1, we need k < (out × in) / (out + in). For a square matrix (out = in = n) this is k < n / 2: the rank must be less than half the dimension. Rectangular matrices are more forgiving; a 4× MLP expansion (out = 4 × in) gives k < 0.8 × in.

This is the break-even condition. Everything in this experiment hinges on it.
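
The break-even arithmetic is easy to check by hand or in a few lines of code. The sketch below uses two illustrative helpers, `speedup` and `break_even_rank` (hypothetical names, not taken from the experiment code), and counts only multiply-adds, so it says nothing about wall-clock time.

```python
def speedup(out_dim: int, in_dim: int, k: int) -> float:
    """Theoretical multiply-add ratio: dense cost / rank-k factored cost."""
    return (out_dim * in_dim) / ((out_dim + in_dim) * k)


def break_even_rank(out_dim: int, in_dim: int) -> float:
    """Rank at which the factored form costs exactly as much as the dense one."""
    return (out_dim * in_dim) / (out_dim + in_dim)


# Square matrices break even at half the dimension.
print(break_even_rank(128, 128))    # 64.0
print(break_even_rank(768, 768))    # 384.0
# A 4x-wide MLP matrix (out = 4 * in) breaks even at 0.8 * in.
print(break_even_rank(5120, 1280))  # 1024.0
print(speedup(5120, 1280, 640))     # 1.6
```

The last two lines use the 1280 × 5120 shape of a GPT-2 Large MLP matrix, which is the shape the 1.6× figure at rank=640 corresponds to.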

Figure: speedup drops below 1× once rank exceeds half the matrix dimension; the entire experiment lives in the left zone.

The question is then: can we find a rank k that is both low enough for real speedup and high enough that quality doesn’t collapse? These two constraints pull in opposite directions. Finding where they overlap — if they do at all — is the entire experiment.

Experiment 01 — Establishing Ground Rules

Before touching any real model, I ran four sub-experiments on 128×128 random matrices. Two things came out of this that shaped everything that followed.

Binary gates are a dead end. An obvious thought: ReLU zeroes out roughly half of all activations. If we could predict which neurons will fire before doing the matmul, we could skip those rows entirely. But a threshold gate at the median activation is mathematically equivalent to ReLU — the relative error is 0.0001. The idea is valid. The speedup is zero. The activation is not the bottleneck; the matrix multiplication is. I eliminated this direction immediately.

Error accumulates brutally. At rank=32 (50% of full rank), a single layer has 25% relative error. After 12 stacked layers, that number compounds to 90%. This set a hard constraint for every subsequent experiment: per-layer error must stay below 5% for a deep network to remain coherent.
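
A rough way to watch this compounding is to push the same input through a dense stack and a rank-truncated copy of it. The sketch below is an illustration under my own assumptions (Gaussian weights with He-style scaling, ReLU between layers, 64 random input columns); it is not the original sub-experiment, and its numbers will not match the ones quoted above.

```python
import numpy as np


def truncate(W: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of W in the Frobenius norm (truncated SVD)."""
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vh[:k]


def stacked_errors(n: int = 128, k: int = 32, depth: int = 12, seed: int = 0) -> list[float]:
    """Relative output error after each layer of a random ReLU stack."""
    rng = np.random.default_rng(seed)
    x_dense = rng.standard_normal((n, 64))   # 64 random input columns
    x_lowrank = x_dense.copy()
    errors = []
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)  # He-style scaling
        x_dense = np.maximum(W @ x_dense, 0.0)
        x_lowrank = np.maximum(truncate(W, k) @ x_lowrank, 0.0)
        errors.append(float(np.linalg.norm(x_dense - x_lowrank) / np.linalg.norm(x_dense)))
    return errors


errs = stacked_errors()
for layer in (1, 6, 12):
    print(f"after layer {layer:2d}: relative error {errs[layer - 1]:.2f}")
```

Sweeping `k` in this harness is a quick way to see how small the per-layer error has to be before a 12-layer stack stays usable.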

At 128×128, no configuration produced both meaningful speedup and acceptable quality. Any rank low enough to cut cost (below 64, by the formula above) left the per-layer error well above the 5% budget, so the two zones did not come close to overlapping. Scale would be necessary.

Experiment 02 — GPT-2 Small: Naive SVD

Moving to a real model (GPT-2 small, 117M parameters, 768-dimensional MLP), I ran the obvious baseline: truncated SVD on every weight matrix, sweep ranks, measure perplexity.

The results were unambiguous, and not in a good way:

Rank	% of Dense	Perplexity	×Baseline
256	22%	600+	~23×
512	44%	293	~10×
768	67%	35–50	~1.5–2×

The singular spectrum analysis delivered the first real surprise: capturing 90% of the spectral energy takes a rank equal to 39–67% of the matrix dimension, depending on the layer. GPT-2 weights are not low-rank; they use almost all their directions. They were trained under many competing objectives and did not learn to be efficient in the SVD sense.
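
For readers who want to repeat the spectrum check on their own weights, here is a minimal sketch of the "rank at 90% energy" measurement. The helper name `rank_at_energy` is illustrative, and "energy" is assumed to mean the sum of squared singular values.

```python
import numpy as np


def rank_at_energy(W: np.ndarray, energy: float = 0.90) -> int:
    """Smallest k whose top-k singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(W, compute_uv=False)
    cum = np.cumsum(s ** 2)
    cum /= cum[-1]                      # normalise so the last entry is exactly 1.0
    return int(np.searchsorted(cum, energy) + 1)


# A rank-1 matrix needs one direction; a flat spectrum needs most of them.
rng = np.random.default_rng(0)
low_rank = np.outer(rng.standard_normal(64), rng.standard_normal(64))
flat = np.eye(64)
print(rank_at_energy(low_rank))  # 1
print(rank_at_energy(flat))      # 58
```

Dividing the result by `min(W.shape)` gives the fraction of dimensions needed, which is the quantity reported per layer above.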

The fatal geometry: GPU speedup only appears at rank ≤ 256 (theoretical 3.8×), but at rank ≤ 256 the model is completely broken. The speedup zone and the quality zone don’t overlap at this model scale. Naive SVD was dead.

Experiment 03 — Activation-Aware SVD

SVD minimizes weight reconstruction error — ||W - Wk|| in the Frobenius norm. But what actually matters for inference is output error — ||WX - WkX|| on real inputs. These are not the same thing.

ASVD (from arXiv:2312.05821) makes this concrete: scale each column of W by the mean activation magnitude of the corresponding input channel before running SVD, then de-scale after. Channels that carry more signal get more rank budget. The decomposition is reweighted to minimize output error, not weight error.

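import numpy as np

# W: (out, in) weights; mean_activation_per_channel: (in,) mean |x| per input channel
s = mean_activation_per_channel ** alpha            # alpha = 0.5, from the paper
W_scaled = W * s                                    # scale each input column of W
U, S, Vh = np.linalg.svd(W_scaled, full_matrices=False)
W_approx = ((U[:, :k] * S[:k]) @ Vh[:k]) / s        # keep top-k, then undo the scaling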
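
To make the difference between weight error and output error concrete, here is a self-contained sketch that runs both decompositions on synthetic data. Everything in it is a stand-in: the matrix sizes, the uneven `channel_scale`, the small epsilon added to `s`, and the helper names `svd_truncate` and `asvd_truncate` are my assumptions for illustration, not the experiment code.

```python
import numpy as np


def svd_truncate(W: np.ndarray, k: int) -> np.ndarray:
    """Plain truncated SVD: best rank-k fit to the weights themselves."""
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vh[:k]


def asvd_truncate(W: np.ndarray, X: np.ndarray, k: int, alpha: float = 0.5) -> np.ndarray:
    """Rank-k fit after scaling W's columns by per-channel activation magnitude.

    W: (out, in) weight matrix. X: (in, N) calibration activations.
    """
    s = np.abs(X).mean(axis=1) ** alpha + 1e-8   # (in,), epsilon avoids divide-by-zero
    return svd_truncate(W * s, k) / s            # scale columns, truncate, de-scale


# Synthetic stand-in data: a few input channels carry much larger activations.
rng = np.random.default_rng(0)
out_dim, in_dim, n_tokens, k = 96, 64, 256, 16
W = rng.standard_normal((out_dim, in_dim))
channel_scale = np.where(np.arange(in_dim) < 8, 10.0, 0.1)
X = rng.standard_normal((in_dim, n_tokens)) * channel_scale[:, None]

for name, W_k in [("naive", svd_truncate(W, k)), ("asvd", asvd_truncate(W, X, k))]:
    weight_err = np.linalg.norm(W - W_k) / np.linalg.norm(W)
    output_err = np.linalg.norm(W @ X - W_k @ X) / np.linalg.norm(W @ X)
    print(f"{name:5s} weight error {weight_err:.3f}  output error {output_err:.3f}")
```

Plain SVD is guaranteed to win on weight error, since it is the optimal rank-k fit in the Frobenius norm. The interesting column is the output error, which is the one the activation scaling is designed to reduce.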

The improvement was dramatic:

Rank	Naive PPL	ASVD PPL	Gain
384	601	64	9×
512	293	34	8×

ASVD at rank=512 gave perplexity of 34 — 1.33× baseline, near-usable. Naive SVD at the same rank was unusable at 10× baseline. Activation-aware factorization was clearly the right approach.

But the speedup math was brutal. Rank=512 on 768-dimensional matrices gives only 1.2× theoretical speedup — barely worth the approximation cost. And the per-layer analysis revealed a second problem: layer 0 is catastrophically sensitive. Compressing only layer 0 to rank=128 raised perplexity from 28 to 136 — a 5× hit. Every other layer was tolerant. This created a structural blocker.

The speedup zone and quality zone were still not overlapping at GPT-2 small scale. The math was clear about why: the break-even rank for a square 768-dimensional matrix is 384, exactly where even ASVD starts to fail.

Experiment 04 — Monarch Projection

Perhaps the problem with SVD is that it finds one global set of orthogonal directions for the entire matrix. Monarch approximation (arXiv:2204.00595) uses a different structure: W ≈ L · P^T · R, where L and R are block-diagonal matrices and P is a permutation. Each block captures its own local subspace independently, without competing for global singular vectors.

Monarch consistently outperformed SVD and ASVD at equal parameter budgets:

n_blocks	Equiv Rank	Naive SVD	ASVD	Monarch
8	83	5789 (206×)	7891 (281×)	2406 (85×)

At the same parameter budget, Monarch's perplexity is 2.4× lower than naive SVD's and 3.3× lower than ASVD's. Block structure captures local geometry that global SVD misses.

The most instructive failure of the entire experiment came here: I tried combining ASVD’s activation scaling with Monarch factorization (AA-Monarch). The result was worse than plain Monarch at every setting. The weight reconstruction error was higher, and the perplexity was higher.

ASVD scaling helps in the high-budget regime (rank ≥ 40% of dense), hurts in the low-budget regime (rank < 20%). At very low parameter budgets, every rank slot matters, and distorting the spectrum before decomposition wastes capacity instead of redirecting it. The activation-aware trick is not universally beneficial — it only works when there’s enough budget to be selective.

A second observation from this experiment: Monarch had higher Frobenius weight reconstruction error than SVD, yet lower perplexity. Optimizing weight space is a poor proxy for output quality. This lesson applies to all compression work: always evaluate on the actual task metric.

But even with the 3× structural advantage, Monarch at n_blocks=8 still gave 85× baseline perplexity. The bottleneck was GPT-2 small’s size, not the approximation method. GPT-2 small is simply too small for low-rank factorization to work in the speedup-quality overlap sense.

Experiment 05 — GPT-2 Large: The First Overlap

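The break-even rank scales with matrix dimension. For square 768-dim matrices it is 384; for square 1280-dim matrices (GPT-2 Large) it is 640. The 4× wide MLP matrices (1280 × 5120) break even later, at rank 1024, which is why the table below shows 1.00× at rank=1024. Larger matrices have more absolute rank budget before hitting the break-even point, so they are inherently more compressible in the low-rank sense.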

I applied ASVD to all 36 MLP layers of GPT-2 Large (774M parameters) and swept ranks:

Rank	Theory Speedup	Naive PPL	ASVD PPL
256	4.00×	13035	5516
512	2.00×	2133	839
640	1.60×	1554	45.9 (2.78×)
768	1.33×	996	22.6 (1.37×)
1024	1.00×	25.2	18.9 (1.14×)

Rank=640: 1.6× speedup at 2.78× baseline perplexity. Rank=768: 1.33× speedup at 1.37× baseline. For the first time, the numbers are in the same ballpark. The zones touched.

Figure: measured results on GPT-2 Large. At rank=640 (1.6× theoretical speedup), ASVD reaches 2.78× baseline while naive SVD stays at 94×.

But the more important result was the per-layer sensitivity analysis. I compressed each of the 36 layers individually — one layer at a time, all others kept dense — at rank=640:

Every single layer individually tolerates rank=640 compression within ±2% of baseline perplexity.

This is the opposite of GPT-2 small, where layer 0 was catastrophically fragile. At GPT-2 Large scale, information is distributed evenly across all layers. The 2.78× global degradation is purely cumulative: 36 layers × small error = large compound error. No individual layer is broken. The damage is additive, not structural.
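
The one-layer-at-a-time procedure is simple enough to express as a small harness. The sketch below is hypothetical: `evaluate` stands for whatever function returns perplexity for a given set of compressed layers, and `toy_evaluate` is a made-up stand-in that mimics the GPT-2 small pattern (a fragile layer 0) so the example runs without a model.

```python
from typing import Callable, Dict, FrozenSet


def layer_sensitivity(
    n_layers: int,
    evaluate: Callable[[FrozenSet[int]], float],
) -> Dict[int, float]:
    """Perplexity ratio vs. the dense baseline when exactly one layer is compressed.

    `evaluate(compressed)` is a hypothetical callback: it must return perplexity
    with the layers in `compressed` factored and every other layer left dense.
    """
    baseline = evaluate(frozenset())
    return {i: evaluate(frozenset({i})) / baseline for i in range(n_layers)}


def toy_evaluate(compressed: FrozenSet[int]) -> float:
    """Stand-in for a real eval loop: layer 0 is fragile, the rest cost 1% each."""
    ppl = 28.0
    for i in compressed:
        ppl *= 5.0 if i == 0 else 1.01
    return ppl


ratios = layer_sensitivity(4, toy_evaluate)
fragile = [i for i, r in ratios.items() if r > 1.05]
print(fragile)  # [0]
```

An empty `fragile` list is the GPT-2 Large situation: no single layer is a blocker, so whatever global damage remains is cumulative.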

This distinction matters enormously. Structural damage (one broken link) cannot be recovered by fine-tuning — the information is gone. Additive damage (36 slightly worn links) can be recovered — the information is still there, just distributed suboptimally. That distinction opened the door to the final experiment.

Experiment 06 — LoRA Recovery

If every layer’s error is small and individually recoverable, then LoRA adapters — small trainable rank-r corrections added on top of frozen weights — should be able to absorb each layer’s local error. The question: can rank-16 adapters per layer fully recover the damage from rank-640 compression across 36 layers, using only 1.13% trainable parameters?

Setup:

ASVD rank=640 applied to all MLP and attention projection layers (36 layers × 3 weight matrices)
LoRA rank=16 adapters on top of frozen compressed weights
Trainable parameters: 8.8M out of 783M total (1.13%)
Dataset: WikiText-2 (2.4M training tokens)
3 epochs on a Modal T4 GPU — 55 minutes total, $0.55
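
The trainable-parameter figure can be sanity-checked with a few lines of arithmetic. The setup above does not name the three matrices per layer, so the choice below (attention output projection plus both MLP matrices) is an assumption; it is one combination that reproduces the 8.8M figure.

```python
def lora_params(in_dim: int, out_dim: int, r: int) -> int:
    """A rank-r LoRA adapter adds an (r x in) and an (out x r) matrix."""
    return r * (in_dim + out_dim)


d, r, n_layers = 1280, 16, 36
# Assumed targets (not confirmed by the setup list): attn.c_proj, mlp.c_fc, mlp.c_proj.
per_layer = (
    lora_params(d, d, r)          # attn.c_proj: 1280 -> 1280
    + lora_params(d, 4 * d, r)    # mlp.c_fc:    1280 -> 5120
    + lora_params(4 * d, d, r)    # mlp.c_proj:  5120 -> 1280
)
trainable = per_layer * n_layers
print(trainable)                              # 8847360
print(round(100 * trainable / 783e6, 2))      # 1.13
```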

Results:

Stage	Wiki PPL	×Baseline	Time
Baseline (dense)	27.56	1.00×	—
ASVD rank=640	153.82	5.58×	—
+ LoRA epoch 1	25.57	0.93×	18 min
+ LoRA epoch 2	24.70	0.90×	37 min
+ LoRA epoch 3	24.53	0.89×	55 min

The compressed model, after LoRA recovery, surpasses the uncompressed original by 11% — while keeping the full 1.6× inference speedup.

Figure: WikiText-2 perplexity at each checkpoint. Back at baseline by minute 9, below it by minute 18.

The recovery curve is the most striking result. ASVD compression degraded WikiText PPL from 27.56 to 153.82, a 5.58× hit. After 9 minutes of LoRA fine-tuning (300 steps), the model is already back at baseline. After 18 minutes (epoch 1), it is below baseline. About 99% of the total perplexity reduction (153.82 to 24.53) happened in epoch 1.

This is not typical fine-tuning behavior. It is more like error correction than adaptation. The model isn’t learning a new task — it’s undoing small systematic errors in 36 layers simultaneously. Each rank-16 adapter is correcting a small local error in its layer, and the sum of 36 small corrections fully repairs the cascade.

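At inference time, the LoRA adapters do not have to run as separate modules, but they cannot simply be discarded: they carry the recovery. They can be merged into the factored weights by appending the LoRA factors to A and B, which raises the effective rank from 640 to 656 (about 2.5% more multiply-adds). The deployed model then uses only factored weights and keeps essentially the full 1.6× theoretical speedup.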
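
One way a merge could keep the factored form is to concatenate the LoRA factors onto the compressed factors along the rank axis. This is a sketch of that idea under my own assumptions (names such as `lora_up`, `lora_down`, and `scale` are illustrative, and the sizes are small stand-ins for rank 640 and rank 16); it is not a description of the published code.

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, k, r, scale = 48, 32, 12, 4, 2.0   # small stand-ins for 640 and 16

A = rng.standard_normal((out_dim, k))        # compressed weight, W_approx = A @ B
B = rng.standard_normal((k, in_dim))
lora_up = rng.standard_normal((out_dim, r))  # LoRA update = scale * lora_up @ lora_down
lora_down = rng.standard_normal((r, in_dim))

# Concatenate along the rank axis: [A | scale * up] @ [B ; down] = A @ B + scale * up @ down
A_merged = np.hstack([A, scale * lora_up])   # (out, k + r)
B_merged = np.vstack([B, lora_down])         # (k + r, in)

x = rng.standard_normal((5, in_dim))
y_separate = x @ (A @ B).T + scale * (x @ lora_down.T) @ lora_up.T
y_merged = (x @ B_merged.T) @ A_merged.T
print(np.allclose(y_separate, y_merged))     # True
print(A_merged.shape[1])                     # 16
```

The merged layer is still two thin matmuls, just with an inner dimension of k + r instead of k. Folding the update into a dense matrix would also work numerically, but it would give up the factored form and with it the speedup.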

What the Six Experiments Taught Me

Scale is the gating factor. For square matrices the break-even rank is k < n / 2. At 768-dim, this is 384, exactly where quality collapses. At 1280-dim, this is 640, where ASVD gives a recoverable 2.78× quality hit. Larger models are inherently more compressible in the low-rank sense, which is consistent with compression work tending to report better results on larger models.

Output error is not weight error. Monarch had higher Frobenius weight reconstruction error than SVD but lower perplexity. ASVD outperforms naive SVD by 8–34× at the same rank. The metric that matters is the task metric. Weight reconstruction, singular value decay, Frobenius norm — these are proxies. Perplexity is the signal.

Per-layer analysis reveals whether damage is structural or additive. GPT-2 small: one fragile layer, structural blocker. GPT-2 Large: all layers tolerant, damage purely cumulative. The distinction determines whether recovery-by-fine-tuning is possible at all.

Activation scaling helps only when there is budget to be selective. ASVD scaling dramatically improves quality at rank ≥ 40% of dense. At rank < 20%, it can hurt by distorting the spectrum. Applying it universally is a mistake.

Infrastructure constraints are research constraints. Two local training runs were killed mid-epoch by system OOM — Firefox competing for 12.5 GB VRAM with swap exhausted. Moving to a Modal T4 resolved this for $0.55. The cleaner failure mode of cloud compute is worth the overhead for experiments that push local memory limits.

The Pipeline

The full recipe that works, confirmed end-to-end on GPT-2 Large:

1. Calibration (one-time)
   Run ~260 tokens through the model.
   s_j = (mean |X_j|)^0.5  per input channel, per layer.

2. ASVD compression (one-time, offline)
   For each target layer:
     W_scaled = W * s
     U, Σ, Vh = SVD(W_scaled)
     W_approx = (U[:,:k] * Σ[:k]) @ Vh[:k] / s
     Store in factored form: A = U[:,:k] * Σ[:k],  B = Vh[:k] / s   (so that A @ B = W_approx)

3. LoRA fine-tuning (~1 epoch, offline)
   Add rank-16 LoRA adapters on frozen compressed weights.
   Fine-tune on domain-relevant data.
   ~1.13% trainable parameters, ~18 min on T4.

4. Inference
   Replace nn.Linear with factored matmul: y = (x @ B.T) @ A.T
   Theoretical speedup: 1.6× at rank=640 on 1280-dim matrices.
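
Steps 2 and 4 can be checked together on a toy matrix. The sketch below uses random stand-ins for the weights and the calibration scales, and it folds the de-scaling into B so that the product A @ B equals W_approx; the sizes are arbitrary small values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, k = 64, 48, 8
W = rng.standard_normal((out_dim, in_dim))
s = rng.uniform(0.5, 2.0, size=in_dim)       # stand-in for the calibration scales

U, S, Vh = np.linalg.svd(W * s, full_matrices=False)
A = U[:, :k] * S[:k]                         # (out, k)
B = Vh[:k] / s                               # (k, in), de-scaling folded into B
W_approx = A @ B                             # dense form, only built here for checking

x = rng.standard_normal((10, in_dim))        # 10 tokens, row-major like nn.Linear input
y_factored = (x @ B.T) @ A.T                 # two thin matmuls, never materialises W_approx
print(np.allclose(y_factored, x @ W_approx.T))   # True

dense_macs = out_dim * in_dim                # multiply-adds per token, dense
factored_macs = (out_dim + in_dim) * k       # multiply-adds per token, factored
print(dense_macs / factored_macs)
```

The parenthesisation in `(x @ B.T) @ A.T` matters: evaluating `B.T @ A.T` first would rebuild the dense matrix and throw the savings away.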

One practical detail worth calling out: GPT-2 stores its linear layers as Conv1D with weights shaped (in, out), transposed relative to the standard PyTorch Linear convention. ASVD must transpose before SVD and back after, or the compression is silently wrong. PEFT's LoRA handles this automatically by detecting Conv1D and setting fan_in_fan_out=True. A missed transpose in the ASVD step would have been hard to notice without careful reading of the model source.
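
The orientation issue is easy to reproduce with plain arrays. In the sketch below (toy sizes, random stand-in values), applying the Linear-layout scaling directly to a Conv1D-layout weight scales output channels instead of input channels. On a square matrix nothing complains; on a non-square one NumPy raises a broadcasting error, which is the lucky case.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim = 6, 6                       # square, like a 1280 x 1280 projection
W_conv1d = rng.standard_normal((in_dim, out_dim))   # Conv1D layout: (in, out), y = x @ W
s = rng.uniform(0.5, 2.0, size=in_dim)       # one scale per input channel

# Correct: move to the (out, in) Linear layout, scale columns, move back.
scaled_ok = (W_conv1d.T * s).T               # same as W_conv1d * s[:, None]
# Wrong: applying the Linear-layout recipe directly scales output channels instead.
scaled_wrong = W_conv1d * s

print(np.allclose(scaled_ok, W_conv1d * s[:, None]))   # True
print(np.allclose(scaled_ok, scaled_wrong))            # False

# With a non-square weight the same mistake fails loudly instead of silently.
W_rect = rng.standard_normal((in_dim, 4 * in_dim))
try:
    W_rect * s
except ValueError as exc:
    print("shape mismatch caught:", type(exc).__name__)
```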

What’s Next

Experiment 07 (planned): LLaMA-7B. The same ASVD + LoRA pipeline on a model practitioners actually deploy. LLaMA-7B's hidden dimension is 4096 (its MLP is 11008 wide), giving a square-matrix break-even rank of 2048. The hypothesis: results should be strictly better than GPT-2 Large because the matrices are about 3× larger, with more redundancy and more room for compression. Loading in 4-bit via bitsandbytes should fit in ~4 GB VRAM.

Experiment 08 (planned): Actual wall-clock speedup. All speedup numbers above are theoretical FLOPs reductions. Real PyTorch inference with nn.Linear doesn’t automatically benefit from factored weights — you need to either replace the Linear module with a custom one that stores A and B separately, or use a fused kernel. The question is whether 1.6× in FLOPs translates to 1.6× in wall-clock time. It usually doesn’t, fully — memory bandwidth and kernel launch overhead both matter. Experiment 08 will measure the real number.
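
A first look at the gap between FLOPs and wall-clock time does not need a GPU. The sketch below times a dense matmul against the two-step factored one with NumPy on CPU, using random stand-in matrices at the GPT-2 Large MLP shape. The `time_fn` helper is illustrative, and CPU numbers should not be read as a prediction of GPU behavior.

```python
import time
import numpy as np


def time_fn(fn, repeats: int = 10) -> float:
    """Median wall-clock seconds per call, after one warm-up call."""
    fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return float(np.median(samples))


rng = np.random.default_rng(0)
out_dim, in_dim, k, n_tokens = 5120, 1280, 640, 64
W = rng.standard_normal((out_dim, in_dim)).astype(np.float32)
A = rng.standard_normal((out_dim, k)).astype(np.float32)
B = rng.standard_normal((k, in_dim)).astype(np.float32)
x = rng.standard_normal((n_tokens, in_dim)).astype(np.float32)

t_dense = time_fn(lambda: x @ W.T)
t_factored = time_fn(lambda: (x @ B.T) @ A.T)
print(f"theoretical {out_dim * in_dim / ((out_dim + in_dim) * k):.2f}x, "
      f"measured {t_dense / t_factored:.2f}x")
```

Varying `n_tokens` is worth trying: small batches tend to be dominated by memory traffic and call overhead, which is where the measured ratio is most likely to fall short of the theoretical one.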

Non-uniform rank allocation. All experiments used the same rank=640 across all 36 layers. The per-layer sensitivity analysis showed all layers are equally tolerant at that rank — but some may tolerate lower ranks without quality loss. A non-uniform allocation at the same total FLOPs budget could give better quality or more speedup.

The Model Is on HuggingFace

The trained adapter — LoRA rank=16 on top of ASVD rank=640 compressed GPT-2 Large — is published at medelharchaoui/gpt2-large-asvd640-lora. You can load it directly with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2-large")
model = PeftModel.from_pretrained(base, "medelharchaoui/gpt2-large-asvd640-lora")

One thing worth noting: loading the adapter on top of an uncompressed GPT-2 Large will work in the sense that it won’t error, but the speedup won’t be there — the LoRA weights were trained specifically on the ASVD-compressed base. To get both the quality and the 1.6× inference speedup, you need to run the ASVD compression step first, then load the adapter on top of the factored weights.

The thing that surprised me most about this experiment was how fast the recovery happened. The model was broken (5.58× baseline perplexity), and 9 minutes of gradient steps brought it back to baseline, then below it. When I saw step 300 land at 27.62, essentially at the 27.56 baseline, with roughly 45 more minutes of training still available, I realized the compression hadn’t destroyed the model’s capacity. It had just rearranged it in a slightly suboptimal way. The LoRA pass found the rearrangement and fixed it.

That’s the insight worth keeping: compression doesn’t have to be the end state. It can be a distortion you then correct cheaply. The one-time cost of correction is paid offline. The speedup is reaped at every inference call, forever.

All code, experiment logs, and results are at github.com/elharchaoui/OptIAI. Hardware: RTX 3060 (12.5 GB VRAM) for local experiments, Modal T4 for training.