Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers
Rowan Bradbury · Aniket Srinivasan Ashok · Sai Kasanagottu · Gunmay Jhingran · Shuai Meng
Abstract
Replacing modules in pretrained models—especially swapping quadratic self-attention for efficient attention alternatives—poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight $\alpha(t)$. Theoretically, DCR eliminates gate-induced gradient variance inherent to stochastic replacement. Empirically, DCR attains faster convergence and stronger alignment than stochastic gating and distillation baselines on controlled attention replacement, establishing a foundation for heterogeneous operator swaps.
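The core mechanism described above, replacing a stochastic gate with a deterministic convex blend of teacher and student outputs, can be sketched in a few lines of PyTorch. The sketch below is an illustrative assumption, not the authors' implementation: the wrapper name `DCRBlend`, the `total_steps` parameter, and the linear anneal of $\alpha(t)$ are all hypothetical choices consistent with the abstract's description.

```python
# Minimal sketch of DCR-style blending, assuming a PyTorch setup.
# Names (DCRBlend, total_steps) and the linear schedule are assumptions.
import torch
import torch.nn as nn


class DCRBlend(nn.Module):
    """Blend a frozen teacher module with a trainable student replacement
    using a deterministic, annealed weight alpha(t)."""

    def __init__(self, teacher: nn.Module, student: nn.Module, total_steps: int):
        super().__init__()
        self.teacher = teacher
        self.student = student
        self.total_steps = total_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))
        # Freeze the teacher; only the student receives gradients.
        for p in self.teacher.parameters():
            p.requires_grad_(False)

    def alpha(self) -> float:
        # Deterministic linear anneal from 0 (all teacher) to 1 (all student).
        return min(1.0, self.step.item() / self.total_steps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.alpha()
        with torch.no_grad():
            teacher_out = self.teacher(x)
        student_out = self.student(x)
        if self.training:
            self.step += 1
        # Convex combination with a deterministic weight: no stochastic gate,
        # hence no gate-induced gradient variance.
        return (1.0 - a) * teacher_out + a * student_out
```

Once $\alpha(t)$ reaches 1, the teacher branch contributes nothing and can be dropped, leaving the student module in place of the original operator.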