A Constrained Optimization Perspective of Unrolled Transformers
Javier Porras-Valenzuela · Samar Hadou · Alejandro Ribeiro
Abstract
This work introduces a constrained perspective on training transformers that behave like optimization descent algorithms. To this end, we impose layerwise descent constraints on the objective function and train with a primal-dual algorithm instead of empirical risk minimization (ERM). This method produces models that monotonically descend in expectation along the layers. We apply our method to both existing transformer-based unrollings and conventional pretrained transformers on video denoising and language classification tasks. The experimental evidence indicates that our method yields models that are more robust to perturbations.
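To make the training scheme concrete, below is a minimal sketch of one primal-dual update with layerwise descent constraints, assuming an unrolled model that exposes its per-layer iterates and a differentiable objective `f`. All names, signatures, and hyperparameters here (`objective`, `primal_dual_step`, `eta_dual`, and so on) are hypothetical illustrations, not the authors' implementation.

```python
import torch

# Hypothetical setup: model(batch) returns the per-layer iterates
# [x_0, x_1, ..., x_L] of an unrolled transformer, and objective(x)
# is the task loss f we want to descend layer by layer.

def primal_dual_step(model, objective, batch, lambdas,
                     opt_theta, eta_dual=1e-2):
    """One primal-dual update enforcing the layerwise descent
    constraints E[f(x_l)] <= E[f(x_{l-1})] for every layer l."""
    iterates = model(batch)                     # [x_0, x_1, ..., x_L]
    f_vals = [objective(x).mean() for x in iterates]

    # Constraint slacks: positive whenever a layer fails to descend.
    slacks = torch.stack([f_vals[l] - f_vals[l - 1]
                          for l in range(1, len(f_vals))])

    # Primal step: minimize the Lagrangian over the model parameters.
    lagrangian = f_vals[-1] + (lambdas * slacks).sum()
    opt_theta.zero_grad()
    lagrangian.backward()
    opt_theta.step()

    # Dual step: projected gradient ascent on the multipliers,
    # clamped to keep them dual-feasible (nonnegative).
    with torch.no_grad():
        lambdas += eta_dual * slacks.detach()
        lambdas.clamp_(min=0.0)
    return lagrangian.item()
```

In use, `lambdas` would be initialized as `torch.zeros(L)` for an `L`-layer model and `opt_theta` as any standard optimizer over `model.parameters()`; a multiplier grows whenever its layer violates descent, penalizing that violation in subsequent primal steps, which is the usual primal-dual mechanism the abstract contrasts with plain ERM.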