

Poster in Workshop: Machine Learning for Autonomous Driving

Spatial-Temporal Gated Transformers for Efficient Video Processing

Yawei Li · Babak Ehteshami Bejnordi · Bert Moons · Tijmen Blankevoort · Amirhossein Habibian · Radu Timofte · Luc V Gool


Abstract:

We focus on the problem of efficient video stream processing with fully transformer-based architectures. Recent advances brought by transformers on image-based tasks have inspired research interest in applying transformers to video. Yet, when image-based transformer solutions are applied directly to video, the computation becomes inefficient because of the redundant information shared by adjacent frames. An analysis of the computation cost of the video object detection framework DETR identifies the linear layers as the main computational bottleneck. We therefore propose dynamic gating layers that perform conditional computation: with the generated binary or ternary gates, the computation for stable background tokens in the video frames can be skipped. Experimental results validate the effectiveness of the dynamic gating mechanism for transformers. For video object detection, FLOPs are reduced by 48.3% without a significant drop in accuracy.
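To make the idea of token-level conditional computation concrete, the sketch below shows one way a gated linear layer for video tokens could look. It is an illustrative PyTorch sketch under our own assumptions, not the authors' implementation: the class name GatedTokenLinear, the single-logit gating head, and the zero threshold are hypothetical, the gate here is binary only (the ternary case is not covered), and the training-time relaxation of the hard gate is omitted.

```python
import torch
import torch.nn as nn


class GatedTokenLinear(nn.Module):
    """Illustrative sketch (hypothetical, not the authors' code): a linear layer
    that is evaluated only for tokens a lightweight gate marks as changed;
    the remaining tokens reuse the cached output from the previous frame."""

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.dim_out = dim_out
        self.linear = nn.Linear(dim_in, dim_out)
        # Tiny per-token gating head (assumed design choice).
        self.gate = nn.Linear(dim_in, 1)

    def forward(self, tokens: torch.Tensor, prev_out: torch.Tensor = None):
        # tokens: (batch, num_tokens, dim_in)
        b, n, _ = tokens.shape
        logits = self.gate(tokens).squeeze(-1)   # (batch, num_tokens)
        # Hard binary gate at inference; training would need a relaxation
        # such as Gumbel-Softmax or a straight-through estimator (omitted).
        keep = logits > 0                        # True = recompute this token

        if prev_out is None:
            # First frame: no cache yet, so compute all tokens.
            return self.linear(tokens), keep

        out = prev_out.clone()                   # start from cached features
        mask = keep.reshape(-1)
        flat_in = tokens.reshape(b * n, -1)
        flat_out = out.reshape(b * n, self.dim_out)
        # Only the "active" tokens pass through the linear layer,
        # which is where the FLOP savings would come from.
        flat_out[mask] = self.linear(flat_in[mask])
        return flat_out.reshape(b, n, self.dim_out), keep
```

As a usage pattern, the first frame would be processed with prev_out=None (full computation), and each subsequent frame would pass the previous frame's output as the cache, so that stable background tokens skip the linear projection while changed tokens are recomputed.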
