Poster
B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable
Shreyash Arya · Sukrut Rao · Moritz Böhle · Bernt Schiele
East Exhibit Hall A-C #3108
Understanding the decisions of deep neural networks (DNNs) has been a challenging task due to their ‘black-box’ nature. Methods such as feature attributions that attempt to explain the decisions of such models post-hoc, while popular, have been shown to often yield explanations that are not faithful to the model. Recently, B-cos networks were proposed as a means of instead designing such networks to be inherently interpretable by architecturally enforcing stronger alignment between inputs and weights, yielding highly human-interpretable explanations that are model-faithful by design. However, unlike post-hoc methods, this requires training new models from scratch, which is a major hurdle for establishing such novel models as an alternative to existing ones, particularly given the increasing reliance on large, pre-trained foundational models. In this work, inspired by the architectural similarities between standard DNNs and B-cos networks, we propose ‘B-cosification’, a novel approach to transform existing pre-trained models to become inherently interpretable. We perform a thorough study of the design choices involved in this conversion, for both convolutional neural networks and vision transformers. We find that B-cosification can yield models that are on par with B-cos models trained from scratch in terms of interpretability, while often outperforming them in terms of classification performance at a fraction of the training cost. Subsequently, we apply B-cosification to CLIP models and show that, even with limited data and compute, we obtain B-cosified CLIP models that are highly interpretable and competitive in zero-shot performance across a variety of datasets.
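To make the "alignment between inputs and weights" and the "architectural similarities" mentioned above more concrete, here is a minimal NumPy sketch of a single B-cos layer. The abstract does not give the formula; the form used here is an assumption taken from the earlier B-cos networks work, where each unit's output is the unit-norm linear response scaled by |cos(x, w)|^(B-1). The function name `bcos_linear` and its parameters are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import numpy as np


def bcos_linear(x, W, B=2.0, eps=1e-12):
    """Sketch of a B-cos transform for one input vector.

    Assumed form (not stated in the abstract):
        out_j = (w_hat_j . x) * |cos(x, w_j)|^(B - 1),
    where w_hat_j = w_j / ||w_j|| and the rows of W are the units' weights.
    """
    # Normalise each weight row to unit length.
    W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
    # Ordinary linear response with unit-norm weights.
    lin = W_hat @ x
    # Cosine similarity between the input and each weight row.
    cos = lin / (np.linalg.norm(x) + eps)
    # Down-weight responses of units whose weights are poorly aligned with x.
    return lin * np.abs(cos) ** (B - 1.0)


# Toy input and weights, made up for illustration.
x = np.array([3.0, 4.0])
W = np.array([[1.0, 0.0], [0.0, 2.0]])

out_b1 = bcos_linear(x, W, B=1.0)  # reduces to a linear layer with unit-norm weights
out_b2 = bcos_linear(x, W, B=2.0)  # alignment-scaled response
```

Under this assumed form, setting B = 1 leaves a plain linear layer with normalised weights, which is one way to read the architectural similarity that makes converting a pre-trained model plausible; larger B suppresses outputs from weights that are poorly aligned with the input.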