Workshop: Attributing Model Behavior at Scale (ATTRIB)

Adversarial Attacks on Neuron Interpretation via Activation Maximization

Alex Fulleringer · Geraldin Nanfack · Jonathan Marty · Michael Eickenberg · Eugene Belilovsky

Abstract: Feature visualization is one of the most popular techniques for interpreting the internal behavior of individual units of trained deep neural networks. Based on activation maximization, it consists of finding $\textit{synthetic}$ or $\textit{natural}$ inputs that maximize neuron activations. This paper introduces an optimization framework that aims to deceive feature visualization through adversarial model manipulation. The framework fine-tunes a pre-trained model with a loss designed to maintain model performance while significantly changing the feature visualization. We provide evidence of the success of this manipulation on several pre-trained models for the ImageNet classification task.
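
To make the two flavors of activation maximization mentioned in the abstract concrete, here is a minimal sketch. It is not the authors' code and does not implement their manipulation loss. It assumes a hypothetical toy "neuron" that is a single linear unit, and all names (`activation`, `synthetic_maximizer`, `natural_maximizers`) and values are illustrative. The synthetic variant runs projected gradient ascent on the input, and the natural variant picks the highest-activating inputs from a small dataset.

```python
import math

def activation(w, b, x):
    """Toy 'neuron' (an assumption for illustration): a single linear unit w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def normalize(x):
    """Project a vector onto the unit L2 sphere."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]

def synthetic_maximizer(w, steps=200, lr=0.1):
    """Synthetic input: projected gradient ascent on x, constrained to unit norm."""
    x = normalize([1.0] * len(w))  # arbitrary starting input
    for _ in range(steps):
        # For a linear unit, the gradient of the activation w.r.t. x is simply w.
        x = normalize([xi + lr * wi for xi, wi in zip(x, w)])
    return x

def natural_maximizers(w, b, dataset, k=1):
    """Natural inputs: the k dataset examples with the highest activation."""
    return sorted(dataset, key=lambda x: activation(w, b, x), reverse=True)[:k]

# Hypothetical weights and a tiny hypothetical dataset of unit-norm inputs.
w, b = [3.0, -4.0], 0.5
x_syn = synthetic_maximizer(w)
dataset = [[1.0, 0.0], [0.0, 1.0], [0.6, -0.8], [-0.6, 0.8]]
x_nat = natural_maximizers(w, b, dataset, k=1)[0]
```

In this toy setting both procedures agree on the direction of `w`. In a real network the unit is a nonlinear function of the input, the gradient is typically obtained by automatic differentiation, and regularizers are usually added to keep synthetic inputs interpretable.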
