NeurIPS Adversarial Attacks on Neuron Interpretation via Activation Maximization

Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)

Adversarial Attacks on Neuron Interpretation via Activation Maximization

Alex Fulleringer · Geraldin Nanfack · Jonathan Marty · Michael Eickenberg · Eugene Belilovsky


Abstract: Feature visualization is one of the most popular techniques to interpret the internal behavior of individual units of trained deep neural networks. Based on activation maximization, it consists of finding $\textit{synthetic}$ or $\textit{natural}$ inputs that maximize neuron activations. This paper introduces an optimization framework that aims to deceive feature visualization through adversarial model manipulation. It consists of fine-tuning a pre-trained model with a specifically introduced loss that aims to maintain model performance while significantly changing the feature visualization. We provide evidence of the success of this manipulation on several pre-trained models for the ImageNet classification task.
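
To make the two flavors of activation maximization mentioned in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's method: the toy single neuron (ReLU of a dot product), the unit-norm constraint on the input, and the names `neuron`, `synthetic_maximizer`, and `natural_maximizers` are all assumptions made for this example.

```python
import math

def neuron(weights, x):
    """Toy neuron (assumed for illustration): ReLU of a dot product."""
    return max(0.0, sum(w * xi for w, xi in zip(weights, x)))

def synthetic_maximizer(weights, steps=100, lr=0.1):
    """Synthetic input: gradient ascent on the input, kept on the unit sphere."""
    n = len(weights)
    x = [1.0 / math.sqrt(n)] * n  # start from a uniform unit-norm input
    for _ in range(steps):
        # The gradient of the pre-activation w.x with respect to x is w.
        # Using the pre-activation avoids the zero gradient of an inactive ReLU.
        x = [xi + lr * w for xi, w in zip(x, weights)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]  # project back to unit norm
    return x

def natural_maximizers(weights, dataset, k=1):
    """Natural inputs: the k dataset examples with the highest activation."""
    return sorted(dataset, key=lambda x: neuron(weights, x), reverse=True)[:k]
```

For this toy neuron the synthetic maximizer converges to the normalized weight vector, and the natural maximizer is simply the dataset example that best aligns with it. In a deep network the same two procedures are applied to a chosen unit, with the gradient obtained by backpropagation.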
