Improving interpretability via regularization of neural activation sensitivity

Ofir Moshe, Gil Fidel, Ron Bitton, Asaf Shabtai

Research output: Contribution to journal › Article › peer-review


State-of-the-art deep neural networks (DNNs) are highly effective at tackling many real-world tasks. However, their widespread adoption in mission-critical contexts is limited by two major weaknesses: their susceptibility to adversarial attacks and their opaqueness. The former raises concerns about DNNs’ security and generalization in real-world conditions, while the latter directly impacts interpretability. The lack of interpretability diminishes user trust, as it is difficult to have confidence in a model’s decision when its reasoning is not aligned with human perspectives. In this research, we (1) examine the effect of adversarial robustness on interpretability, and (2) present a novel approach for improving DNNs’ interpretability that is based on the regularization of neural activation sensitivity. We compare the interpretability of models trained using our method with that of standard models and of models trained using state-of-the-art adversarial robustness techniques. Our results show that adversarially robust models are more interpretable than standard models, and that models trained using our proposed method are more interpretable still than adversarially robust models. (Code provided in supplementary material.)
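
The abstract does not spell out the regularizer, so the following is only a minimal sketch of what "regularizing activation sensitivity" could mean, not the authors' formulation. It assumes that sensitivity is measured as the squared Frobenius norm of the Jacobian of a hidden layer's activations with respect to the input, and that this penalty is added to the task loss with a weight. The function names, the single tanh layer, and the weight `lam` are all hypothetical choices made for illustration.

```python
import math


def hidden_activations(W, b, x):
    """One tanh hidden layer: h_j = tanh(sum_i W[j][i] * x[i] + b[j])."""
    return [
        math.tanh(sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j)
        for row, b_j in zip(W, b)
    ]


def activation_sensitivity(W, b, x):
    """Squared Frobenius norm of the Jacobian dh/dx at input x.

    For tanh units the Jacobian is analytic:
    dh_j/dx_i = (1 - h_j**2) * W[j][i].
    (Assumed definition of "sensitivity"; the paper may use another.)
    """
    h = hidden_activations(W, b, x)
    return sum(
        ((1.0 - h_j ** 2) * w_ji) ** 2
        for h_j, row in zip(h, W)
        for w_ji in row
    )


def regularized_loss(task_loss, W, b, x, lam=0.1):
    """Task loss plus a weighted sensitivity penalty (lam is hypothetical)."""
    return task_loss + lam * activation_sensitivity(W, b, x)
```

Under these assumptions, minimizing the combined loss pushes the network toward hidden activations that change little under small input perturbations, which is one plausible reading of the link the abstract draws between robustness and interpretability.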

Original language: American English
Journal: Machine Learning
State: Accepted/In press - 1 Jan 2024

Keywords


  • Adversarial attack
  • Deep neural networks
  • Interpretability
  • Robustness

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

