Training Robust ML-based Raw-Binary Malware Detectors in Hours, not Months

Keane Lucas, Weiran Lin, Lujo Bauer, Michael K. Reiter, Mahmood Sharif

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Machine-learning (ML) classifiers are increasingly used to distinguish malware from benign binaries. Recent work has shown that ML-based detectors can be evaded by adversarial examples, but also that one may defend against such attacks via adversarial training. However, adversarial training and the subsequent robustness evaluation are computationally expensive in the raw-binary malware-detection domain because they require producing many adversarial examples for both training and evaluation. Prior work found that Greedy-training, a faster robust training technique that forgoes using adversarial examples, showed some promise in producing robust malware detectors. However, Greedy-training was far less effective in inducing robustness than the more expensive adversarial training, and it also severely hurt natural accuracy (i.e., accuracy on the original data). To train models faster, this work presents GreedyBlock-training, an enhanced version of Greedy-training that we empirically show not only achieves state-of-the-art robustness in malware detectors, exceeding even adversarial training, but also retains natural accuracy better than adversarial training. Furthermore, as it does not require creating adversarial (or functional) examples, GreedyBlock-training is significantly faster than adversarial training. Specifically, we show that GreedyBlock-training can produce more robust (+54% on average), more naturally accurate (+7% on average), and more efficiently trained (-91% average computation) malware detectors than prior work. To evaluate models faster, we also develop methods to gauge the robustness of ML-based raw-binary malware detectors more quickly by introducing robustness proxies, which can be used either to predict which models are likely to be the most robust, thus helping prioritize which detectors to evaluate with expensive attacks, or to help decide which detectors are worthwhile to continue training. Experimentally, we show these proxy measures can find the most robust detector in a pool of detectors while using only ~20-50% of the computation that would otherwise be required.

Original language: English
Title of host publication: CCS 2024 - Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security
Pages: 124-138
Number of pages: 15
ISBN (Electronic): 9798400706363
DOIs
State: Published - 9 Dec 2024
Event: 31st ACM SIGSAC Conference on Computer and Communications Security, CCS 2024 - Salt Lake City, United States
Duration: 14 Oct 2024 - 18 Oct 2024

Publication series

Name: CCS 2024 - Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security

Conference

Conference: 31st ACM SIGSAC Conference on Computer and Communications Security, CCS 2024
Country/Territory: United States
City: Salt Lake City
Period: 14/10/24 - 18/10/24

Keywords

  • adversarial robustness
  • machine learning
  • malware detection

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computer Science Applications
  • Software
