DataDetective: Dataset Watermarking for Leaker Identification in ML Training

Noa Wegerhoff, Avishag Shapira, Yuval Elovici, Asaf Shabtai

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Data owners (distributors) often share machine learning (ML) datasets with third-party collaborators (agents) for various purposes. While such collaborations can be mutually beneficial, they also introduce the risk of data leakage, i.e., the deliberate or accidental disclosure of sensitive ML datasets to unauthorized parties. Consequently, distributors may lose their intellectual property, experience reduced revenue, or violate data privacy regulations. In this paper, we propose a novel black-box dataset watermarking approach called DataDetective, which not only detects the unauthorized use of protected datasets but also identifies the agent responsible for the leakage. DataDetective, which leverages a backdoor technique, is composed of two processes. In the dataset watermarking process, a unique watermark signature is embedded into each agent's version of the dataset, which induces detectable, agent-specific behaviors in any model trained on the data. In the leaker identification process, the watermark signature embedded in a suspected model is extracted and compared to the signatures of all agents to identify the leaking agent. Extensive evaluations on benchmark datasets in the computer vision domain demonstrate our method's effectiveness: DataDetective achieved a perfect leaker identification rate with just 1% of the data watermarked. Moreover, DataDetective has a negligible impact on model accuracy. By providing a verifiable and robust solution for leaker attribution, DataDetective enhances accountability in collaborative ML environments. The code is available at https://github.com/NoaWegerhoff/data-detective.
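The comparison step in the leaker identification process can be illustrated with a toy sketch. This is not the authors' implementation: it assumes, purely for illustration, that a signature is a tuple of target labels (one per trigger pattern), that a suspected model can be queried as a black box with a trigger index, and that matching is a simple agreement fraction with a hypothetical threshold. All names (`extract_signature`, `identify_leaker`, `threshold`) and all signature values are invented for this example.

```python
from typing import Callable, Dict, Optional, Tuple

Signature = Tuple[int, ...]

# Hypothetical agent signatures: the label each agent's watermarked data
# teaches a model to predict for each of four trigger patterns.
signatures: Dict[str, Signature] = {
    "agent_a": (0, 3, 5, 7),
    "agent_b": (1, 4, 6, 8),
    "agent_c": (2, 2, 9, 0),
}


def extract_signature(model: Callable[[int], int], num_triggers: int) -> Signature:
    """Query the black-box model once per trigger and record its predicted labels."""
    return tuple(model(trigger) for trigger in range(num_triggers))


def identify_leaker(
    model: Callable[[int], int],
    agent_signatures: Dict[str, Signature],
    threshold: float = 0.5,  # assumed cutoff below which no watermark is reported
) -> Tuple[Optional[str], Dict[str, float]]:
    """Return the best-matching agent (or None) and the per-agent match scores."""
    num_triggers = len(next(iter(agent_signatures.values())))
    observed = extract_signature(model, num_triggers)
    scores = {
        agent: sum(o == s for o, s in zip(observed, sig)) / len(sig)
        for agent, sig in agent_signatures.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores
    return best, scores


# Stand-in for a model trained on agent_b's copy; one trigger misfires.
suspect_outputs = (1, 4, 6, 3)


def suspect_model(trigger: int) -> int:
    return suspect_outputs[trigger]


# Stand-in for a model trained on unrelated data: it ignores the triggers.
def clean_model(trigger: int) -> int:
    return 5


print(identify_leaker(suspect_model, signatures))
print(identify_leaker(clean_model, signatures))
```

In this sketch the suspected model matches three of agent_b's four labels and none of the others, so agent_b is reported, while the clean model falls below the assumed threshold and no leaker is named. A real system would query an image classifier with triggered inputs rather than an index lookup.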

Original language: American English
Title of host publication: ECAI 2024 - 27th European Conference on Artificial Intelligence, Including 13th Conference on Prestigious Applications of Intelligent Systems, PAIS 2024, Proceedings
Editors: Ulle Endriss, Francisco S. Melo, Kerstin Bach, Alberto Bugarin-Diz, Jose M. Alonso-Moral, Senen Barro, Fredrik Heintz
Publisher: IOS Press BV
Pages: 2442-2451
Number of pages: 10
ISBN (Electronic): 9781643685489
State: Published - 16 Oct 2024
Event: 27th European Conference on Artificial Intelligence, ECAI 2024 - Santiago de Compostela, Spain
Duration: 19 Oct 2024 to 24 Oct 2024

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Volume: 392

Conference

Conference: 27th European Conference on Artificial Intelligence, ECAI 2024
Country/Territory: Spain
City: Santiago de Compostela
Period: 19/10/24 to 24/10/24

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
