Dynamically localizing multiple speakers based on the time-frequency domain

Hodaya Hammer, Shlomo E. Chazan, Jacob Goldberger, Sharon Gannot

Research output: Contribution to journal › Article › Peer-review

Abstract

In this study, we present a deep neural network-based online multi-speaker localization algorithm that uses a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. This high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and moving. An extensive experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a by-product, we show that the proposed method can also separate moving speakers by applying the obtained TF masks.
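The abstract describes two uses of per-TF-bin DOA estimates: localizing several speakers and building TF masks for separation. The sketch below illustrates only the post-processing idea under stated assumptions. It is not the authors' network or their exact decision rule. The input `probs` is a hypothetical stand-in for the network output, a per-bin probability distribution over `D` discrete DOA classes. The number of speakers is assumed known. The per-frame peak picking with neighbour suppression (`radius`) and the nearest-DOA mask assignment are illustrative choices. Because every frame is processed independently, the estimates can follow moving speakers, although no temporal smoothing is applied here.

```python
import numpy as np


def localize_frames(probs: np.ndarray, num_speakers: int, radius: int = 1) -> np.ndarray:
    """Estimate per-frame DOA classes from per-TF-bin DOA posteriors.

    probs: shape (T, F, D), a probability distribution over D DOA classes
    for every frame t and frequency bin f (hypothetical network output).
    Returns an int array of shape (T, num_speakers).
    """
    # Under W-disjoint orthogonality each TF bin "votes" for one DOA, so
    # summing the posteriors over frequency yields a per-frame DOA histogram.
    hist = probs.sum(axis=1)  # (T, D)
    T, D = hist.shape
    doas = np.zeros((T, num_speakers), dtype=int)
    for t in range(T):
        h = hist[t].copy()
        for k in range(num_speakers):
            d = int(np.argmax(h))
            doas[t, k] = d
            # Suppress neighbouring classes so one speaker is not picked twice.
            h[max(0, d - radius):min(D, d + radius + 1)] = -np.inf
    return doas


def separation_masks(probs: np.ndarray, doas: np.ndarray) -> np.ndarray:
    """Binary TF masks of shape (K, T, F): each bin is assigned to the
    speaker whose estimated DOA class (in that frame) is nearest to the
    bin's most likely DOA class. Classes are treated as a linear scale."""
    bin_doa = probs.argmax(axis=2)  # (T, F)
    dist = np.abs(bin_doa[:, :, None] - doas[:, None, :])  # (T, F, K)
    nearest = dist.argmin(axis=2)  # (T, F)
    return np.stack([nearest == k for k in range(doas.shape[1])])


# Toy example: 4 frames, 8 frequency bins, 13 DOA classes.
T, F, D = 4, 8, 13
probs = np.full((T, F, D), 0.01)
probs[:, :5, 3] = 1.0   # low bins dominated by a speaker at class 3
probs[:, 5:, 10] = 1.0  # high bins dominated by a speaker at class 10
probs /= probs.sum(axis=2, keepdims=True)
doas = localize_frames(probs, num_speakers=2)
masks = separation_masks(probs, doas)
```

To separate speaker `k`, one would multiply `masks[k]` element-wise with the (T, F) STFT of a reference microphone and apply an inverse STFT; that step is omitted here.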

Original language: English
Article number: 16
Journal: Eurasip Journal on Audio, Speech, and Music Processing
Volume: 2021
Issue number: 1
State: Published - Dec 2021
Externally published: Yes

Keywords

  • DOA
  • Tracking
  • UNET

All Science Journal Classification (ASJC) codes

  • Acoustics and Ultrasonics
  • Electrical and Electronic Engineering
