Abstract
In this study, we present a deep neural network-based online multi-speaker localization algorithm for a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An extensive experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we show that the proposed method can also separate moving speakers by applying the obtained TF masks.
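To make the per-bin idea concrete, the following is a minimal sketch of what could be done with a map of DOA class decisions, one per TF bin, under the W-disjoint orthogonality assumption. It is not the authors' implementation: the network is replaced by a hand-written toy array, the 13-class DOA grid is an assumption, and the helper names `doa_histogram`, `top_speakers`, and `masks_from_doa` are hypothetical.

```python
import numpy as np

def doa_histogram(doa_map, n_classes):
    """Count, for each time frame, how many frequency bins vote for each DOA class.

    doa_map: integer array of shape (n_freq, n_frames); each entry is the
             DOA class assigned to that TF bin (here a toy stand-in for a
             network output).
    Returns an integer array of shape (n_frames, n_classes).
    """
    n_freq, n_frames = doa_map.shape
    hist = np.zeros((n_frames, n_classes), dtype=int)
    for t in range(n_frames):
        hist[t] = np.bincount(doa_map[:, t], minlength=n_classes)
    return hist

def top_speakers(hist_row, n_speakers):
    """Return the n_speakers most-voted DOA classes of one frame, in ascending order."""
    return np.sort(np.argsort(hist_row)[-n_speakers:])

def masks_from_doa(doa_map, speaker_classes, tol=1):
    """Build one binary TF mask per speaker.

    A bin belongs to a speaker if its DOA class lies within `tol` classes
    of that speaker's class (tol is an assumed tolerance for slow movement).
    """
    return [np.abs(doa_map - c) <= tol for c in speaker_classes]

# Toy example (assumed values): 6 frequency bins, 2 frames, 13 DOA classes.
# One speaker moves from class 3 to class 4; a second one stays at class 10.
doa_map = np.array([
    [3, 4],
    [3, 4],
    [3, 4],
    [3, 4],
    [10, 10],
    [10, 10],
])
hist = doa_histogram(doa_map, n_classes=13)
per_frame = [top_speakers(hist[t], n_speakers=2) for t in range(hist.shape[0])]
masks = masks_from_doa(doa_map, speaker_classes=[3, 10], tol=1)

print(per_frame)                      # dominant DOA classes for each frame
print([int(m.sum()) for m in masks])  # number of TF bins assigned to each speaker
```

Multiplying each mask element-wise with the mixture spectrogram would give a rough per-speaker estimate, which is the sense in which TF masks allow separation as a byproduct of localization.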
| Field | Value |
|---|---|
| Original language | English |
| Article number | 16 |
| Journal | Eurasip Journal on Audio, Speech, and Music Processing |
| Volume | 2021 |
| Issue number | 1 |
| DOIs | |
| State | Published - Dec 2021 |
| Externally published | Yes |
Keywords
- DOA
- Tracking
- UNET
All Science Journal Classification (ASJC) codes
- Acoustics and Ultrasonics
- Electrical and Electronic Engineering
Fingerprint
Dive into the research topics of 'Dynamically localizing multiple speakers based on the time-frequency domain'. Together they form a unique fingerprint.