Abstract
The increasing complexity of real-world environments, where multiple speakers may talk simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention aimed at noisy and reverberant environments. We dub this new architecture Separation TF Attention Network (Sep-TFAnet). Additionally, we introduce a variant, Sep-TFAnetVAD, which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with several modifications. Instead of using a learned encoder and decoder, we employ the short-time Fourier transform (STFT) and inverse short-time Fourier transform (iSTFT) for analysis and synthesis, respectively. Our system is specifically developed for human-robot interaction and supports a block processing mode. While considerable progress has been made in separating overlapping speech signals, most studies have focused on mixtures of speech signals with simulated reverberation and have not covered real-world scenarios. To address this limitation, we introduce the ARImulti-mic dataset, which incorporates real-world experiments. The recordings were made in the acoustic laboratory at Bar-Ilan University and captured by a humanoid robot. Throughout this paper, we focus on a single-microphone setting. Extensive evaluation of the proposed methods on this dataset and on carefully simulated data demonstrates advantages over competing methods. The ARImulti-mic dataset is available at DataPort, and examples of our algorithm applied to this dataset can be found on the project page: https://Sep-TFAnet.github.io.
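As a rough illustration of what using a fixed STFT/iSTFT pair for analysis and synthesis means, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a masking formulation in which each speaker is obtained by applying a real-valued TF mask to the mixture spectrogram; the abstract does not say whether Sep-TFAnet works this way. The frame length (512), hop size (128), Hann window, and the names `stft`, `istft`, `separate`, and `toy_masks` are all hypothetical choices made for this example. `toy_masks` is a stand-in for the network and simply splits the spectrum into a low and a high band.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Fixed analysis transform: Hann-windowed frames followed by a real FFT."""
    assert n_fft % hop == 0, "this sketch assumes the hop divides the frame length"
    win = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    # Zero-pad so every original sample is covered by a full set of frames.
    x = np.pad(x, (n_fft, n_fft + (-len(x)) % hop))
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.fft.rfft(x[idx] * win, axis=-1).T  # shape (freq, time)


def istft(spec, length, n_fft=512, hop=128):
    """Fixed synthesis transform: windowed overlap-add with normalization."""
    win = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec.T, n=n_fft, axis=-1) * win
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for t in range(n_frames):
        out[t * hop:t * hop + n_fft] += frames[t]
        norm[t * hop:t * hop + n_fft] += win ** 2
    # Drop the leading padding and keep exactly `length` samples.
    return out[n_fft:n_fft + length] / norm[n_fft:n_fft + length]


def separate(mixture, mask_fn, n_fft=512, hop=128):
    """Analysis -> per-speaker masking -> synthesis (masking is an assumption)."""
    spec = stft(mixture, n_fft, hop)
    masks = mask_fn(np.abs(spec))  # shape (n_speakers, freq, time)
    return np.stack([istft(m * spec, len(mixture), n_fft, hop) for m in masks])


def toy_masks(magnitude):
    """Hypothetical stand-in for the network: low band vs. high band."""
    n_freq = magnitude.shape[0]
    low = (np.arange(n_freq) < n_freq // 2)[:, None] * np.ones_like(magnitude)
    return np.stack([low, 1.0 - low])


rng = np.random.default_rng(0)
mixture = rng.standard_normal(16000)  # one second at an assumed 16 kHz rate
estimates = separate(mixture, toy_masks)
print(estimates.shape)
```

Because the two toy masks sum to one in every TF bin and the transform pair reconstructs its input, the two outputs add back up to the mixture. In a real system the mask function would be replaced by the trained separation network, and a block processing mode would run the same analysis and synthesis on successive signal blocks.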
Original language | English |
---|---|
Article number | 18 |
Journal | EURASIP Journal on Audio, Speech, and Music Processing |
Volume | 2025 |
Issue number | 1 |
State | Published - Dec 2025 |
Keywords
- Speaker separation
- Temporal convolutional networks
- Voice activity detection
All Science Journal Classification (ASJC) codes
- Acoustics and Ultrasonics
- Electrical and Electronic Engineering