Single-microphone speaker separation and voice activity detection in noisy and reverberant environments

Renana Opochinsky, Mordehay Moradi, Sharon Gannot

Research output: Contribution to journal › Article › peer-review

Abstract

The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention aimed at noisy and reverberant environments. We dub this new architecture Separation TF Attention Network (Sep-TFAnet). Additionally, we introduce Sep-TFAnetVAD, a variant that incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with several modifications. Instead of using a learned encoder and decoder, we employ the short-time Fourier transform (STFT) and inverse short-time Fourier transform (iSTFT) for analysis and synthesis, respectively. Our system is specifically developed for human-robot interaction and supports a block-processing mode. While considerable progress has been made in separating overlapping speech signals, most studies have focused primarily on mixtures of simulated, reverberated speech signals and have not addressed real-world scenarios. To address this limitation, we introduce the ARImulti-mic dataset, which incorporates real-world experiments. These recordings were carried out in the acoustic laboratory at Bar-Ilan University and captured by a humanoid robot. Throughout this paper, we focus on a single-microphone setting. Extensive evaluation of the proposed methods using this dataset and carefully simulated data demonstrated advantages over competing methods. The ARImulti-mic dataset is available at DataPort, and examples of our algorithm applied to this dataset can be found on the project page: https://Sep-TFAnet.github.io.
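For orientation, the following is a minimal, hypothetical PyTorch sketch of the processing chain the abstract describes: fixed STFT analysis, a TCN-style masking network operating on the time-frequency representation, iSTFT synthesis, and an optional frame-level VAD head (as in Sep-TFAnetVAD). The class name, layer sizes, and the simplified dilated-convolution stack standing in for the full TCN and TF-attention blocks are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SepTFAnetSketch(nn.Module):
    """Hypothetical sketch of the pipeline described in the abstract:
    STFT analysis -> TCN-style mask estimation -> iSTFT synthesis,
    plus an optional per-speaker, per-frame VAD head. Sizes are illustrative."""

    def __init__(self, n_fft=512, hop=128, hidden=256, n_speakers=2, with_vad=False):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.n_speakers, self.with_vad = n_speakers, with_vad
        n_bins = n_fft // 2 + 1
        # Stand-in for the TCN backbone: dilated 1-D convolutions over time.
        self.tcn = nn.Sequential(
            nn.Conv1d(n_bins, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
        )
        # One TF mask per speaker, estimated frame by frame.
        self.mask_head = nn.Conv1d(hidden, n_bins * n_speakers, kernel_size=1)
        if with_vad:
            self.vad_head = nn.Conv1d(hidden, n_speakers, kernel_size=1)

    def forward(self, wav):                       # wav: (batch, samples)
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, self.hop, window=window,
                          return_complex=True)    # (batch, bins, frames)
        feats = self.tcn(spec.abs())              # magnitude features over time
        masks = torch.sigmoid(self.mask_head(feats))
        masks = masks.view(wav.shape[0], self.n_speakers, -1, masks.shape[-1])
        est = masks * spec.unsqueeze(1)           # mask the mixture STFT per speaker
        wavs = torch.istft(est.flatten(0, 1), self.n_fft, self.hop,
                           window=window, length=wav.shape[-1])
        wavs = wavs.view(wav.shape[0], self.n_speakers, -1)
        if self.with_vad:
            vad = torch.sigmoid(self.vad_head(feats))  # per-speaker frame activity
            return wavs, vad
        return wavs
```

A call such as `model = SepTFAnetSketch(with_vad=True)` followed by `separated, vad = model(torch.randn(1, 16000))` returns per-speaker waveforms and per-speaker, per-frame activity estimates. The use of a fixed STFT/iSTFT pair for analysis and synthesis, rather than a learned encoder/decoder, reflects the main departure from Conv-TasNet highlighted in the abstract.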

Original language: English
Article number: 18
Journal: EURASIP Journal on Audio, Speech, and Music Processing
Volume: 2025
Issue number: 1
State: Published - Dec 2025

Keywords

  • Speaker separation
  • Temporal convolutional networks
  • Voice activity detection

All Science Journal Classification (ASJC) codes

  • Acoustics and Ultrasonics
  • Electrical and Electronic Engineering
