A deep architecture for audio-visual voice activity detection in the presence of transients

Ido Ariav, David Dov, Israel Cohen

Research output: Contribution to journal › Article › peer-review

Abstract

We address the problem of voice activity detection in difficult acoustic environments, including high levels of noise and transients, which are common in real-life scenarios. We consider a multimodal setting in which the speech signal is captured by a microphone and a video camera is pointed at the face of the desired speaker. Accordingly, speech detection translates into the question of how to properly fuse the audio and video signals, which we address within the framework of deep learning. Specifically, we present a neural network architecture based on a variant of auto-encoders that combines the two modalities and provides a new representation of the signal in which the effect of interferences is reduced. To further encode differences between the dynamics of speech and of interfering transients, the signal in this new representation is fed into a recurrent neural network, which is trained in a supervised manner for speech detection. Experimental results demonstrate improved performance of the proposed deep architecture compared with competing multimodal detectors.
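The pipeline described in the abstract — per-frame audio and video features fused into a shared latent representation, which a recurrent network then turns into a speech/non-speech decision — can be sketched as follows. This is a minimal, untrained NumPy illustration, not the paper's architecture: the feature dimensions, the single-layer tanh encoder standing in for the auto-encoder variant, and the vanilla RNN cell are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(audio_feat, video_feat, W_a, W_v):
    """Project each modality and fuse into a shared latent code.

    Stands in for the auto-encoder-based fusion; in the paper this
    representation is learned so that interference effects are reduced.
    """
    return np.tanh(audio_feat @ W_a + video_feat @ W_v)

def rnn_step(h, x, W_h, W_x):
    """One vanilla RNN step over the fused representation."""
    return np.tanh(h @ W_h + x @ W_x)

# Toy dimensions (illustrative): 40-dim audio features, 64-dim video
# features, a 16-dim fused code, an 8-dim hidden state, 5 time frames.
d_a, d_v, d_z, d_h, T = 40, 64, 16, 8, 5
W_a = rng.normal(scale=0.1, size=(d_a, d_z))
W_v = rng.normal(scale=0.1, size=(d_v, d_z))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_z, d_h))
w_out = rng.normal(scale=0.1, size=d_h)

audio = rng.normal(size=(T, d_a))   # placeholder audio features per frame
video = rng.normal(size=(T, d_v))   # placeholder video features per frame

# Run the sequence: fuse the modalities frame by frame, then let the
# recurrent state accumulate the temporal dynamics.
h = np.zeros(d_h)
for t in range(T):
    z = encode(audio[t], video[t], W_a, W_v)
    h = rnn_step(h, z, W_h, W_x)

# A sigmoid over the final hidden state gives a speech probability;
# in practice all weights would be trained in a supervised manner.
p_speech = 1.0 / (1.0 + np.exp(-h @ w_out))
print(p_speech)
```

With random weights the output is of course meaningless; the sketch only shows how the fused representation decouples the modality-specific projections from the recurrent temporal model that discriminates speech dynamics from transients.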

Original language: English
Pages (from-to): 69-74
Number of pages: 6
Journal: Signal Processing
Volume: 142
DOIs
State: Published - Jan 2018

Keywords

  • Audio-visual speech processing
  • Auto-encoder
  • Recurrent neural networks
  • Voice activity detection

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
