Abstract
Spoken term detection (STD) is the task of determining whether and where a given word or phrase appears in a given segment of speech. Algorithms for STD often aim to maximize the gap between the scores of positive and negative examples. As such, they focus on ensuring that utterances where the term appears are ranked higher than utterances where it does not. However, they do not determine a detection threshold between the two. In this paper, we propose a new approach for setting an absolute detection threshold for all terms by introducing a new calibrated loss function. The advantage of minimizing this loss function during training is that it not only maximizes the relative ranking scores but also adjusts the system to use a fixed threshold, and thus maximizes the detection accuracy rates. We use the new loss function in the structured prediction setting and extend the discriminative keyword spotting algorithm to learn a spoken term detector with a single threshold for all terms. We further demonstrate the effectiveness of the new loss function by training a deep neural Siamese network in a weakly supervised setting for template-based STD, again with a single fixed threshold. Experiments with the TIMIT, Wall Street Journal (WSJ), and Switchboard corpora showed that our approach not only improved the accuracy rates when a fixed threshold was used but also obtained a higher area under the curve (AUC).
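The difference between a ranking-only objective and a calibrated one can be illustrated with a small sketch. The abstract does not state the form of the loss, so the hinge-style functions below, along with the names `ranking_hinge_loss`, `calibrated_hinge_loss`, `detect`, and the `threshold` and `margin` parameters, are assumptions chosen for illustration and may differ from the loss defined in the paper.

```python
def ranking_hinge_loss(pos_score, neg_score, margin=1.0):
    """Ranking-only loss: zero once the positive outscores the negative by `margin`.

    It says nothing about where either score lies relative to a threshold.
    """
    return max(0.0, margin - pos_score + neg_score)


def calibrated_hinge_loss(pos_score, neg_score, threshold=0.0, margin=1.0):
    """Hypothetical calibrated loss (an assumption, not the paper's definition).

    Each score is penalized separately unless it lies on the correct side
    of the fixed threshold by at least `margin`.
    """
    pos_term = max(0.0, margin - (pos_score - threshold))  # positive should exceed threshold
    neg_term = max(0.0, margin + (neg_score - threshold))  # negative should fall below it
    return pos_term + neg_term


def detect(score, threshold=0.0):
    """Fixed-threshold decision shared by all terms."""
    return score >= threshold


if __name__ == "__main__":
    # Correctly ranked pair, but both scores sit above the threshold of 0.
    pos, neg = 5.0, 3.0
    print(ranking_hinge_loss(pos, neg))     # ranking objective is already satisfied
    print(calibrated_hinge_loss(pos, neg))  # calibrated objective still penalizes the negative
    print(detect(neg))                      # the negative would be a false alarm
```

In this sketch, a zero calibrated loss implies the positive and negative scores are separated by at least twice the margin, so the ranking objective is satisfied as a side effect, while the reverse does not hold.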
Original language | English |
---|---|
Article number | 8070931 |
Pages (from-to) | 1310-1317 |
Number of pages | 8 |
Journal | IEEE Journal on Selected Topics in Signal Processing |
Volume | 11 |
Issue number | 8 |
DOIs | |
State | Published - Dec 2017 |
Keywords
- AUC maximization
- Spoken term detection
- deep neural networks
- keyword spotting
- structured prediction
All Science Journal Classification (ASJC) codes
- Signal Processing
- Electrical and Electronic Engineering