Abstract
In this paper, we propose an unsupervised kNN-based approach for word segmentation in speech utterances. Our method relies on self-supervised pre-trained speech representations and compares each audio segment of a given utterance to its k nearest neighbors within the training set. Our main assumption is that a segment containing more than one word would occur less often than a segment containing a single word. Our method does not require phoneme discovery and operates directly on pre-trained audio representations. This is in contrast to current methods that use a two-stage approach: first detecting the phonemes in the utterance, then detecting word boundaries according to statistics calculated on phoneme patterns. Experiments on two datasets demonstrate improved results over previous single-stage methods and results competitive with state-of-the-art two-stage methods.
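The core idea of scoring a segment by its nearest neighbors can be illustrated with a minimal sketch. This is not the authors' implementation: the mean pooling, the Euclidean distance, the function names `segment_embedding` and `knn_score`, and the random stand-in features are all assumptions made for illustration. The sketch only shows that a segment resembling many training segments receives a lower kNN distance than one that resembles few.

```python
import numpy as np


def segment_embedding(frames: np.ndarray, start: int, end: int) -> np.ndarray:
    """Mean-pool frame-level features over frames[start:end].

    Mean pooling is an assumed choice, not taken from the paper.
    """
    return frames[start:end].mean(axis=0)


def knn_score(query: np.ndarray, bank: np.ndarray, k: int = 3) -> float:
    """Mean Euclidean distance from `query` to its k nearest rows of `bank`.

    A lower score means the segment looks like many training segments,
    which under the abstract's assumption suggests a single word.
    Requires k <= len(bank).
    """
    dists = np.linalg.norm(bank - query, axis=1)
    nearest = np.partition(dists, k - 1)[:k]  # the k smallest distances
    return float(nearest.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-in for embeddings of training-set segments.
    bank = rng.normal(0.0, 0.1, size=(200, 8))
    frequent = np.zeros(8)     # lies inside the dense region of the bank
    rare = np.full(8, 2.0)     # lies far from every bank entry
    print("frequent:", knn_score(frequent, bank))
    print("rare:    ", knn_score(rare, bank))
```

In a full system, candidate segments of an utterance would presumably be scored this way and boundaries chosen to favor low-scoring segments; how the paper selects boundaries is not stated in the abstract, so that step is left out here.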
Original language | English |
---|---|
Title of host publication | Proc. Interspeech 2022 |
Pages | 4646-4650 |
Number of pages | 5 |
DOIs | |
State | Published - 2022 |