TY - GEN
T1 - Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
AU - Kharitonov, Eugene
AU - Rivière, Morgane
AU - Synnaeve, Gabriel
AU - Wolf, Lior
AU - Mazaré, Pierre-Emmanuel
AU - Douze, Matthijs
AU - Dupoux, Emmanuel
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021/1/19
Y1 - 2021/1/19
AB - Contrastive Predictive Coding (CPC), based on predicting future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of the speech signal. However, it still under-performs compared to other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library which we adapt and optimize for the specificities of CPC (raw waveform input, contrastive loss, past versus future structure). We find that applying augmentation only to the segments from which the CPC prediction is performed yields better results than applying it also to the future segments from which the samples (both positive and negative) of the contrastive loss are drawn. After selecting the best combination of pitch modification, additive noise and reverberation on unsupervised metrics on LibriSpeech (with a gain of 18-22% relative on the ABX score), we apply this combination without any change to three new datasets in the Zero Resource Speech Benchmark 2017 and beat the state of the art using out-of-domain training data. Finally, we show that the data-augmented pretrained features improve a downstream phone recognition task in the Libri-light semi-supervised setting (10 min, 1 h, or 10 h of labelled data), reducing the PER by 15% relative.
KW - contrastive predictive coding
KW - data augmentation
KW - unsupervised representation learning
UR - http://www.scopus.com/inward/record.url?scp=85103929279&partnerID=8YFLogxK
U2 - 10.1109/SLT48900.2021.9383605
DO - 10.1109/SLT48900.2021.9383605
M3 - Conference contribution
T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
SP - 215
EP - 222
BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021
Y2 - 19 January 2021 through 22 January 2021
ER -
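
Note: the abstract above describes time-domain augmentations (pitch modification, additive noise, reverberation) applied only to the past segments read by the CPC context network, while the future segments that supply the positive and negative samples stay clean. The sketch below is an illustrative reconstruction of that pipeline, not the WavAugment API itself; the function name, the parameter ranges, and the use of white noise in place of real noise clips are all assumptions. It relies on torchaudio's sox-effect bindings and a 16 kHz mono waveform.

# A minimal sketch (assumptions noted inline) of the past-segment-only
# augmentation the abstract describes: pitch shift + reverberation via sox,
# then additive noise at a random SNR.
import random
import torch
import torchaudio

SAMPLE_RATE = 16000

def augment_past_segment(wav: torch.Tensor) -> torch.Tensor:
    """Augment a (1, num_samples) float waveform tensor.

    Only the segment fed to the CPC context network (the "past") should
    pass through here; the future segments used for the contrastive loss
    are left unmodified, per the finding reported in the abstract.
    """
    effects = [
        ["pitch", str(random.randint(-300, 300))],  # cents; range is an assumption
        ["rate", str(SAMPLE_RATE)],                 # restore the sampling rate after pitch
        ["reverb", str(random.randint(0, 100))],    # sox reverberance, in percent
        ["channels", "1"],                          # keep the output mono
    ]
    wav, _ = torchaudio.sox_effects.apply_effects_tensor(wav, SAMPLE_RATE, effects)

    # Additive noise at a random SNR; white noise stands in for the real
    # noise clips the paper draws from.
    snr_db = random.uniform(5.0, 15.0)
    noise = torch.randn_like(wav)
    scale = wav.norm(p=2) / (noise.norm(p=2) * 10 ** (snr_db / 20))
    return wav + scale * noise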