Abstract
Many speech enhancement methods require perceptual quality metrics for evaluation. The gold standard of perceptual speech quality assessment is subjective human rating, summarized as the mean opinion score (MOS). However, acquiring human ratings is time-consuming, laborious, and expensive. Existing objective quality metrics, on the other hand, are efficient and easy to compute but do not correlate well with human ratings. In this paper, we propose a relatively lightweight deep-learning-based model to predict the human ratings of speech signals. Because the model is differentiable, it can readily be used as a perceptual regularizer to improve existing deep-learning-based speech enhancement methods. Experimental results demonstrate that the predictions of our proposed model correlate well with human judgments. We present an application to speech enhancement and show that, although performance degrades in terms of traditional objective metrics, the perceptual quality and naturalness of the enhanced speech improve significantly.
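The idea of using a quality predictor as a perceptual regularizer can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the mean-squared-error signal term, the weight value, the 1-to-5 MOS range, and the toy predictor are all assumptions made for illustration. In practice the loss would be written in an automatic-differentiation framework so that gradients flow through the predictor into the enhancement network.

```python
from typing import Callable, Sequence

MOS_MAX = 5.0  # assumed top of a 1-to-5 MOS scale; the abstract does not state the range


def mse(enhanced: Sequence[float], clean: Sequence[float]) -> float:
    """Mean squared error between the enhanced and the clean signal."""
    return sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)


def regularized_loss(
    enhanced: Sequence[float],
    clean: Sequence[float],
    predict_mos: Callable[[Sequence[float]], float],
    weight: float = 0.1,  # hypothetical trade-off weight
) -> float:
    """Signal-level loss plus a penalty that grows as the predicted MOS drops."""
    signal_term = mse(enhanced, clean)
    perceptual_term = MOS_MAX - predict_mos(enhanced)
    return signal_term + weight * perceptual_term


def toy_predictor(signal: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned MOS predictor (NOT the paper's model).

    Maps mean signal energy to a score in (1, 5], purely to make the sketch runnable.
    """
    energy = sum(x * x for x in signal) / len(signal)
    return 1.0 + 4.0 / (1.0 + energy)


if __name__ == "__main__":
    print(regularized_loss([1.0, 1.0], [0.0, 0.0], toy_predictor))
```

A trade-off of this kind, where the optimizer gives up some signal-level accuracy in exchange for a higher predicted rating, is consistent with the abstract's observation that traditional objective metrics degrade while perceptual quality improves.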
Original language | English |
---|---|
Pages (from-to) | 159-163 |
Number of pages | 5 |
Journal | Pattern Recognition Letters |
Volume | 166 |
State | Published - Feb 2023 |
Keywords
- Mean opinion score
- Speech enhancement
- Speech naturalness assessment
- Speech quality assessment
All Science Journal Classification (ASJC) codes
- Software
- Artificial Intelligence
- Signal Processing
- Computer Vision and Pattern Recognition