TY - GEN
T1 - Multiple speaker localization using mixture of Gaussian model with manifold-based centroids
AU - Bross, Avital
AU - Laufer-Goldshtein, Bracha
AU - Gannot, Sharon
N1 - Publisher Copyright: © 2021 European Signal Processing Conference, EUSIPCO. All rights reserved.
PY - 2021/1/24
Y1 - 2021/1/24
N2 - A data-driven approach for multiple-speaker localization in reverberant enclosures is presented. The approach combines semi-supervised learning on multiple manifolds with unsupervised maximum likelihood estimation. Relative transfer functions (RTFs), which are known to be related to source positions, are used as feature vectors in both stages of the proposed algorithm. The microphone positions are not known. In the training stage, a nonlinear, manifold-based mapping between RTFs and source locations is inferred using single-speaker utterances. The inference procedure utilizes two RTF datasets: a small set of RTFs with their associated position labels, and a large set of unlabelled RTFs. This mapping is used to generate a dense grid of localized sources that serve as the centroids of a Mixture of Gaussians (MoG) model, which is used in the test stage of the algorithm to cluster RTFs extracted from multiple-speaker utterances. Clustering is performed by applying the expectation-maximization (EM) procedure that relies on the sparsity and intermittency of the speech signals. A preliminary experimental study, with either two or three overlapping speakers in various reverberation levels, demonstrates that the proposed scheme achieves high localization accuracy compared to a baseline method using a simpler propagation model.
AB - A data-driven approach for multiple-speaker localization in reverberant enclosures is presented. The approach combines semi-supervised learning on multiple manifolds with unsupervised maximum likelihood estimation. Relative transfer functions (RTFs), which are known to be related to source positions, are used as feature vectors in both stages of the proposed algorithm. The microphone positions are not known. In the training stage, a nonlinear, manifold-based mapping between RTFs and source locations is inferred using single-speaker utterances. The inference procedure utilizes two RTF datasets: a small set of RTFs with their associated position labels, and a large set of unlabelled RTFs. This mapping is used to generate a dense grid of localized sources that serve as the centroids of a Mixture of Gaussians (MoG) model, which is used in the test stage of the algorithm to cluster RTFs extracted from multiple-speaker utterances. Clustering is performed by applying the expectation-maximization (EM) procedure that relies on the sparsity and intermittency of the speech signals. A preliminary experimental study, with either two or three overlapping speakers in various reverberation levels, demonstrates that the proposed scheme achieves high localization accuracy compared to a baseline method using a simpler propagation model.
KW - Manifold-learning
KW - Mixture of Gaussians
KW - Semi-supervised inference
UR - http://www.scopus.com/inward/record.url?scp=85099302730&partnerID=8YFLogxK
U2 - 10.23919/eusipco47968.2020.9287796
DO - 10.23919/eusipco47968.2020.9287796
M3 - Conference contribution
T3 - European Signal Processing Conference
SP - 895
EP - 899
BT - 28th European Signal Processing Conference, EUSIPCO 2020 - Proceedings
T2 - 28th European Signal Processing Conference, EUSIPCO 2020
Y2 - 24 August 2020 through 28 August 2020
ER -