TY - GEN
T1 - Example-based cross-modal denoising
AU - Segev, Dana
AU - Schechner, Yoav Y.
AU - Elad, Michael
PY - 2012
Y1 - 2012
N2 - Widespread current cameras are part of multisensory systems with an integrated computer (smartphones). Computer vision thus starts evolving to cross-modal sensing, where vision and other sensors cooperate. This exists in humans and animals, reflecting nature, where visual events are often accompanied with sounds. Can vision assist in denoising another modality? As a case study, we demonstrate this principle by using video to denoise audio. Unimodal (audio-only) denoising is very difficult when the noise source is non-stationary, complex (e.g., another speaker or music in the background), strong and not individually accessible in any modality (unseen). Cross-modal association can help: a clear video can direct the audio estimator. We show this using an example-based approach. A training movie having clear audio provides cross-modal examples. In testing, cross-modal input segments having noisy audio rely on the examples for denoising. The video channel drives the search for relevant training examples. We demonstrate this in speech and music experiments.
UR - http://www.scopus.com/inward/record.url?scp=84866696873&partnerID=8YFLogxK
U2 - https://doi.org/10.1109/CVPR.2012.6247712
DO - https://doi.org/10.1109/CVPR.2012.6247712
M3 - Conference contribution
SN - 9781467312264
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 486
EP - 493
BT - 2012 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2012
T2 - 2012 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2012
Y2 - 16 June 2012 through 21 June 2012
ER -