Analyzing phonetic and graphemic representations in end-to-end automatic speech recognition

Yonatan Belinkov, Ahmed Ali, James Glass

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions. In contrast to modular ASR systems, which contain separately trained components for acoustic modeling, pronunciation lexicon, and language modeling, the end-to-end paradigm is conceptually simpler and has the potential benefit of training the entire system on the end task. However, such neural network models are more opaque: it is not clear how to interpret the role of different parts of the network or what information the network learns during training. In this paper, we analyze the learned internal representations in an end-to-end ASR model. We evaluate the representation quality in terms of several classification tasks, comparing phonemes and graphemes, as well as different articulatory features. We study two languages (English and Arabic) and three datasets, finding remarkable consistency in how different properties are represented in different layers of the deep neural network.

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Pages: 81-85
Number of pages: 5
Volume: 2019-September
DOIs
State: Published - 2019
Externally published: Yes
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: 15 Sep 2019 to 19 Sep 2019

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Conference

Conference: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Country/Territory: Austria
City: Graz
Period: 15/09/19 to 19/09/19

Keywords

  • Analysis
  • End-to-end
  • Graphemes
  • Interpretability
  • Phonemes
  • Speech recognition

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
