Linguistic Knowledge Within Handwritten Text Recognition Models: A Real-World Case Study

Samuel Londner, Yoav Phillips, Hadar Miller, Nachum Dershowitz, Tsvi Kuflik, Moshe Lavee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

State-of-the-art handwritten text recognition models make frequent use of deep neural networks, with recurrent and connectionist temporal classification layers, which perform recognition over sequences of characters. This architecture may lead to the model learning statistical linguistic features of the training corpus, over and above graphic features. This in turn could lead to degraded performance if the evaluation dataset language differs from the training corpus language. We present a fundamental study aiming to understand the inner workings of OCR models and further our understanding of the use of RNNs as decoders. We examine a real-world example of two graphically similar medieval documents but in different languages: rabbinical Hebrew and Judeo-Arabic. We analyze, computationally and linguistically, the cross-language performance of the models over these documents, so as to gain some insight into the implicit language knowledge the models may have acquired. We find that the implicit language model impacts the final word error by around 10%. A combined qualitative and quantitative analysis allows us to isolate manifest linguistic hallucinations. However, we show that leveraging a pretrained (Hebrew, in our case) model allows one to boost the OCR accuracy for a resource-scarce language (such as Judeo-Arabic). All our data, code, and models are openly available at https://github.com/anutkk/ilmja.

Original language: American English
Title of host publication: Document Analysis and Recognition – ICDAR 2023 - 17th International Conference, Proceedings
Editors: Gernot A. Fink, Rajiv Jain, Koichi Kise, Richard Zanibbi
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 147-164
Number of pages: 18
ISBN (Print): 9783031416842
State: Published - 2023
Event: 17th International Conference on Document Analysis and Recognition, ICDAR 2023 - San José, United States
Duration: 21 Aug 2023 → 26 Aug 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14190 LNCS

Conference

Conference: 17th International Conference on Document Analysis and Recognition, ICDAR 2023
Country/Territory: United States
City: San José
Period: 21/08/23 → 26/08/23

Keywords

  • Handwritten text recognition
  • Hebrew manuscripts
  • Language model
  • Optical character recognition
  • Transfer learning

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
