The HHD Dataset

Irina Rabaev, Berat Kurar Barakat, Alexander Churkin, Jihad El-Sana

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Benchmark datasets are important in the field of document image processing, as they make it possible to analyze different approaches and compare their performance fairly. Benchmark datasets exist for several alphabets, such as Latin, Arabic and Chinese, but not for the Hebrew alphabet. In this paper, a handwritten Hebrew dataset, HHD, is introduced. The HHD dataset is collected from hand-filled forms and is accompanied by ground truth at the character, word and text line levels. Presently, the dataset contains around 1000 document images, and we continue to enlarge it. To the best of our knowledge, this is the first comprehensive corpus of Hebrew handwritten documents, and we believe it will help advance Hebrew document processing and document processing in general. The dataset can be useful for various research applications, such as word spotting, word recognition, text line alignment, and writer identification. The initial small subset of the HHD for character classification can be downloaded from https://www.cs.bgu.ac.il/~berat/data/hhd-dataset.zip together with the training and test set subdivisions. We also provide baseline results for character classification on this initial subset. In the near future, the full HHD dataset will be made freely available to the research community.

Original language: American English
Title of host publication: Proceedings - 2020 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020
Pages: 228-233
Number of pages: 6
ISBN (Electronic): 9781728199665
State: Published - 1 Sep 2020
Event: 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020 - Dortmund, Germany
Duration: 7 Sep 2020 → 10 Sep 2020

Publication series

Name: Proceedings of International Conference on Frontiers in Handwriting Recognition, ICFHR
Volume: 2020-September

Conference

Conference: 17th International Conference on Frontiers in Handwriting Recognition, ICFHR 2020
Country/Territory: Germany
City: Dortmund
Period: 7/09/20 → 10/09/20

Keywords

  • Ground truth
  • Handwritten document image dataset
  • Hebrew handwritten documents

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Computer Vision and Pattern Recognition
