Shamela: A Large-Scale Historical Arabic Corpus

Yonatan Belinkov, Alexander Magidow, Maxim Romanov, Avi Shmidman, Moshe Koppel

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Arabic is a widely spoken language with a long and rich history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale historical corpus of Arabic comprising about 1 billion words from diverse periods. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case studies that show its application to the digital humanities.
Original language: Undefined/Unknown
Title of host publication: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
Place of Publication: Osaka, Japan
Pages: 45-53
Number of pages: 9
State: Published - 1 Dec 2016
