Cross-lingual training of summarization systems using annotated corpora in a foreign language

Marina Litvak, Mark Last

Research output: Contribution to journal › Article › peer-review

Abstract

The increasing trend of cross-border globalization and acculturation requires text summarization techniques to work equally well for multiple languages. However, only some automated summarization methods can be defined as "language-independent," i.e., not based on any language-specific knowledge. Such methods can be used for multilingual summarization, defined in Mani (Automatic summarization. Natural language processing. John Benjamins Publishing Company, Amsterdam, 2001) as "processing several languages, with a summary in the same language as input," but their performance is usually unsatisfactory due to the exclusion of language-specific knowledge. Moreover, supervised machine learning approaches need training corpora in multiple languages that are usually unavailable for rare languages, and their creation is a very expensive and labor-intensive process. In this article, we describe cross-lingual methods for training an extractive single-document text summarizer called MUSE (MUltilingual Sentence Extractor): a supervised approach based on the linear optimization of a rich set of sentence ranking measures using a Genetic Algorithm. We evaluated MUSE's performance on documents in three different languages: English, Hebrew, and Arabic, using several training scenarios. The summarization quality was measured using ROUGE-1 and ROUGE-2 Recall metrics. The results of the extensive comparative analysis showed that the performance of MUSE was better than that of the best known multilingual approach (TextRank) in all three languages. Moreover, our experimental results suggest that using the same sentence ranking model across languages results in a reasonable summarization quality, while saving considerable annotation effort for the end user. On the other hand, using parallel corpora generated by machine translation tools may improve the performance of a MUSE model trained on a foreign language. A comparative evaluation of an alternative optimization technique, Multiple Linear Regression, justifies the use of a Genetic Algorithm.
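The core idea described in the abstract — ranking sentences by a linear combination of scoring measures whose weights are tuned by a Genetic Algorithm — can be illustrated with a minimal sketch. This is not the authors' MUSE implementation: the feature values, the toy fitness function (precision of the top-k ranked sentences against gold labels, standing in for ROUGE-based fitness), and all GA parameters below are illustrative assumptions.

```python
import random

# Toy sentence feature vectors: [position score, length score, keyword score],
# paired with a gold label (1 = sentence belongs in the reference summary).
# The values are made up purely for illustration.
SENTENCES = [
    ([0.9, 0.5, 0.8], 1),
    ([0.7, 0.9, 0.6], 1),
    ([0.4, 0.3, 0.2], 0),
    ([0.2, 0.8, 0.1], 0),
    ([0.1, 0.4, 0.3], 0),
]

def rank_score(weights, features):
    """Linear combination of sentence ranking measures."""
    return sum(w * f for w, f in zip(weights, features))

def fitness(weights, top_k=2):
    """Fraction of gold-summary sentences among the top-k ranked ones
    (a crude stand-in for a ROUGE-based fitness function)."""
    ranked = sorted(SENTENCES, key=lambda s: rank_score(weights, s[0]),
                    reverse=True)
    return sum(label for _, label in ranked[:top_k]) / top_k

def genetic_search(pop_size=20, generations=30, seed=0):
    """Evolve a population of weight vectors toward higher fitness."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, 3)
            child = a[:cut] + b[cut:]             # one-point crossover
            if rng.random() < 0.2:                # Gaussian mutation
                i = rng.randrange(3)
                child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = genetic_search()
```

In the actual system, the feature set is much richer (31 ranking measures in the paper's setting) and fitness is computed from ROUGE scores against annotated summaries; the sketch only shows the shape of the optimization loop.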

Original language: American English
Pages (from-to): 629-656
Number of pages: 28
Journal: Information Retrieval
Volume: 16
Issue number: 5
DOIs
State: Published - 1 Oct 2013

Keywords

  • Cross-lingual training
  • Genetic Algorithm
  • Multilingual summarization

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Library and Information Sciences
