Recursive programs for document spanners

Liat Peterfreund, Balder Ten Cate, Ronald Fagin, Benny Kimelfeld

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are regular expressions with capture variables. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (which extract relations that constitute the extensional database). This paper explores the expressive power of recursive Datalog over regex formulas. We show that such programs can express precisely the document spanners computable in polynomial time. We compare this expressiveness to known formalisms such as the closure of regex formulas under the relational algebra and string equality. Finally, we extend our study to a recently proposed framework that generalizes both the relational model and the document spanners.
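
The notions in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the formula `NEXT_FORMULA`, the relation names `Next` and `Reach`, and the helper functions are hypothetical, chosen only for illustration. Python's `re` module stands in for a regex formula with capture variables, and a naive fixpoint loop stands in for recursive Datalog evaluation. Note one assumption in particular: `finditer` returns only leftmost, non-overlapping matches, whereas a regex formula in the spanner framework yields every possible assignment of spans, so this is an approximation of the semantics rather than an implementation of it.

```python
import re
from typing import Set, Tuple

Span = Tuple[int, int]  # half-open character interval [start, end)
Pair = Tuple[Span, Span]

# Hypothetical regex formula with capture variables x and y:
# x is a word, y is the word that follows it (captured in a lookahead
# so that successive matches may share a word).
NEXT_FORMULA = re.compile(r"(?P<x>\w+)\s+(?=(?P<y>\w+))")


def extract_next(document: str) -> Set[Pair]:
    """Extensional relation Next(x, y): spans of adjacent words."""
    return {(m.span("x"), m.span("y")) for m in NEXT_FORMULA.finditer(document)}


def reach(next_rel: Set[Pair]) -> Set[Pair]:
    """Naive fixpoint for the recursive program
         Reach(x, y) :- Next(x, y).
         Reach(x, z) :- Reach(x, y), Next(y, z).
    """
    result = set(next_rel)
    while True:
        # Apply the recursive rule once to everything derived so far.
        derived = {(x, z) for (x, y) in result for (y2, z) in next_rel if y == y2}
        if derived <= result:  # nothing new: the fixpoint is reached
            return result
        result |= derived


if __name__ == "__main__":
    doc = "a b c"
    for x, y in sorted(reach(extract_next(doc))):
        print(x, y, repr(doc[x[0]:x[1]]), repr(doc[y[0]:y[1]]))
```

In this sketch the loop terminates because a document has only finitely many spans, so the relation over pairs of spans cannot grow indefinitely.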

Original language: English
Title of host publication: 22nd International Conference on Database Theory, ICDT 2019
Editors: Pablo Barcelo, Marco Calautti
ISBN (Electronic): 9783959771016
State: Published - Mar 2019
Event: 22nd International Conference on Database Theory, ICDT 2019 - Lisbon, Portugal
Duration: 26 Mar 2019 - 28 Mar 2019

Publication series

Name: Leibniz International Proceedings in Informatics, LIPIcs
Volume: 127

Conference

Conference: 22nd International Conference on Database Theory, ICDT 2019
Country/Territory: Portugal
City: Lisbon
Period: 26/03/19 - 28/03/19

Keywords

  • Datalog
  • Document spanners
  • Information extraction
  • Polynomial time
  • Recursion
  • Regular expressions

All Science Journal Classification (ASJC) codes

  • Software
