The complexity of aggregates over extractions by regular expressions

Johannes Doleschal, Noa Bratman, Benny Kimelfeld, Wim Martens

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).
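As a rough illustration of the setting described in the abstract, the following Python sketch extracts a relation of spans with named capture groups and then aggregates over one column. It is not the paper's formalism or algorithm: the helper names `extract_spans` and `aggregate`, the sample text, and the pattern are all assumptions made for this example. Python's `re.finditer` returns only non-overlapping leftmost matches and reports 0-based, half-open spans, whereas a document spanner is defined over all matches, so this is a simplified stand-in.

```python
import re
import statistics


def extract_spans(pattern, text):
    """Return one row per match, mapping each capture variable to its span.

    Hypothetical helper: spans are Python-style (start, end) pairs,
    0-based and half-open.
    """
    rows = []
    for match in re.finditer(pattern, text):
        rows.append({name: match.span(name) for name in match.groupdict()})
    return rows


def aggregate(rows, text, var):
    """Interpret the text under each span of `var` as an integer and aggregate.

    Hypothetical helper: the median stands in for a quantile (the 0.5-quantile).
    """
    values = [int(text[start:end]) for start, end in (row[var] for row in rows)]
    return {
        "sum": sum(values),
        "avg": statistics.mean(values),
        "median": statistics.median(values),
    }


# Assumed sample document and regex formula with capture variables item and qty.
text = "apples: 3, pears: 10, plums: 5"
pattern = r"(?P<item>[a-z]+): (?P<qty>\d+)"

rows = extract_spans(pattern, text)
result = aggregate(rows, text, "qty")
print(rows[0])   # spans of the first extracted tuple
print(result)    # sum, average, and median of the extracted quantities
```

The sketch keeps extraction and aggregation as separate steps, which mirrors the abstract's framing of aggregate functions "on top of" the extracted relation. It says nothing about the complexity results themselves.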

Original language: English
Title of host publication: 24th International Conference on Database Theory, ICDT 2021
Editors: Ke Yi, Zhewei Wei
ISBN (Electronic): 9783959771795
State: Published - 1 Mar 2021
Event: 24th International Conference on Database Theory, ICDT 2021 - Nicosia, Cyprus
Duration: 23 Mar 2021 to 26 Mar 2021

Publication series

Name: Leibniz International Proceedings in Informatics, LIPIcs
Volume: 186

Conference

Conference: 24th International Conference on Database Theory, ICDT 2021
Country/Territory: Cyprus
City: Nicosia
Period: 23/03/21 to 26/03/21

Keywords

  • Aggregation functions
  • Document spanners
  • Information extraction
  • Regular expressions

All Science Journal Classification (ASJC) codes

  • Software
