Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index

Asaf Levi, Philip Shilane, Sarai Sheinvald, Gala Yadgar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In the realm of information retrieval, the need to maintain reliable term-indexing has grown more acute in recent years, with vast amounts of ever-growing online data searched by a large number of search-engine users and used for data mining and natural language processing. At the same time, an increasing portion of primary storage systems employ data deduplication, where duplicate logical data chunks are replaced with references to a unique physical copy. We show that indexing deduplicated data with deduplication-oblivious mechanisms might result in extreme inefficiencies: the index size would increase in proportion to the logical data size, regardless of its duplication ratio, consuming excessive storage and memory and slowing down lookups. In addition, the logically sequential accesses during index creation would be transformed into random and redundant accesses to the physical chunks. Indeed, to the best of our knowledge, term indexing is not supported by any deduplicating storage system. In this paper, we propose the design of IDEA, a deduplication-aware term-index that addresses these challenges. IDEA maps terms to the unique chunks that contain them, and maps each chunk to the files in which it is contained. This basic design concept improves the index performance and can support advanced functionalities such as inline indexing, result ranking, and proximity search. Our prototype implementation based on Lucene (the search engine at the core of Elasticsearch) shows that IDEA can reduce the index size and indexing time by up to 73% and 94%, respectively, and reduce term-lookup latency by up to 82% and 59% for single and multi-term queries, respectively.
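
The two-level mapping described in the abstract (terms to unique chunks, then chunks to files) can be illustrated with a minimal Python sketch. This is not the authors' Lucene-based prototype: the class name `ChunkAwareIndex`, the string chunk identifiers, and the whitespace tokenizer are all hypothetical choices made for illustration, and the sketch ignores terms that span chunk boundaries.

```python
from collections import defaultdict


class ChunkAwareIndex:
    """Toy two-level index: term -> unique chunks, chunk -> files (illustrative only)."""

    def __init__(self):
        self.term_to_chunks = defaultdict(set)   # term -> ids of unique chunks containing it
        self.chunk_to_files = defaultdict(set)   # chunk id -> ids of files referencing it
        self.chunks_tokenized = 0                # how many chunks were actually scanned

    def add_file(self, file_id, chunks):
        """Register a file given as (chunk_id, text) pairs.

        Assumption: chunk_id uniquely identifies the chunk content,
        as a fingerprint would in a deduplicating store.
        """
        for chunk_id, text in chunks:
            if chunk_id not in self.chunk_to_files:
                # A chunk is tokenized only the first time it is seen;
                # later duplicates just add a file reference.
                self.chunks_tokenized += 1
                for term in text.lower().split():
                    self.term_to_chunks[term].add(chunk_id)
            self.chunk_to_files[chunk_id].add(file_id)

    def lookup(self, term):
        """Return the set of files containing the term: term -> chunks -> files."""
        files = set()
        for chunk_id in self.term_to_chunks.get(term.lower(), ()):
            files |= self.chunk_to_files[chunk_id]
        return files


index = ChunkAwareIndex()
index.add_file("a.txt", [("c1", "hello world"), ("c2", "storage systems")])
index.add_file("b.txt", [("c1", "hello world"), ("c3", "search engine")])

print(sorted(index.lookup("hello")))   # ['a.txt', 'b.txt']
print(index.chunks_tokenized)          # 3
```

In this toy example the two files reference four chunks in total, but only the three unique chunks are tokenized, and the term `hello` is stored once against chunk `c1` rather than once per file. That is the intuition behind an index that grows with the physical rather than the logical data size.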

Original language: English
Title of host publication: Proceedings of the 22nd USENIX Conference on File and Storage Technologies, FAST 2024
Pages: 243-258
Number of pages: 16
ISBN (Electronic): 9781939133380
State: Published - 2024
Externally published: Yes
Event: 22nd USENIX Conference on File and Storage Technologies, FAST 2024 - Santa Clara, United States
Duration: 27 Feb 2024 to 29 Feb 2024

Publication series

Name: Proceedings of the 22nd USENIX Conference on File and Storage Technologies, FAST 2024

Conference

Conference: 22nd USENIX Conference on File and Storage Technologies, FAST 2024
Country/Territory: United States
City: Santa Clara
Period: 27/02/24 to 29/02/24

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture
  • Software
