Compact universal k-mer hitting sets

Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, Carl Kingsford

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

We address the problem of finding a minimum-size set of k-mers that hits L-long sequences. The problem arises in the design of compact hash functions and other data structures for efficient handling of large sequencing datasets. We prove that the problem of hitting a given set of L-long sequences is NP-hard and give a heuristic solution that finds a compact universal k-mer set that hits any set of L-long sequences. The algorithm, called DOCKS (design of compact k-mer sets), works in two phases: (i) finding a minimum-size k-mer set that hits every infinite sequence; (ii) greedily adding k-mers such that together they hit all remaining L-long sequences. We show that DOCKS works well in practice and produces a set of k-mers that is much smaller than a random choice of k-mers. We present results for various values of k and sequence lengths L and by applying them to two bacterial genomes show that universal hitting k-mers improve on minimizers. The software and exemplary sets are freely available at acgt.cs.tau.ac.il/docks/.

Original languageEnglish
Title of host publicationAlgorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings
EditorsMartin Frith, Christian Nørgaard Storm Pedersen
PublisherSpringer Verlag
Pages257-268
Number of pages12
ISBN (Print)9783319436807
DOIs
StatePublished - 1 Jan 2016
Event16th International Workshop on Algorithms in Bioinformatics, WABI 2016 - Aarhus, Denmark
Duration: 22 Aug 201624 Aug 2016

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume9838 LNCS

Conference

Conference16th International Workshop on Algorithms in Bioinformatics, WABI 2016
Country/TerritoryDenmark
CityAarhus
Period22/08/1624/08/16

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science

Fingerprint

Dive into the research topics of 'Compact universal k-mer hitting sets'. Together they form a unique fingerprint.

Cite this