Abstract
A basic task in bioinformatics is the counting of k-mers in genome strings. The k-mer counting problem is to build a histogram of all substrings of length k in a given genome sequence. We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for \(k\ge 32\). Given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important. While existing k-mer counting tools suffer from excessive memory resource consumption or degrading performance for large k, Gerbil is able to efficiently support large k without much loss of performance. Our software implements a two-disk approach. In the first step, DNA reads are loaded from disk and distributed to temporary files that are stored at a working disk. In a second step, the temporary files are read again, split into k-mers and counted via a hash table approach. In addition, Gerbil can optionally use GPUs to accelerate the counting step. For large k, we outperform state-of-the-art open source k-mer counting tools by up to a factor of 4 for large genome data sets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
The tool is named Gerbil because of its modest resource requirements, which it has in common with the name-giving mammal.
References
Chikhi, R., Medvedev, P.: Informed and automated \(k\)-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2014)
Deorowicz, S., Debudaj-Grabysz, A., Grabowski, S.: Disk-based \(k\)-mer counting on a PC. BMC Bioinform. 14(1), 1–12 (2013)
Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal \(k\)-mer counting. Bioinformatics 31(10), 1569–1576 (2015)
Li, Y., et al.: MSPKmerCounter: a fast and memory efficient approach for\(k\)-mer counting (2015). arXiv preprint arXiv:1505.06550
Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of \(k\)-mers. Bioinformatics 27(6), 764–770 (2011)
Melsted, P., Pritchard, J.K.: Efficient counting of \(k\)-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 1–7 (2011)
Rizk, G., Lavenier, D., Chikhi, R.: DSK: \(k\)-mer counting with very low memory usage. Bioinformatics 29(5), 652–653 (2013)
Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)
Roberts, M., Hunt, B.R., Yorke, J.A., Bolanos, R.A., Delcher, A.L.: A preprocessor for shotgun assembly of large genomes. J. Comput. Biol. 11(4), 734–752 (2004)
Roy, R.S., Bhattacharya, D., Schliep, A.: Turtle: identifying frequent \(k\)-mers with cache-efficient algorithms. Bioinformatics 30(14), 1950–1957 (2014)
Sameith, K., Roscito, J.G., Hiller, M.: Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Briefings in Bioinformatics (2016). http://dx.org/10.1093/bib/bbw003
Suzuki, S., Kakuta, M., Ishida, T., Akiyama, Y.: Accelerating identification of frequent \(k\)-mers in DNA sequences with GPU. In: GTC 2014 (2014)
Xavier, B.B., Sabirova, J., Pieter, M., Hernalsteens, J.P., de Greve, H., Goossens, H., Malhotra-Kumar, S.: Employing whole genome mapping for optimal de novo assembly of bacterial genomes. BMC Res. Notes 7(1), 1–4 (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Erbert, M., Rechner, S., Müller-Hannemann, M. (2016). Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support. In: Frith, M., Storm Pedersen, C. (eds) Algorithms in Bioinformatics. WABI 2016. Lecture Notes in Computer Science(), vol 9838. Springer, Cham. https://doi.org/10.1007/978-3-319-43681-4_12
Download citation
DOI: https://doi.org/10.1007/978-3-319-43681-4_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43680-7
Online ISBN: 978-3-319-43681-4
eBook Packages: Computer ScienceComputer Science (R0)