Update Kraken databases

Kraken is a very good k-mer-based taxonomic classification tool, and I frequently use it for viral signal detection in metagenomic samples. It ships with a number of useful scripts, including ones for updating its databases. Since NCBI restructured its FTP site and decided to phase out GenBank Identifiers (GIs), the default Kraken database update scripts no longer work. So I wrote a small Python script to update Kraken databases, which other Kraken users might find useful.

This script automatically downloads:

  • Human genome – most recent assembly version
  • Complete bacterial reference genomes
  • Complete viral reference genomes
  • Archaeal genomes
  • Reference plasmid sequences
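
One way such a selection could work is by filtering NCBI's tab-separated assembly summary tables for complete, current assemblies. The sketch below is not taken from the script: the function name `complete_genome_paths`, the trimmed column set, and the sample rows are all made up for illustration. It assumes the `assembly_level`, `version_status` and `ftp_path` columns that NCBI's `assembly_summary.txt` files provide.

```python
import csv
import io


def complete_genome_paths(lines):
    """Return ftp_path values for latest, complete-genome assemblies.

    `lines` is any iterable of text lines from an assembly-summary-style
    table: tab-separated, with '#'-prefixed header lines.
    (Hypothetical helper, not part of the original script.)
    """
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # The last '#' line before the data holds the column names.
            header = line.lstrip("#").strip().split("\t")
            continue
        rows.append(line)
    if header is None:
        raise ValueError("no header line found")

    reader = csv.DictReader(
        rows, fieldnames=header, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        row["ftp_path"]
        for row in reader
        if row["assembly_level"] == "Complete Genome"
        and row["version_status"] == "latest"
    ]


# Made-up rows for illustration only; real tables have many more columns.
SAMPLE = (
    "#   made-up example table\n"
    "# assembly_accession\tversion_status\tassembly_level\tftp_path\n"
    "GCF_000000001.1\tlatest\tComplete Genome\tftp://example.org/GCF_000000001.1\n"
    "GCF_000000002.1\tlatest\tScaffold\tftp://example.org/GCF_000000002.1\n"
    "GCF_000000003.1\treplaced\tComplete Genome\tftp://example.org/GCF_000000003.1\n"
)

paths = complete_genome_paths(io.StringIO(SAMPLE))
print(paths)
```

Only the first sample row passes both filters, so the scaffold-level and replaced assemblies are skipped. Looking columns up by name rather than by position keeps the filter working if NCBI adds columns to the table.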

The script takes one optional command-line argument: the target directory where the data should be downloaded and saved. By default, all files are downloaded to the present working directory.
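
As a minimal sketch of that behaviour (not the script's actual code), the target directory could be resolved as shown below. The helper name `resolve_target_dir` is hypothetical, and creating the directory when it is missing is an assumption rather than something the script is known to do.

```python
import os


def resolve_target_dir(argv):
    """Return the download directory.

    Uses argv[1] when it is supplied, otherwise the present working
    directory. (Hypothetical helper for illustration.)
    """
    target = argv[1] if len(argv) > 1 else os.getcwd()
    target = os.path.abspath(target)
    # Assumption: create the directory if it does not exist yet.
    os.makedirs(target, exist_ok=True)
    return target


# With no extra argument, the present working directory is used.
print(resolve_target_dir(["UpdateKrakenDatabases.py"]))
```

In a real script the list passed in would be `sys.argv`, so running `python UpdateKrakenDatabases.py /path/to/kraken_data` (an example path) would save everything under that directory.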

To change the default genome downloads, e.g. to download the mouse reference genome instead of the human one, please make the necessary changes in the code. Feel free to fork this script on GitHub.

Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php

  • Guille Palou

    Hello Sed Modha, I have been using your script but at some point the following error appears:

    sys:1: DtypeWarning: Columns (20) have mixed types. Specify dtype option on import or set low_memory=False.
    Traceback (most recent call last):
      File "./UpdateKrakenDatabases.py", line 118, in <module>
        get_fasta_in_kraken_format('human_genome.fa')
      File "./UpdateKrakenDatabases.py", line 98, in get_fasta_in_kraken_format
        for seq_record in records:
      File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/SeqIO/__init__.py", line 600, in parse
        for r in i:
      File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/GenBank/Scanner.py", line 478, in parse_records
        record = self.parse(handle, do_features)
      File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/GenBank/Scanner.py", line 462, in parse
        if self.feed(handle, consumer, do_features):
      File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/GenBank/Scanner.py", line 430, in feed
        self._feed_header_lines(consumer, self.parse_header())
      File "/aplic/GOOLF/1.6.10/Python/3.3.2/lib/python3.3/site-packages/Bio/GenBank/Scanner.py", line 1436, in _feed_header_lines
        structured_comment_key = re.search(r"([^#]+){0}$".format(STRUCTURED_COMMENT_START), data).group(1)
    AttributeError: 'NoneType' object has no attribute 'group'

    Do you know what it can be?

    Thank you very much,

    Guillermo