Convert NCBI Protein GI to Genome Accession

A few days back I posted a question on BioStars about getting genome accession numbers for a list of protein GIs. I had a long list of protein GI and I wanted the genome accession number for each protein GI (if there is one in NCBI databases) but without downloading files for each protein GI in genbank or xml format.

One way to  do this is use db2db. However you can only use db2db if you have a list of protein accession number for the protein GIs of interest. Also I wanted to include this step as part of a pipeline and automate it. db2db is a web based approach so doesn’t allow for easy automation.

I wrote the following script that first uses NCBI utilities to convert the list of protein GI to nucleotide GI and then fetches genome accession numbers for those nucleotide GIs.

#!/usr/bin/perl

use Bio::DB::EUtilities;

#my @ids     = qw(817524604 726965494);

my $infile = $ARGV[0];

my @ids;

open (IN,"$infile")||die "can't open $infile\n";

while(<IN>)
{
  chomp($_);
  my @ids=$_;
# print @ids."\n";	
  my $factory = Bio::DB::EUtilities->new(-eutil          => 'elink',
                                       -email          => 'mymail@foo.bar',
                                       -db             => 'nucleotide',
                                       -dbfrom         => 'protein',
                                       -correspondence => 1,
                                       -id             => \@ids);
 
  # iterate through the LinkSet objects
  while (my $ds = $factory->next_LinkSet) 
  {
    #print "   Link name: ",$ds->get_link_name,"\n";
    my $protid = join(',',$ds->get_submitted_ids);
    print "Protein ID:" . $protid ."\t";
    #print "Protein ID: ",join(',',$ds->get_submitted_ids),"\t";
    my $nucid = join(',',$ds->get_ids);
    print "Nuc ID:" . $nucid ."\t";
    my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
        				   -db      => 'nucleotide',
                                       	   -id      => $nucid,
                                       	   -email   => 'mymail@foo.bar',
                                       	   -rettype => 'acc');
   my @accs = split(m{\n},$factory->get_Response->content);
   print "Genome Accession: " .join(',',@accs), "\n";
   }
}

This script will take a file with the list of protein GI as an input and can be run as

perl ProteinGIToGenomeAcc.pl test_gi_list

where test_gi_list file contains the following protein GIs

752901017
675510735
674269272
360086542

This command should provide the following output.

Protein ID:752901017 Nuc ID:752901008 Genome Accession: NC_026421.1
Protein ID:675510735 Nuc ID:675510705 Genome Accession: NC_024771.1
Protein ID:674269272 Nuc ID:674269192 Genome Accession: NC_024696.1
Protein ID:360086542 Nuc ID:360086494 Genome Accession: NC_016448.1
Categories: Uncategorized