Parsing PubMed for email addresses in author affiliations
- Post by: Joseph Hughes
- October 13, 2014
Recently, we wanted to send out a survey for the International Committee on Taxonomy of Viruses (ICTV) to a large number of authors who have recently published in a virology journal. Fortunately, PubMed stores author affiliations, and the email address is sometimes included in the affiliation. We decided to target the following journals: Journal of Virology, Journal of General Virology, Virology, Virus Research, Antiviral Research, Viruses and Journal of Medical Virology.

A lot of the difficult work can be done using E-utilities to generate the URL for the search. As we may be retrieving a large number of emails, we need to retrieve the results from the URL query in batches. We then extract the affiliations from the results and pull the email address out of each affiliation with a regular expression.
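The batching and extraction steps can be tried without any network access. The following is only a minimal sketch: `batch_offsets` and `extract_email` are hypothetical helper names, the affiliation strings are made up, and the record count of 1200 is an arbitrary example. The regular expression assumes the email address is the last token of the affiliation, which will not hold for every record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the start offsets (retstart values) needed to page
# through $count records, $retmax records at a time.
sub batch_offsets {
    my ($count, $retmax) = @_;
    my @offsets;
    for (my $retstart = 0; $retstart < $count; $retstart += $retmax) {
        push @offsets, $retstart;
    }
    return @offsets;
}

# Hypothetical helper: return the email address found at the end of an
# affiliation string, or undef if there is none.
sub extract_email {
    my ($affiliation) = @_;
    if ($affiliation =~ /\s([a-zA-Z0-9._-]+\@[a-zA-Z0-9._-]+)$/) {
        my $email = $1;
        $email =~ s/\.$//;    # drop the full stop that often ends the affiliation
        return $email;
    }
    return;
}

# Made-up affiliation strings, for illustration only.
my @affiliations = (
    'Department of Virology, Example University, Exampletown, UK. a.author@example.ac.uk.',
    'Institute of Examples, Example City, Germany.',
);

# 1200 records fetched 500 at a time need three requests.
print join(',', batch_offsets(1200, 500)), "\n";

for my $affiliation (@affiliations) {
    my $email = extract_email($affiliation);
    print defined $email ? "$email\n" : "(no email found)\n";
}
```

Each offset would be passed to efetch as `retstart`, together with `retmax`, so that every request returns a different slice of the result set.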
As we didn’t want to send all the emails off in one go, we split the output into multiple batches of 100 emails. Here’s the full code, also available as a Gist on GitHub:
    #!/usr/bin/perl -w
    # A perlscript written by Joseph Hughes, University of Glasgow
    # use this perl script to parse the email addresses from the affiliations in PubMed
    use strict;
    use LWP::Simple;

    my ($query,@queries);
    #Query the Journal of Virology from 2014 until the present (use 3000)
    $query = 'journal+of+virology[journal]+AND+2014[Date+-+Publication]:3000[Date+-+Publication]';
    push(@queries,$query);
    #Journal of General Virology
    $query = 'journal+of+general+virology[journal]+AND+2014[Date+-+Publication]:3000[Date+-+Publication]';
    push(@queries,$query);
    #Virology
    $query = 'virology[journal]+AND+2014[Date+-+Publication]:3000[Date+-+Publication]';
    push(@queries,$query);
    #Archives of Virology
    $query = 'archives+of+virology[journal]+AND+2014[Date+-+Publication]:3000[Date+-+Publication]';
    push(@queries,$query);
    #Virus Research
    $query = 'virus+research[journal]+AND+2014[Date+-+Publication]:3000[Date+-+Publication]';
    push(@queries,$query);
    #Antiviral Research
    $query = 'antiviral+research[journal]+AND+2014[Date+-+Publication]:3000[Date+-+Publication]';
    push(@queries,$query);
    #Viruses
    $query = 'viruses[journal]+AND+2014[Date+-+Publication]:3000[Date+-+Publication]';
    push(@queries,$query);
    #Journal of Medical Virology
    $query = 'journal+of+medical+virology[journal]+AND+2014[Date+-+Publication]:3000[Date+-+Publication]';
    push(@queries,$query);

    # global variables
    my %emails;   #unique email addresses seen across all journals

    foreach my $query (@queries){
      #assemble the esearch URL
      my $base = '';
      my $url = $base . "esearch.fcgi?db=pubmed&term=$query&usehistory=y";
      #post the esearch URL
      my $output = get($url);
      die "esearch request failed for $query\n" unless defined $output;
      #parse WebEnv, QueryKey and Count (# records retrieved)
      my ($web)   = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
      my ($key)   = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
      my ($count) = $output =~ /<Count>(\d+)<\/Count>/;
      die "could not parse the esearch result for $query\n"
        unless defined $web && defined $key && defined $count;
      #retrieve data in batches of 500
      my $retmax = 500;
      for (my $retstart = 0; $retstart < $count; $retstart += $retmax) {
        my $efetch_url = $base ."efetch.fcgi?db=pubmed&WebEnv=$web";
        #retstart and retmax select the current batch; without them every
        #request would return the same records
        $efetch_url .= "&query_key=$key&retstart=$retstart&retmax=$retmax&retmode=xml";
        my $efetch_out = get($efetch_url);
        next unless defined $efetch_out;
        my @matches = $efetch_out =~ m(<Affiliation>(.*)</Affiliation>)g;
        #print "$_\n" for @matches;
        for my $match (@matches){
          #the email address, when present, is the last token of the affiliation
          if ($match=~/\s([a-zA-Z0-9\.\_\-]+\@[a-zA-Z0-9\.\_\-]+)$/){
            my $email=$1;
            $email=~s/\.$//;   #remove the trailing full stop
            $emails{$email}++;
          }
        }
      }
      #running total of unique emails after each journal
      my $cnt= keys %emails;
      print "$query\n$cnt\n";
    }

    print "Total number of emails: ";
    my $cnt= keys %emails;
    print "$cnt\n";

    #split the unique emails into batches of 100 and write one file per batch
    my @email = keys %emails;
    my @VAR;
    push @VAR, [ splice @email, 0, 100 ] while @email;
    my $batch=100;
    foreach my $VAR (@VAR){
      open(my $out, '>', "Set_$batch.txt") || die "Can't open file!\n";
      print $out join(",",@$VAR);
      close $out;
      $batch=$batch+100;
    }
Here are the email counts: Journal of Virology = 634; Journal of General Virology = 169; Virology = 546; Virus Research = 425; Antiviral Research = 252; Viruses = 892; Journal of Medical Virology = 0.
The Journal of Medical Virology doesn’t release the email addresses of authors, and if the information is not used responsibly, a number of other journals might go the same way, as discussed in “E-mail Address Harvesting on PubMed—A Call for Responsible Handling of E-mail Addresses”.
If you re-run this script, you might find a few more hits as more papers get published this year.
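The last step of the script, splitting the unique addresses into batches of 100, can also be tried on its own. This is a minimal sketch rather than the script's exact code: `split_into_batches` is a hypothetical helper name, the addresses are made up, and a batch size of 3 is used only to keep the output short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a list into consecutive batches of at most
# $size items and return them as a list of array references.
sub split_into_batches {
    my ($size, @items) = @_;
    my @batches;
    # splice removes up to $size items from the front of @items each time
    push @batches, [ splice @items, 0, $size ] while @items;
    return @batches;
}

# Seven made-up addresses, for illustration only.
my @emails = map { sprintf 'author%d@example.org', $_ } 1 .. 7;

my @batches = split_into_batches(3, @emails);
for my $i (0 .. $#batches) {
    printf "Batch %d: %s\n", $i + 1, join(',', @{ $batches[$i] });
}
```

With seven addresses and a batch size of 3, this gives two full batches and a final batch holding the one remaining address.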