Author Archives: Joseph Hughes

Recombination detection programs

A critical step before phylogenetic analysis and molecular selection analysis is to detect recombination and either remove recombinant sequences or partition the alignment into different spans that are recombination-free. The problem is that there are so many different recombination detection programs available. These have been nicely reviewed by Posada (2002) and some of the programs have been benchmarked by Kosakovsky Pond and Frost (2005).

I decided to compile all the programs that are available but there are so many that I am sure I have missed some. If you know of any others, please leave a comment below.

I have split the programs into the same four categories as Posada (2002), but some programs implement multiple methods so it gets a bit tricky. The size of the font relates to the number of citations each program/publication has received in Google Scholar. As you can see, there are some clear favourites.

You can click on the names of the programs and this should take you either to the publication abstract on PubMed or to the website for the software.


Short command lines for manipulating FASTQ and FASTA sequence files

I thought it was time for me to compile all the short commands that I use on a more or less regular basis to manipulate sequence files.

Convert a multi-line fasta to a single-line fasta
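
A minimal sketch with awk, assuming a hypothetical input file called multiline.fa (the demo file and its contents are made up for illustration). Header lines are printed unchanged and every sequence line that follows a header is joined onto one line:

```shell
# Tiny demo input (hypothetical file name and contents).
printf '>seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n' > multiline.fa

# Print each header as-is; concatenate the sequence lines that follow it.
# A newline is emitted before every header except the first, and once at the end.
awk '/^>/ { if (NR > 1) print ""; print; next } { printf "%s", $0 } END { print "" }' multiline.fa > singleline.fa

cat singleline.fa
```

The result has exactly two lines per record, which is what the later length one-liner relies on.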


To convert a fastq file to fasta in a single line using sed
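
One possible sed version, sketched under the assumption that every record is exactly four lines (no wrapped sequence or quality lines); reads.fastq is a hypothetical file name. It reads four lines at a time, turns the leading "@" into ">", and drops everything from the "+" separator line onwards:

```shell
# Tiny demo input with two reads (hypothetical file name and contents).
printf '@r1 desc\nACGT\n+\nIIII\n@r2\nGGCC\n+\n@@@@\n' > reads.fastq

# N;N;N   pull the next three lines into the pattern space (one full record)
# s/^@/>/ rewrite the header marker
# s/\n+.*// delete the separator line and the quality line
sed -n 'N;N;N;s/^@/>/;s/\n+.*//;p' reads.fastq > reads.fasta

cat reads.fasta
```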


Dirty way to count the number of sequences in a fastq
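
A sketch of the quick count with grep, using a hypothetical reads.fastq whose second quality line deliberately starts with "@":

```shell
# Two reads; the quality string of the second read begins with "@".
printf '@r1 desc\nACGT\n+\nIIII\n@r2\nGGCC\n+\n@@@@\n' > reads.fastq

# Count every line that starts with "@".
grep -c '^@' reads.fastq   # prints 3 here, although there are only 2 reads
```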

It’s dirty because sometimes the quality information line may also start with “@” so the number of sequences could be overestimated.

A more precise way is to count the lines and divide by four:
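
For example, with awk, again assuming strict four-line records and a hypothetical reads.fastq:

```shell
# Same demo file as before: two reads, eight lines.
printf '@r1 desc\nACGT\n+\nIIII\n@r2\nGGCC\n+\n@@@@\n' > reads.fastq

# NR holds the total number of lines once the END block runs.
awk 'END { print NR / 4 }' reads.fastq   # prints 2
```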

One liner to remove the description information from a fasta file and just keep the identifier
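
A minimal sed sketch, assuming the identifier is everything between ">" and the first whitespace character; described.fa is a hypothetical file name:

```shell
# Demo input: the first header carries a description, the second does not.
printf '>seq1 first sample\nACGT\n>seq2\nGGCC\n' > described.fa

# On header lines only, delete from the first whitespace to the end of the line.
sed '/^>/ s/[[:space:]].*//' described.fa > ids_only.fa

cat ids_only.fa
```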


Get all the identifier names from a fasta file
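
One way to do this with sed alone, sketched on a hypothetical described.fa; it prints only header lines, stripped of the ">" and of any description:

```shell
# Demo input (hypothetical file name and contents).
printf '>seq1 first sample\nACGT\n>seq2\nGGCC\n' > described.fa

# Capture the run of non-whitespace characters after ">" and print just that.
sed -n 's/^>\([^[:space:]]*\).*/\1/p' described.fa > names.txt

cat names.txt
```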


Extract sequences by their ID from a fasta file
For example, you want to get the sequences with id1 and id2 as identifiers
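
A sketch with awk, assuming the identifier is the first word of the header; seqs.fa is a hypothetical file name. A flag is switched on at each wanted header and off at any other header, so this also works when sequences span several lines:

```shell
# Demo input with three records (hypothetical file name and contents).
printf '>id1 desc\nACGT\n>id3\nTTTT\n>id2\nGG\nCC\n' > seqs.fa

# Load the wanted identifiers, then print lines while the flag is set.
awk -v ids='id1 id2' '
  BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
  /^>/  { id = substr($1, 2); keep = (id in want) }
  keep
' seqs.fa > subset.fa

cat subset.fa
```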

If you have a long list of identifiers in a file called ids.txt, then the following should do the trick:
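
A possible awk version, assuming ids.txt holds one identifier per line and is not empty (with an empty list, the `NR == FNR` test would wrongly treat the fasta file as the list); seqs.fa is a hypothetical file name:

```shell
# Demo inputs (hypothetical contents).
printf 'id1\nid2\n' > ids.txt
printf '>id1 desc\nACGT\n>id3\nTTTT\n>id2\nGG\nCC\n' > seqs.fa

# First file: remember each identifier. Second file: print wanted records.
awk '
  NR == FNR { want[$1] = 1; next }
  /^>/      { id = substr($1, 2); keep = (id in want) }
  keep
' ids.txt seqs.fa > subset_from_list.fa

cat subset_from_list.fa
```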


Convert from a two column text tab-delimited file (ID and sequence) to a fasta file
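
A minimal awk sketch, assuming exactly two tab-separated columns and a hypothetical input file table.tsv:

```shell
# Demo input: identifier, a tab, then the sequence.
printf 'id1\tACGT\nid2\tGGCC\n' > table.tsv

# Split on tabs; write the header line and then the sequence line.
awk -F '\t' '{ print ">" $1; print $2 }' table.tsv > from_table.fa

cat from_table.fa
```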


Get the length of a fasta sequence (the sequence must be on a single line)


I’ll update this when I find some more useful single-line commands for manipulating fastq and fasta files.
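
A sketch with awk that prints each identifier next to the length of the line that follows it; singleline.fa is a hypothetical file that already has one sequence line per record:

```shell
# Demo input in single-line form (hypothetical file name and contents).
printf '>seq1 first\nACGTTTGA\n>seq2\nGGCC\n' > singleline.fa

# Remember the identifier from the header, then report the sequence length.
awk '/^>/ { id = substr($1, 2); next } { print id "\t" length($0) }' singleline.fa > lengths.txt

cat lengths.txt
```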

Please post comments if you have some suggestions.



Parsing PubMed for email addresses in author affiliations


Recently, we wanted to send out a survey for the International Committee on Taxonomy of Viruses (ICTV) to a large number of authors who have recently published in a virology journal. Fortunately, PubMed stores author affiliations, and the email address is sometimes included in the affiliation. We decided to target the following journals: Journal of Virology, Journal of General Virology, Virology, Virus Research, Antiviral Research, Viruses and Journal of Medical Virology. A lot of the difficult work can be done using E-utilities to generate the URL for the search. As we may be retrieving a large number of emails, we need to retrieve the results from the URL query in batches. We then extract the affiliations and pull the email addresses out of them.
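
The email extraction step can be sketched in shell, assuming the affiliations have already been saved one per line in a hypothetical file affiliations.txt. The regular expression is an illustrative assumption rather than the one used in the full script, the sample affiliations are invented, and `grep -o` is an extension found in GNU and BSD grep rather than in POSIX:

```shell
# Invented affiliation strings; only two of the three contain an address.
printf '%s\n' \
  'Institute of Virology, Example University, Exampletown, UK. a.person@example.ac.uk.' \
  'Department of Microbiology, Another Institute, Somewhere.' \
  'Virus Lab, Third Place. Electronic address: b.person@example.org.' > affiliations.txt

# -o prints only the matching part; sort -u removes duplicate addresses.
grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' affiliations.txt | sort -u > emails.txt

cat emails.txt
```

The pattern must end in a dot followed by letters, so the full stop that often closes an affiliation is left out of the match.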

As we didn’t want to send all the emails off in one go, we split the output into multiple batches of 100 emails. Here’s the full code, also available as a Gist on GitHub:
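
The batching step on its own can be sketched with the standard `split` utility. This is not the code from the Gist, just a minimal illustration; the file names and the generated example.org addresses are made up for the demo:

```shell
# Generate 250 placeholder addresses, one per line.
awk 'BEGIN { for (i = 1; i <= 250; i++) printf "user%d@example.org\n", i }' > all_emails.txt

# Write consecutive chunks of 100 lines to batch_aa, batch_ab, batch_ac, ...
split -l 100 all_emails.txt batch_

wc -l batch_*
```

With 250 input lines this leaves two full batches and a final one of 50.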

Here are the email counts: Journal of Virology = 634; Journal of General Virology = 169; Virology = 546; Virus Research = 425; Antiviral Research = 252; Viruses = 892; Journal of Medical Virology = 0.

The Journal of Medical Virology doesn’t release the email addresses of authors, and if the information is not used responsibly, a number of other journals might go the same way, as discussed in “E-mail Address Harvesting on PubMed—A Call for Responsible Handling of E-mail Addresses”.

If you re-run this script, you might find a few more hits as more papers get published this year.