Blogging My Academic Life: Constructing Phylogenetic Trees

Phylogenetic trees are constructs that show the evolutionary relationships between nucleotide or amino acid sequences of homologous genes in different or the same species by analyzing the differences in the sequences and inferring the sequences of changes leading to the input data. Phylogenetic trees are usually constructed after sequencing a gene as a form of error checking: how does the constructed tree compare to the established phylogenetic relationships between the compared species?

For this example, the lab tech had cloned an heat shock protein 60 (Hsp60) cDNA from a grizzly bear cDNA library using primers designed against human Hsp60 (HSPD1 ). This cDNA was sequenced, and I was assigned to construct a phylogenetic tree. I will be compiling the known homologs of Hsp60 from NCBI Homologene, looking for other homologs not in the curated Homologene database using BLAST, aligning these sequences using ClustalX, generating a maximum parsimony tree using PHYLIP, and finally, rendering the tree using Archaeopteryx.

Hsp60 (AKA chaperonin, HSPD1) in mammals is primarily known as a mitochondrial protein that is involved in the import and proper folding of other mitochondrial proteins (see any number of reviews here). It is a highly conserved protein, homologs of which can be found in almost all forms of life, including bacteria, fungi, and plants. Recent research has found that Hsp60 can be found also in the cytosolic compartment (ref), where it is involved in the regulation of apoptosis (ref). Hsp60 is also found in peripheral circulation (ref). The concentrations of circulating Hsp60 may be indicative of some disease states, including coronary heart disease (ref), osteoporosis (ref), prostate cancer (ref), and psychosocial stress (ref). I'm interested in Hsp60 in ursids because it may be a good indicator of chronic stress status, which is one reason why we cloned and sequenced this protein.

First, I need to find sequences with which to compare ursid Hsp60. I will be looking for related sequences with BLAST.

We'll be using nucleotide BLAST.

I'll paste my ursid Hsp60 sequence into the big box and initiate a search in the non-redundant nucleotide database.

And I get results! But these don't look like Hsp60 sequences.

It doesn't matter at this point, anyways. I'll just download all of these sequences by scrolling down and checking the "Select all sequences" box and clicking "Get Selected Sequences".

That'll bring me to this screen.

To grab the FASTA-formatted sequences, click on "Display Settings" and then "FASTA (text)". Copy and paste the sequences to a text document. At this point, if you don't like the FASTA headers (for example: >gi|10923782|ref|XM_1281221) that came with the sequences, you should change them now to avoid headaches later. Just replace everything after the bracket (">") with whatever you like. I prefer using the genus or species name and the gene abbreviation (for example: ">UrsusHsp60"). Avoid spaces because some programs don't like them. Additionally, I like to remove redundant cDNA sequences that are derived from different tissues in the same species if I know that the sequences aren't tissue specific or don't care about intra-species differences.

So I didn't actually find any Hsp60 nucleotide sequences using BLAST. This highly suggests that there's something wrong with my sequence, such as we cloned the wrong gene! To confirm, I could get known Hsp60 sequences by looking in a curated database. In some cases, you might prefer to skip the BLAST step and begin in the curated database. I like to use NCBI Gene and Homologene.

Searching for HSPD1 in Gene gives us a list of Hsp60 genes in different species. Homo sapiens HSPD1 is a good place to start (human genes are generally better characterized than most other species). Clicking on the link brings us to the curated human HSPD1 page, which is full of useful information about the gene.

The Homologene link is near the bottom right of the window.

We can download the mRNA sequences of HSPD1 for all of these species by clicking the "Download" link on the right, and then selecting to download the mRNA sequences from the dropdown menu.

Copy and paste the resulting list to the text document we created earlier and edit the sequences as necessary. Now, the phylogenetics package PHYLIP requires input in the form of a file containing the multiple sequence alignment (MSA) of all the Hsp60 sequences. We'll do this in ClustalX 2.0.12. Load the file containing all the FASTA-formatted sequences.

The sequences will load up and look something like this. They're not aligned yet, so select "Do Complete Alignment" from the Alignment menu. You can choose where to save the output.

Here are the results. You can scroll to the right to see the rest of the alignment and/or open the .aln file.

If everything looks good, save the alignment as a PHYLIP file. It may give you a dialogue box warning you that the FASTA headers will be truncated. That's fine.

Here is what the PHYLIP-formatted MSA looks like.

We can also construct a phylogenetic tree in ClustalX, but the only supported methods are the distance-based neighbour-joining (N-J) and unweighted pair group method with arithmetic mean (UPGMA). These methods are fairly fast in comparison to character-based or Bayesian methods, but they're generally considered to be inferior in that they don't necessarily use all the information that is available in the sequences to construct their trees. Additionally, ClustalX supports bootstrapping with N-J trees. Bootstrapping is a statistical method that runs the tree building simulation over a user-specified number of iterations. With each iteration, the program will randomly re-sample the MSA to generate the initial unresolved tree from which the program bases its distance calculations. The final bootstrapped tree reports a bootstrap statistic for each branch, which indicates the proportion of bootstrap iterations that results in the reported branch. A higher proportion indicates higher confidence in the tree, although I don't think there's a standardized cut-off. Generally, most bootstrap simulations are run with 1000 or more iterations. In ClustalX, select "Bootstrap N-J Tree" from the Trees menu.

The Random number generator seed is a number that is used in the pseudo-random number generator to re-sample the MSA. I don't believe that the chosen number really affects anything, so just choose any number. The output .phb file looks like this:

This can be interpreted into a tree in Archaeopteryx.

We can see from the tree that our ursid Hsp60 sequence is clearly not closely related to the other Hsp60 sequences. In fact, it looks like we sequenced HspA14, which is a poorly characterized member of the Hsp70 family. We can play with the tree a bit to make it prettier, but I would rather calculate a maximum parsimony tree in PHYLIP before I bother.

PHYLIP is a command-line program with no GUI. I'm going to use the dnapars (maximum parsimony for DNA) package to find the most parsimonious phylogenetic tree. In this workflow, I will take the ClustalX MSA and use the seqboot package to randomly re-sample the MSA for a desired number of times (1000 bootstrap iterations, in this case). The sequence of the MSA will be re-ordered while keeping the alignment intact, or in other words, splitting the alignment into blocks of a specific size and randomly mixing them to form 1000 new sequences with the alignments intact. I'll feed these re-sampled MSAs into dnapars, and it'll construct a tree for each alignment. I'll then take these 1000 trees and construct a consensus tree in the consense package. Finally, I'll use Archaeopteryx to visualize the tree. When I first start seqboot, it'll ask for the input file (the ClustalX MSA):

Then it'll give you a menu with a few options. The major ones to worry about are:

J – Bootstrap, Jackknife, Permute, Rewrite? We want bootstrap.
R – How many replicates? Anything above 1000 iterations is good. You may want to run a bootstrap trial with less than 100 iterations through dnapars and consense to make sure everything looks alright before committing the time for the full 1000 iterations.

seqboot will then ask for a random number seed and the output file name (I'm using blastseqboot). When it's done, we can move on to dnapars.

dnapars will ask for the input file (the pre-bootstrapped output file from seqboot).

Then it'll ask for an output file name.

Then it'll give you a menu full of options. The important ones are:

J – Randomize input order of sequences? This has already been done in seqboot, so select No.

O – Outgroup root? If you have an outgroup, then select the species number (in order of the input file). The outgroup should ideally be a somewhat distantly related sequence, but not so distant as to have no similarity to the sequences to be analyzed. In this example, I may use one of the bacterial Hsp60 sequences as the outgroup. There's also no problem with having an unrooted tree in most cases, but dnapars will use species 1 as default if you don't select one yourself

M – Analyze multiple data sets? Since we're doing 1000 bootstrap simulations, set this to 1000 data sets.

Maximum parsimony is quite a bit slower than N-J. Bootstrapping 1000 N-J simulations takes about a minute on my old early 2006 Macbook. Bootstrapping 1000 MP simulations on the same machine will take about three and a half hours.

When it's finally done, it will spit out a file with all 1000 generated phylogenetic trees. We now need to find the consensus tree using the consense package. Again, it will ask for the input file (the very large file containing all the output trees) and give a menu of options. It should be self explanatory.

Now to visualize the output in Archaeopteryx:

The blue numbers indicate the bootstrap statistics. It looks like all the HspA14 (and the ursid Hsp60) genes are clustered together, away from the actual Hsp60 sequences. If you want to edit anything, you'll have to export the file and edit it in a text editor. I prefer editing the tree as PhyloXML because markup is easier to read than NEWICK or NEXUS formats. You'll generally want the bootstrap statistics as a percentage, and we'll need to hand edit the clade names.

After some edits, it'll look like this:

And that's it!

Blogging My Academic Life

Sunday, 1 May 2011

Constructing Phylogenetic Trees

No comments:

Post a Comment