For this example, the lab tech had cloned an heat shock protein 60 (Hsp60) cDNA from a grizzly bear cDNA library using primers designed against human Hsp60 (HSPD1). This cDNA was sequenced, and I was assigned to construct a phylogenetic tree. I will be compiling the known homologs of Hsp60 from NCBI Homologene, looking for other homologs not in the curated Homologene database using BLAST, aligning these sequences using ClustalX, generating a maximum parsimony tree using PHYLIP, and finally, rendering the tree using Archaeopteryx.
Hsp60 (AKA chaperonin, HSPD1) in mammals is primarily known as a mitochondrial protein that is involved in the import and proper folding of other mitochondrial proteins (see any number of reviews here). It is a highly conserved protein, homologs of which can be found in almost all forms of life, including bacteria, fungi, and plants. Recent research has found that Hsp60 can be found also in the cytosolic compartment (ref), where it is involved in the regulation of apoptosis (ref). Hsp60 is also found in peripheral circulation (ref). The concentrations of circulating Hsp60 may be indicative of some disease states, including coronary heart disease (ref), osteoporosis (ref), prostate cancer (ref), and psychosocial stress (ref). I'm interested in Hsp60 in ursids because it may be a good indicator of chronic stress status, which is one reason why we cloned and sequenced this protein.
First, I need to find sequences with which to compare ursid Hsp60. I will be looking for related sequences with BLAST.
We'll be using nucleotide BLAST.
To grab the FASTA-formatted sequences, click on "Display Settings" and then "FASTA (text)". Copy and paste the sequences to a text document. At this point, if you don't like the FASTA headers (for example: >gi|10923782|ref|XM_1281221) that came with the sequences, you should change them now to avoid headaches later. Just replace everything after the bracket (">") with whatever you like. I prefer using the genus or species name and the gene abbreviation (for example: ">UrsusHsp60"). Avoid spaces because some programs don't like them. Additionally, I like to remove redundant cDNA sequences that are derived from different tissues in the same species if I know that the sequences aren't tissue specific or don't care about intra-species differences.
So I didn't actually find any Hsp60 nucleotide sequences using BLAST. This highly suggests that there's something wrong with my sequence, such as we cloned the wrong gene! To confirm, I could get known Hsp60 sequences by looking in a curated database. In some cases, you might prefer to skip the BLAST step and begin in the curated database. I like to use NCBI Gene and Homologene.
Searching for HSPD1 in Gene gives us a list of Hsp60 genes in different species. Homo sapiens HSPD1 is a good place to start (human genes are generally better characterized than most other species). Clicking on the link brings us to the curated human HSPD1 page, which is full of useful information about the gene.
The Homologene link is near the bottom right of the window.
Copy and paste the resulting list to the text document we created earlier and edit the sequences as necessary. Now, the phylogenetics package PHYLIP requires input in the form of a file containing the multiple sequence alignment (MSA) of all the Hsp60 sequences. We'll do this in ClustalX 2.0.12. Load the file containing all the FASTA-formatted sequences.
The sequences will load up and look something like this. They're not aligned yet, so select "Do Complete Alignment" from the Alignment menu. You can choose where to save the output.
If everything looks good, save the alignment as a PHYLIP file. It may give you a dialogue box warning you that the FASTA headers will be truncated. That's fine.
We can also construct a phylogenetic tree in ClustalX, but the only supported methods are the distance-based neighbour-joining (N-J) and unweighted pair group method with arithmetic mean (UPGMA). These methods are fairly fast in comparison to character-based or Bayesian methods, but they're generally considered to be inferior in that they don't necessarily use all the information that is available in the sequences to construct their trees. Additionally, ClustalX supports bootstrapping with N-J trees. Bootstrapping is a statistical method that runs the tree building simulation over a user-specified number of iterations. With each iteration, the program will randomly re-sample the MSA to generate the initial unresolved tree from which the program bases its distance calculations. The final bootstrapped tree reports a bootstrap statistic for each branch, which indicates the proportion of bootstrap iterations that results in the reported branch. A higher proportion indicates higher confidence in the tree, although I don't think there's a standardized cut-off. Generally, most bootstrap simulations are run with 1000 or more iterations. In ClustalX, select "Bootstrap N-J Tree" from the Trees menu.
This can be interpreted into a tree in Archaeopteryx.
PHYLIP is a command-line program with no GUI. I'm going to use the dnapars (maximum parsimony for DNA) package to find the most parsimonious phylogenetic tree. In this workflow, I will take the ClustalX MSA and use the seqboot package to randomly re-sample the MSA for a desired number of times (1000 bootstrap iterations, in this case). The sequence of the MSA will be re-ordered while keeping the alignment intact, or in other words, splitting the alignment into blocks of a specific size and randomly mixing them to form 1000 new sequences with the alignments intact. I'll feed these re-sampled MSAs into dnapars, and it'll construct a tree for each alignment. I'll then take these 1000 trees and construct a consensus tree in the consense package. Finally, I'll use Archaeopteryx to visualize the tree. When I first start seqboot, it'll ask for the input file (the ClustalX MSA):
Then it'll give you a menu with a few options. The major ones to worry about are:
J – Bootstrap, Jackknife, Permute, Rewrite? We want bootstrap.
R – How many replicates? Anything above 1000 iterations is good. You may want to run a bootstrap trial with less than 100 iterations through dnapars and consense to make sure everything looks alright before committing the time for the full 1000 iterations.
seqboot will then ask for a random number seed and the output file name (I'm using blastseqboot). When it's done, we can move on to dnapars.
Then it'll ask for an output file name.
Then it'll give you a menu full of options. The important ones are:
J – Randomize input order of sequences? This has already been done in seqboot, so select No.
O – Outgroup root? If you have an outgroup, then select the species number (in order of the input file). The outgroup should ideally be a somewhat distantly related sequence, but not so distant as to have no similarity to the sequences to be analyzed. In this example, I may use one of the bacterial Hsp60 sequences as the outgroup. There's also no problem with having an unrooted tree in most cases, but dnapars will use species 1 as default if you don't select one yourself
M – Analyze multiple data sets? Since we're doing 1000 bootstrap simulations, set this to 1000 data sets.
Maximum parsimony is quite a bit slower than N-J. Bootstrapping 1000 N-J simulations takes about a minute on my old early 2006 Macbook. Bootstrapping 1000 MP simulations on the same machine will take about three and a half hours.
When it's finally done, it will spit out a file with all 1000 generated phylogenetic trees. We now need to find the consensus tree using the consense package. Again, it will ask for the input file (the very large file containing all the output trees) and give a menu of options. It should be self explanatory.
Now to visualize the output in Archaeopteryx:
The blue numbers indicate the bootstrap statistics. It looks like all the HspA14 (and the ursid Hsp60) genes are clustered together, away from the actual Hsp60 sequences. If you want to edit anything, you'll have to export the file and edit it in a text editor. I prefer editing the tree as PhyloXML because markup is easier to read than NEWICK or NEXUS formats. You'll generally want the bootstrap statistics as a percentage, and we'll need to hand edit the clade names.
After some edits, it'll look like this:
And that's it!













































No comments:
Post a Comment