So with the copious amounts of work I need to do this month in mind, I decided to look into various strategies to increase my focus and decrease my ADD-fueled procrastination.
I'm a big fan of the Pomodoro technique, wherein you study/work for 25 minutes, then take a 5-minute break. I downloaded the Pomodoro app (http://pomodoro.ugolandini.com/), a neat utility that times the intervals and keeps statistics on your productivity, and pledged to myself to use it. I used to use Focus Booster (http://www.focusboosterapp.com/) for timing purposes, but it was a little too bare-bones for my liking.
One big time sink was reading my RSS feeds a couple times an hour. I decided to delete many feeds, especially the ones that were just spamming a lot of random, uninteresting crap (Engadget, I'm looking at you), and kept just a select few. I also unpinned and closed my Google Reader tab.
Finally, I downloaded a program called Isolator (http://willmore.eu/software/isolator/), which simply obscures the windows in the background. Apparently, it helps maintain focus. I'll report back on its effectiveness.
Friday, 19 August 2011
Biologists' discovery may force revision of biology textbooks
What a silly headline for a not-so-earth-shattering finding: http://www.physorg.com/news/2011-08-biologists-discovery-biology-textbooks.html. Technically, shouldn't textbooks be re-written after any new finding?
Sunday, 12 June 2011
Rib Rub Recipe
I haven't updated in a while, but that's more a function of me doing little else but hormone saturation assays for the past couple weeks. I did manage to invent a rub for chicken and ribs:
2 tablespoons garlic powder
1 tablespoon chili powder
1.5 tablespoons Sriracha sauce
1.5 tablespoons sesame oil
1 tablespoon salt
1 tablespoon pepper
It's pretty fantastic, if I say so myself.
Monday, 30 May 2011
Non-linear Regression
The extent to which a protein – with one binding site – binds its ligand is related to the concentrations of the free and bound receptors, and to the concentration of the free ligand. The equation describing these relationships is similar to the Michaelis-Menten equation, which describes enzyme kinetics. The binding equation is:

SB = Bmax[L] / (Kd + [L])
Where SB is defined as specific binding of the ligand to the protein, Bmax is the total protein concentration (or free and bound protein concentration), [L] is the free ligand concentration, and Kd is the dissociation rate constant (the ratio between the backward and forward reaction rate constants).
Usually, Kd is the one unknown variable in the equation that is of interest to me. To determine its value, I'll run a saturation binding assay in which I mix the protein with a dilution series of its ligand and measure the amount of protein-ligand complexes. The Kd can then be determined by Scatchard plotting or non-linear regression, the latter method of which is considered to be the gold standard.
I work with a protein called corticosteroid binding globulin (aka transcortin, CBG, and serpin peptidase inhibitor clade A member 6). One of its ligands in most mammals is cortisol, and the interplay between this protein-ligand pair plays significant roles in modulating the stress response, the immune system, and energy balance, amongst other systems. I use tritiated cortisol as a tracer molecule in a saturation binding assay to determine the binding kinetics of cortisol to CBG.
After running the assay, I get specific binding data in counts (of beta decays) per minute, dependent on the free ligand concentration. An example data set is shown:

Dilution    SB      L
A         1134  46.28
B          798  22.48
C          656  10.95
D          493   5.24
E          305   2.50
F          198   1.25

A plot of the data (plot()) shows a hyperbolic relationship between SB and [L].

One practical interpretation of Kd is that at a ligand concentration equal to the Kd, half of the receptors are bound to ligand. The lower the Kd, the higher the affinity of the receptor for its ligand, and vice versa.

In a Scatchard plot, I plot specific binding per free ligand concentration on the y-axis against specific binding on the x-axis. This transformation results in a roughly linear plot that is suitable for linear regression analysis.

I'll use the lm() command to run a linear regression and abline() to plot it on the above scatterplot. These are the results:
Call:
lm(formula = sbl.518 ~ sb.518.3, data = dfl.518)

Residuals:
      1       2       3       4       5       6
 19.697 -17.888 -14.008  -3.400  -2.666  18.264

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 168.76344   15.59183  10.824 0.000413 ***
sb.518.3     -0.14458    0.02312  -6.253 0.003335 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.72 on 4 degrees of freedom
Multiple R-squared: 0.9072,    Adjusted R-squared: 0.884
F-statistic: 39.1 on 1 and 4 DF,  p-value: 0.003335
The absolute value of the slope of the line (the sb.518.3 coefficient) is the reciprocal of the Kd of CBG, so Kd = 1/0.14458 ≈ 6.92. I can also calculate Bmax as the y-axis intercept (Intercept) divided by the absolute value of the slope. The adjusted R-squared value is not very high (0.884), which suggests to me that this Kd is a ballpark estimate of the real Kd.
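The whole Scatchard workflow can be sketched end to end in R using the example data from this post. The variable names below are my own, not the dfl.518 objects above:

```r
# Example data from this post: specific binding (SB, cpm) vs. free ligand ([L])
scat <- data.frame(
  SB = c(1134, 798, 656, 493, 305, 198),
  L  = c(46.28, 22.48, 10.95, 5.24, 2.50, 1.25)
)

# Scatchard transformation: SB/[L] against SB is roughly linear
scat$SB.L <- scat$SB / scat$L

fit <- lm(SB.L ~ SB, data = scat)
plot(SB.L ~ SB, data = scat)
abline(fit)

# Slope of the Scatchard line is -1/Kd; the x-intercept is Bmax
slope     <- coef(fit)[["SB"]]
intercept <- coef(fit)[["(Intercept)"]]
Kd   <- -1 / slope               # ~6.92
Bmax <- intercept / abs(slope)   # ~1167
```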
The other way to determine Kd is by non-linear regression. This method is considered 'better' than Scatchard analysis because I can define the equation of the curve to be fitted to my data, rather than transforming my data to fit a straight line. This can be done in R using the nls() command:
nls.518 <- nls(SB ~ Bmax * L / (Kd + L), data = dfl.518, start = list(Bmax = 1167, Kd = 6.92))
So in the nls() command, I define the equation to be used to fit my data as SB ~ Bmax * L / (Kd + L), and I also define starting values for Bmax and Kd as the values I derived from my Scatchard analysis. R needs these values as the starting point to its analysis. My results are:
Formula: SB ~ Bmax * L / (Kd + L)
Parameters:
Estimate Std. Error t value Pr(>|t|)
Bmax 982.2157 17.2532 56.93 1.19e-05 ***
Kd 5.3959 0.2556 21.11 0.000233 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.391 on 3 degrees of freedom
Number of iterations to convergence: 3
Achieved convergence tolerance: 5.493e-06
So by using non-linear regression analysis, I've determined the Kd and Bmax to be 5.40 and 982.2, respectively, which are quite different from the values I obtained using Scatchard plotting. The p-values are highly significant, and the residuals are fantastic.
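The non-linear fit is reproducible from the data in this post; a self-contained sketch (again with my own variable names), including an overlay of the fitted curve on the raw data:

```r
# Same example data as above
sat <- data.frame(
  SB = c(1134, 798, 656, 493, 305, 198),
  L  = c(46.28, 22.48, 10.95, 5.24, 2.50, 1.25)
)

# Non-linear least squares fit of the binding equation,
# seeded with the Scatchard estimates
fit.nls <- nls(SB ~ Bmax * L / (Kd + L), data = sat,
               start = list(Bmax = 1167, Kd = 6.92))
coef(fit.nls)   # Bmax ~982, Kd ~5.4

# Overlay the fitted hyperbola on the raw data
plot(SB ~ L, data = sat)
grid.L <- seq(0, max(sat$L), length.out = 100)
lines(grid.L, predict(fit.nls, newdata = data.frame(L = grid.L)))
```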
Monday, 23 May 2011
More R
Another day, another round of new stuff learned about R. I've written a function that will calculate the means of a vector (y) aggregated by factor (z).
> avg = function(y,z) {
var1 = aggregate(y, by = list(z), FUN = mean, na.rm=T)
var2 = var1$x
}

The function returns a vector with the means. Likewise, I wrote a function for the calculation of the standard error of the means (SEM):
> sem = function(y,z) {
var1 = aggregate(y, by = list(z), FUN = sd, na.rm=T)
var2 = aggregate(y, by = list(z), FUN = length)
var3 = var1$x / sqrt(var2$x)
}

So with these functions, I can easily calculate the means + SEM for a large dataset. For example:
> data.means = data.frame(avg(y1,z),avg(y2,z),avg(y3,z),avg(y4,z),avg(y5,z),avg(y6,z))
> data.sems = data.frame(sem(y1,z),sem(y2,z),sem(y3,z),sem(y4,z),sem(y5,z),sem(y6,z))

I can give the rows and columns appropriate headers like so:
> rownames(data.means) = c("Hibernation", "Post-hibernation", "Pre-hibernation")
> colnames(data.means) = c("endof", "dsb", "dmcbc", "dfreef", "csb", "cfreef", "cmcbc")

So the formatted data frame will look like so:
> data.means
                     endof      dsb    dmcbc    dfreef      csb    cfreef    cmcbc
Hibernation       51.87500 49.06250 0.596250 22.851250 82.16667 11.875000 0.950000
Post-hibernation  33.86667 82.37778 1.233333  9.591111 85.81429  3.020000 1.335714
Pre-hibernation   13.60000 45.46250 0.551250  4.858750 84.30000  1.038333 1.058333

I want to rearrange the data so that Pre-hibernation comes first, then Hibernation and Post-hibernation. I'll first assign a variable with my desired order, then do the rearrangement.
> order = c("Pre-hibernation", "Hibernation", "Post-hibernation")
> data.means = data.means[order, , drop=F]

Now I want to plot all these data in a single bar plot. I'll have to convert the data frames into matrices because barplot() can't use data frames.
> data.means = data.matrix(data.means, rownames.force = T)
> data.sems = data.matrix(data.sems, rownames.force = T)

Now to plot the means, SEM, and add a legend to the top right of the plot:
> means.bp = barplot(data.means, main = "FreeF By State", ylab = "FreeF", xlab = "State", ylim = range(0,120), beside = T, names.arg = colnames(data.means), axis.lty = 1,col = gray.colors(3))
> error.bar(means.bp,data.means,data.sems)
> legend("topright", rownames(data.means), fill=gray.colors(3))

I've added the argument beside = T to the bar plot so that R will plot the data grouped into the columns present in the data matrix. The arguments col and fill do practically the same thing in barplot() and legend(), respectively: they designate the colour scheme of the bars. In this case, I'm using the preset grayscale colour scheme in R.
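A self-contained toy run of the two helpers plus the grouped bar plot, with made-up data (I've made the functions' return values explicit rather than relying on R's invisible assignment return):

```r
# avg()/sem() as above, with the return value made explicit
avg <- function(y, z) aggregate(y, by = list(z), FUN = mean, na.rm = TRUE)$x
sem <- function(y, z) {
  s <- aggregate(y, by = list(z), FUN = sd, na.rm = TRUE)$x
  n <- aggregate(y, by = list(z), FUN = length)$x
  s / sqrt(n)
}

# Fake data: two measurements across three states
set.seed(1)
z  <- rep(c("Hibernation", "Post-hibernation", "Pre-hibernation"), each = 6)
y1 <- rnorm(18, 50, 10)
y2 <- rnorm(18, 80, 15)

data.means <- data.matrix(data.frame(y1 = avg(y1, z), y2 = avg(y2, z)))
data.sems  <- data.matrix(data.frame(y1 = sem(y1, z), y2 = sem(y2, z)))
rownames(data.means) <- sort(unique(z))  # aggregate() returns groups in sorted order

# Grouped bars with positive-only SEM whiskers
bp <- barplot(data.means, beside = TRUE, ylim = c(0, 120),
              col = gray.colors(3), legend.text = rownames(data.means))
arrows(bp, data.means + data.sems, bp, data.means,
       angle = 90, code = 1, length = 0.1)
```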
Annotating Bar Plots with Statistical Significance
It only took me a few hours to figure this out. If I'm using 1-way ANOVAs as a statistical test to determine differences in the means of endogenous cortisol concentrations between animals of different sex classes, I can annotate the resulting bar plot with letters denoting significant differences. For example, I'll be using this data set (stored in the variable examp):
Sample.ID Sex.Class EndoF
1 1 Male 134.70700
2 2 Male 112.79506
3 3 Male 83.42866
4 4 Male 66.25100
5 5 Male 111.49403
6 6 Male 105.33600
7 7 Male 68.47000
8 8 Male 68.32500
9 9 Male 100.12880
10 10 Male 90.18200
11 11 Solitary Female 62.52132
12 12 Solitary Female 66.99200
13 13 Solitary Female 77.37900
14 14 Solitary Female 40.61669
15 15 Solitary Female 74.61191
16 16 Solitary Female 29.36257
17 17 Solitary Female 93.74400
18 18 Solitary Female 52.07800
19 19 Solitary Female 50.51210
20 20 Solitary Female 56.12120
21 21 Adult Female 77.55365
22 22 Adult Female 86.24100
23 23 Adult Female 110.45902
24 24 Adult Female 97.06417
25 25 Adult Female 142.45900
26 26 Adult Female 169.85475
27 27 Adult Female 175.39755
28 28 Adult Female 145.52400
29 29 Adult Female 99.12721
30 30 Adult Female 110.12300
I'll calculate and store the means and SEM as per my earlier post. To determine if there are significant differences in the means of EndoF between Sex.Class, I'll punch in the command:
examp.anova = aov(EndoF ~ Sex.Class, data = examp)
This will store the ANOVA data in examp.anova. The summary of the analysis shows:
Df Sum Sq Mean Sq F value Pr(>F)
Sex.Class 2 18666 9333.0 13.502 8.621e-05 ***
Residuals 27 18663 691.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This suggests that there are significant differences! To determine in which sex classes endogenous F differs, I'll run a Tukey pairwise comparison analysis using glht() from the multcomp package.
examp.pairs <- glht(examp.anova, linfct = mcp(Sex.Class = "Tukey"))
The summary of which shows:
> summary(examp.pairs)
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: aov(formula = EndoF ~ Sex.Class, data = examp)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
Male - Adult Female == 0 -27.27 11.76 -2.319 0.070 .
Solitary Female - Adult Female == 0 -60.99 11.76 -5.187 <0.001 ***
Solitary Female - Male == 0 -33.72 11.76 -2.868 0.021 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
So endogenous F in Solitary Females is significantly different from that in the other classes. I'll plot these results in bar plot form with SEM.
> examp.bp = barplot(examp.means, main = "Endogenous Cortisol by Sex Class", ylab = "Serum Total Cortisol (ng Cortisol/mL Serum)", xlab = "Sex Class", ylim = range(0,200), names.arg = order, axis.lty = 1)
> error.bar(examp.bp,examp.means,examp.sems)
To add letters displaying the statistical significance between groups, I'll just type in the commands:
letters = c("a","a","b")
text(x=examp.bp,y=examp.means+examp.sems+10,label=letters)
This will add the letters manually to the bar plot 10 units above the SEM whiskers. There's a way to extract the pairwise comparison significance letters automatically using the cld() function.
examp.cld <- cld(examp.pairs)
This will store the appropriate letters in examp.cld$mcletters$Letters. However, they're stored in a different order than in examp.means and examp.sems, so you just need to reorder one of them to match. The final product:
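As an aside: multcomp's letters are handy, but if you only need the pairwise p-values, base R's TukeyHSD() runs the same Tukey comparisons on an aov fit without an extra package. A sketch with made-up toy data (not the examp set above):

```r
# Toy data in the same shape as examp: three sex classes, ten animals each
set.seed(1)
toy <- data.frame(
  Sex.Class = factor(rep(c("Male", "Solitary Female", "Adult Female"), each = 10)),
  EndoF = c(rnorm(10, 95, 25), rnorm(10, 60, 18), rnorm(10, 130, 30))
)

toy.anova <- aov(EndoF ~ Sex.Class, data = toy)
toy.tukey <- TukeyHSD(toy.anova)
print(toy.tukey)   # adjusted p-value for each pairwise contrast
```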
Wednesday, 11 May 2011
Bar Plots with Error Bars in R
I've given up on Sigmaplot and dove back into plotting my results in R. This documents my process for plotting means + standard error of the means (SEM).
First off, here's an example data set (FreeF.csv) in .csv format:
Sample,FreeF,State
1,63.46,A
2,0.5,B
3,0.33,C
4,0.16,B
5,0.45,C
6,0.1,C
7,0.18,B
8,0.6,A
9,2.01,A
10,16.84,C
11,1.27,C
12,0.14,B
13,1.75,C
14,5.13,B
15,0.84,A
16,0.82,A
17,3.52,A
18,0.4,C
19,0.12,B

I'll read this file into R:

> freef = read.csv(file="FreeF.csv",head=T,sep=",")

Next, we'll enter a function, borrowed and altered from Monkey's Uncle, that will plot error bars for me. This function plots a positive error bar only, unlike the original.
> error.bar = function(x, y, upper, length=0.1,...){
+ arrows(x,y+upper, x, y, angle=90, code=1, length=length, ...)
+ }

For this example data set, I have 3 states: A, B, and C. We'll have to calculate and store the FreeF means and standard errors for each state. We'll use the aggregate() function to do so:
> freef.mean = aggregate(freef$FreeF,by=list(freef$State),FUN=mean)
> freef.sd = aggregate(freef$FreeF,by=list(freef$State),FUN=sd)
> freef.n = aggregate(freef$FreeF,by=list(freef$State),FUN=length)
> freef.sem = freef.sd$x / sqrt(freef.n$x)

We can now plot the data. First, we plot the means:
> freef.bp = barplot(freef.mean$x, main="FreeF By State", ylab="FreeF", xlab="State", ylim=range(0,25), names.arg=freef.mean$Group.1, axis.lty=1)

This will plot a vertical bar plot of the FreeF means aggregated by State. I'll explain the arguments in the command:
main: Sets the title of the plot
ylab: Sets the y-axis label
xlab: Sets the x-axis label
ylim: Sets the y-axis limits
names.arg: Sets the x-axis labels for each State (A, B, and C)
axis.lty=1: Displays the x-axis in the same style as the y-axis (otherwise, it won't show the x-axis at all)
Now to add the error bars:
> error.bar(freef.bp,freef.mean$x,freef.sem)

Those are some massive-looking error bars. Oh well, back to lab.
Sunday, 1 May 2011
Constructing Phylogenetic Trees
Phylogenetic trees show the evolutionary relationships between nucleotide or amino acid sequences of homologous genes, within or between species, by analyzing the differences in the sequences and inferring the series of changes that led to the input data. Phylogenetic trees are usually constructed after sequencing a gene as a form of error checking: how does the constructed tree compare to the established phylogenetic relationships between the compared species?
For this example, the lab tech had cloned a heat shock protein 60 (Hsp60) cDNA from a grizzly bear cDNA library using primers designed against human Hsp60 (HSPD1). This cDNA was sequenced, and I was assigned to construct a phylogenetic tree. I will be compiling the known homologs of Hsp60 from NCBI Homologene, looking for other homologs not in the curated Homologene database using BLAST, aligning these sequences using ClustalX, generating a maximum parsimony tree using PHYLIP, and finally, rendering the tree using Archaeopteryx.
Hsp60 (AKA chaperonin, HSPD1) in mammals is primarily known as a mitochondrial protein that is involved in the import and proper folding of other mitochondrial proteins (see any number of reviews here). It is a highly conserved protein, homologs of which can be found in almost all forms of life, including bacteria, fungi, and plants. Recent research has found that Hsp60 can be found also in the cytosolic compartment (ref), where it is involved in the regulation of apoptosis (ref). Hsp60 is also found in peripheral circulation (ref). The concentrations of circulating Hsp60 may be indicative of some disease states, including coronary heart disease (ref), osteoporosis (ref), prostate cancer (ref), and psychosocial stress (ref). I'm interested in Hsp60 in ursids because it may be a good indicator of chronic stress status, which is one reason why we cloned and sequenced this protein.
First, I need to find sequences with which to compare ursid Hsp60. I will be looking for related sequences with BLAST.
I'll paste my ursid Hsp60 sequence into the big box and initiate a search in the non-redundant nucleotide database.
And I get results! But these don't look like Hsp60 sequences.
It doesn't matter at this point, anyways. I'll just download all of these sequences by scrolling down and checking the "Select all sequences" box and clicking "Get Selected Sequences".
That'll bring me to this screen.
To grab the FASTA-formatted sequences, click on "Display Settings" and then "FASTA (text)". Copy and paste the sequences to a text document. At this point, if you don't like the FASTA headers (for example: >gi|10923782|ref|XM_1281221) that came with the sequences, you should change them now to avoid headaches later. Just replace everything after the bracket (">") with whatever you like. I prefer using the genus or species name and the gene abbreviation (for example: ">UrsusHsp60"). Avoid spaces because some programs don't like them. Additionally, I like to remove redundant cDNA sequences that are derived from different tissues in the same species if I know that the sequences aren't tissue specific or don't care about intra-species differences.
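The space-scrubbing part of that header cleanup can be automated in R. A sketch with hypothetical file names; the toy FASTA at the top stands in for your downloaded sequences:

```r
# A toy FASTA to demonstrate on (in practice, the sequences you downloaded)
writeLines(c(">gi|10923782|ref|XM_1281221 Ursus arctos Hsp60", "ACGTACGT",
             ">NM_002156 Homo sapiens HSPD1", "ACGTACGT"), "seqs.fasta")

# Replace spaces in header lines with underscores; sequence lines are untouched
fa <- readLines("seqs.fasta")
hdr <- grepl("^>", fa)                    # header lines start with '>'
fa[hdr] <- gsub(" ", "_", fa[hdr])       # spaces trip up some programs
writeLines(fa, "seqs_renamed.fasta")
```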
So I didn't actually find any Hsp60 nucleotide sequences using BLAST. This strongly suggests that something is wrong with my sequence – for example, that we cloned the wrong gene! To confirm, I could get known Hsp60 sequences by looking in a curated database. In some cases, you might prefer to skip the BLAST step and begin in the curated database. I like to use NCBI Gene and Homologene.
Searching for HSPD1 in Gene gives us a list of Hsp60 genes in different species. Homo sapiens HSPD1 is a good place to start (human genes are generally better characterized than most other species). Clicking on the link brings us to the curated human HSPD1 page, which is full of useful information about the gene.
The Homologene link is near the bottom right of the window.
We can download the mRNA sequences of HSPD1 for all of these species by clicking the "Download" link on the right, and then selecting to download the mRNA sequences from the dropdown menu.
Copy and paste the resulting list to the text document we created earlier and edit the sequences as necessary. Now, the phylogenetics package PHYLIP requires input in the form of a file containing the multiple sequence alignment (MSA) of all the Hsp60 sequences. We'll do this in ClustalX 2.0.12. Load the file containing all the FASTA-formatted sequences.
The sequences will load up and look something like this. They're not aligned yet, so select "Do Complete Alignment" from the Alignment menu. You can choose where to save the output.
Here are the results. You can scroll to the right to see the rest of the alignment and/or open the .aln file.
If everything looks good, save the alignment as a PHYLIP file. It may give you a dialogue box warning you that the FASTA headers will be truncated. That's fine.
Here is what the PHYLIP-formatted MSA looks like.
We can also construct a phylogenetic tree in ClustalX, but the only supported methods are the distance-based neighbour-joining (N-J) and unweighted pair group method with arithmetic mean (UPGMA). These methods are fairly fast in comparison to character-based or Bayesian methods, but they're generally considered inferior in that they don't necessarily use all the information available in the sequences to construct their trees. Additionally, ClustalX supports bootstrapping with N-J trees.

Bootstrapping is a statistical method that runs the tree-building simulation over a user-specified number of iterations. With each iteration, the program randomly re-samples the MSA to generate the initial unresolved tree from which it bases its distance calculations. The final bootstrapped tree reports a bootstrap statistic for each branch, which indicates the proportion of bootstrap iterations that result in the reported branch. A higher proportion indicates higher confidence in the tree, although I don't think there's a standardized cut-off. Generally, most bootstrap simulations are run with 1000 or more iterations. In ClustalX, select "Bootstrap N-J Tree" from the Trees menu.
The Random number generator seed is a number that is used in the pseudo-random number generator to re-sample the MSA. I don't believe that the chosen number really affects anything, so just choose any number. The output .phb file looks like this:
This can be interpreted into a tree in Archaeopteryx.
We can see from the tree that our ursid Hsp60 sequence is clearly not closely related to the other Hsp60 sequences. In fact, it looks like we sequenced HspA14, which is a poorly characterized member of the Hsp70 family. We can play with the tree a bit to make it prettier, but I would rather calculate a maximum parsimony tree in PHYLIP before I bother.
PHYLIP is a command-line program with no GUI. I'm going to use the dnapars (maximum parsimony for DNA) package to find the most parsimonious phylogenetic tree. In this workflow, I will take the ClustalX MSA and use the seqboot package to randomly re-sample the MSA for a desired number of times (1000 bootstrap iterations, in this case). The sequence of the MSA will be re-ordered while keeping the alignment intact, or in other words, splitting the alignment into blocks of a specific size and randomly mixing them to form 1000 new sequences with the alignments intact. I'll feed these re-sampled MSAs into dnapars, and it'll construct a tree for each alignment. I'll then take these 1000 trees and construct a consensus tree in the consense package. Finally, I'll use Archaeopteryx to visualize the tree. When I first start seqboot, it'll ask for the input file (the ClustalX MSA):
Then it'll give you a menu with a few options. The major ones to worry about are:
J – Bootstrap, Jackknife, Permute, Rewrite? We want bootstrap.
R – How many replicates? Anything above 1000 iterations is good. You may want to run a bootstrap trial with less than 100 iterations through dnapars and consense to make sure everything looks alright before committing the time for the full 1000 iterations.
seqboot will then ask for a random number seed and the output file name (I'm using blastseqboot). When it's done, we can move on to dnapars.

dnapars will ask for the input file (the pre-bootstrapped output file from seqboot).
Then it'll ask for an output file name.
Then it'll give you a menu full of options. The important ones are:
J – Randomize input order of sequences? This has already been done in seqboot, so select No.
O – Outgroup root? If you have an outgroup, select the species number (in order of the input file). The outgroup should ideally be a somewhat distantly related sequence, but not so distant as to have no similarity to the sequences being analyzed. In this example, I may use one of the bacterial Hsp60 sequences as the outgroup. An unrooted tree is also fine in most cases, but dnapars will use species 1 as the default if you don't select one yourself.
M – Analyze multiple data sets? Since we're doing 1000 bootstrap simulations, set this to 1000 data sets.
Maximum parsimony is quite a bit slower than N-J. Bootstrapping 1000 N-J simulations takes about a minute on my old early 2006 Macbook. Bootstrapping 1000 MP simulations on the same machine will take about three and a half hours.
When it's finally done, it will spit out a file with all 1000 generated phylogenetic trees. We now need to find the consensus tree using the consense package. Again, it will ask for the input file (the very large file containing all the output trees) and give a menu of options. It should be self explanatory.
Now to visualize the output in Archaeopteryx:
The blue numbers indicate the bootstrap statistics. It looks like all the HspA14 (and the ursid Hsp60) genes are clustered together, away from the actual Hsp60 sequences. If you want to edit anything, you'll have to export the file and edit it in a text editor. I prefer editing the tree as PhyloXML because markup is easier to read than NEWICK or NEXUS formats. You'll generally want the bootstrap statistics as a percentage, and we'll need to hand edit the clade names.
After some edits, it'll look like this:
And that's it!