Monday, 5 September 2011

Procrastination Busting

So with the copious amounts of work I need to do this month in mind, I decided to look into various strategies to increase my focus and decrease my ADD-fueled procrastination.

I'm a big fan of the Pomodoro technique, wherein you study/work for 25 minutes, then take a 5 minute break. I downloaded the Pomodoro app (http://pomodoro.ugolandini.com/), which seems to be a cool utility that times the intervals and keeps statistics on your productivity, and I pledged to use it. I used to use Focus Booster (http://www.focusboosterapp.com/) for timing purposes, but it was a little too bare bones for my liking.

One big time sink was reading my RSS feeds a couple times an hour. I decided to delete many feeds, especially the ones that were just spamming a lot of random, uninteresting crap (Engadget, I'm looking at you), and kept just a select few. I also unpinned and closed my Google Reader tab.

Finally, I downloaded a program called Isolator (http://willmore.eu/software/isolator/), which simply obscures the windows in the background. Apparently, it helps to maintain focus. I'll report back on its effectiveness.

Sunday, 12 June 2011

Rib Rub Recipe

I haven't updated in a while, but that's more a function of me doing little else but hormone saturation assays for the past couple weeks. I did manage to invent a rub for chicken and ribs:
2 tablespoons garlic powder
1 tablespoon chili powder
1.5 tablespoons Sriracha sauce
1.5 tablespoons sesame oil
1 tablespoon salt
1 tablespoon pepper

It's pretty fantastic, if I say so myself.

Monday, 30 May 2011

Non-linear Regression

The rate at which a protein – with one binding site – binds to its ligand is related to the concentrations of the free and bound receptors, and to the concentration of the free ligand. The equation describing these relationships is similar in form to the Michaelis-Menten equation, which describes enzyme kinetics. The binding equation is:

SB = Bmax[L] / (Kd + [L])

Where SB is defined as specific binding of the ligand to the protein, Bmax is the total protein concentration (free plus bound), [L] is the free ligand concentration, and Kd is the equilibrium dissociation constant (the ratio between the backward and forward reaction rate constants).


Usually, Kd is the one unknown variable in the equation that is of interest to me. To determine its value, I'll run a saturation binding assay in which I mix the protein with a dilution series of its ligand and measure the amount of protein-ligand complexes. The Kd can then be determined by Scatchard plotting or non-linear regression, the latter of which is considered the gold standard.


I work with a protein called corticosteroid binding globulin (aka transcortin, CBG, and serpin peptidase inhibitor clade A member 6). One of its ligands in most mammals is cortisol, and the interplay between this protein-ligand pair plays significant roles in modulating the stress response, the immune system, and energy balance, amongst other systems. I use tritiated cortisol as a tracer molecule in a saturation binding assay to determine the binding kinetics of cortisol to CBG.


After running the assay, I get specific binding data in counts (of beta decays) per minute dependent on the free ligand concentration. An example data set is shown:
Dilution    SB      L
A         1134  46.28
B          798  22.48
C          656  10.95
D          493   5.24
E          305   2.50
F          198   1.25
A plot of the data (via plot()) shows a hyperbolic, saturating relationship between SB and [L].
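For the record, the example data can be entered and eyeballed in a couple of lines (the data frame name dfl.518 mirrors the regression output; the column names are my assumption):

```r
# Example saturation binding data: SB in counts per minute, L = free ligand concentration
dfl.518 <- data.frame(
  Dilution = c("A", "B", "C", "D", "E", "F"),
  SB = c(1134, 798, 656, 493, 305, 198),
  L  = c(46.28, 22.48, 10.95, 5.24, 2.50, 1.25)
)

# Specific binding against free ligand concentration
plot(SB ~ L, data = dfl.518, xlab = "[L]", ylab = "SB (cpm)")
```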




In a Scatchard plot, I plot specific binding per free ligand concentration on the y-axis against specific binding on the x-axis. This transformation results in a roughly linear plot that is suitable for linear regression analysis.
I'll use the lm() command to run a linear regression and abline() to plot it on the above scatterplot. These are the results:



Call:
lm(formula = sbl.518 ~ sb.518.3, data = dfl.518)

Residuals:
      1       2       3       4       5       6 
 19.697 -17.888 -14.008  -3.400  -2.666  18.264 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 168.76344   15.59183  10.824 0.000413 ***
sb.518.3     -0.14458    0.02312  -6.253 0.003335 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.72 on 4 degrees of freedom
Multiple R-squared: 0.9072,  Adjusted R-squared: 0.884 
F-statistic: 39.1 on 1 and 4 DF,  p-value: 0.003335

The absolute value of the slope of the line (the sb.518.3 coefficient) is the reciprocal of the Kd of CBG, so Kd = 1/0.14458 = 6.92. I can also calculate Bmax as the y-intercept (Intercept) divided by the absolute value of the slope. The adjusted R-squared value is not very high (0.884), which suggests to me that this Kd is a ballpark estimate of the real Kd.
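The whole Scatchard fit is only a few lines. A self-contained sketch (variable names differ from my actual session):

```r
# Example saturation binding data: SB in cpm, L = free ligand concentration
dfl.518 <- data.frame(
  SB = c(1134, 798, 656, 493, 305, 198),
  L  = c(46.28, 22.48, 10.95, 5.24, 2.50, 1.25)
)

# Scatchard transformation: SB/[L] on the y-axis, SB on the x-axis
dfl.518$SBL <- dfl.518$SB / dfl.518$L

scat <- lm(SBL ~ SB, data = dfl.518)

slope     <- abs(coef(scat)["SB"])      # |slope| = 1/Kd
intercept <- coef(scat)["(Intercept)"]  # intercept = Bmax/Kd

Kd   <- unname(1 / slope)          # ~6.92
Bmax <- unname(intercept / slope)  # ~1167
```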

The other way to determine Kd is by non-linear regression. This method is considered better than Scatchard analysis because I can fit the binding equation directly to my data, rather than transforming the data to fit a straight line. This can be done in R using the nls() command:
nls.518 <- nls(SB ~ Bmax * L / (Kd + L), data = dfl.518, start = list(Bmax = 1167, Kd = 6.92))
So in the nls() command, I define the equation to be used to fit my data as SB ~ Bmax * L / (Kd + L), and I also define starting values for Bmax and Kd as the values I derived from my Scatchard analysis. R needs these values as the starting point to its analysis. My results are:
Formula: SB ~ Bmax * L / (Kd + L)

Parameters:
     Estimate Std. Error t value Pr(>|t|)    
Bmax 982.2157    17.2532   56.93 1.19e-05 ***
Kd     5.3959     0.2556   21.11 0.000233 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.391 on 3 degrees of freedom

Number of iterations to convergence: 3
Achieved convergence tolerance: 5.493e-06
So by using non-linear regression analysis, I've determined the Kd and Bmax to be 5.40 and 982.2, respectively, which are quite different from the values I obtained by Scatchard plotting. The p-values are highly significant, and the residual standard error (9.39) is much lower than in the Scatchard fit (17.72).

So what does it all mean? One practical interpretation of Kd is that at a free ligand concentration equal to the Kd, half of the receptors are bound to ligand. The lower the Kd, the higher the affinity of the receptor for its ligand, and vice versa.
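A quick numerical check of that interpretation, plugging the fitted values back into the binding equation (a sketch using the rounded estimates from above):

```r
Bmax <- 982.2  # fitted total binding capacity
Kd   <- 5.40   # fitted equilibrium dissociation constant

# Specific binding when the free ligand concentration equals Kd
sb.at.kd <- Bmax * Kd / (Kd + Kd)

all.equal(sb.at.kd, Bmax / 2)  # TRUE: half the sites are occupied at [L] = Kd
```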

Monday, 23 May 2011

More R

Another day, another round of new stuff learned about R. I've written a function that will calculate the means of a vector (y) aggregated by factor (z).
> avg = function(y,z) {
var1 = aggregate(y, by = list(z), FUN = mean, na.rm=T)
var1$x
}
The function returns a vector with the means. Likewise, I wrote a function for the calculation of standard error of the means (SEM):

> sem = function(y,z) {
var1 = aggregate(y, by = list(z), FUN = sd, na.rm=T)
var2 = aggregate(y, by = list(z), FUN = length)
var1$x / sqrt(var2$x)
}
So with these functions, I can easily calculate the means + SEM for a large dataset. For example:
> data.means = data.frame(avg(y1,z),avg(y2,z),avg(y3,z),avg(y4,z),avg(y5,z),avg(y6,z),avg(y7,z))
> data.sems = data.frame(sem(y1,z),sem(y2,z),sem(y3,z),sem(y4,z),sem(y5,z),sem(y6,z),sem(y7,z))
I can give the rows and columns appropriate headers like so:
> rownames(data.means) = c("Hibernation", "Post-hibernation", "Pre-hibernation")
> colnames(data.means) = c("endof", "dsb", "dmcbc", "dfreef", "csb", "cfreef", "cmcbc")
So the formatted data frame will look like so:

> data.means
                     endof      dsb    dmcbc    dfreef      csb    cfreef    cmcbc
Hibernation       51.87500 49.06250 0.596250 22.851250 82.16667 11.875000 0.950000
Post-hibernation  33.86667 82.37778 1.233333  9.591111 85.81429  3.020000 1.335714
Pre-hibernation   13.60000 45.46250 0.551250  4.858750 84.30000  1.038333 1.058333
I want to rearrange the data so that Pre-hibernation comes first, then Hibernation and Post-hibernation. I'll first assign a variable with my desired order, then do the rearrangement.
> order = c("Pre-hibernation", "Hibernation", "Post-hibernation")
> data.means = data.means[order, , drop=F]
Now I want to plot all these data in a single bar plot. I'll have to convert the data frames into matrices because barplot() can't use data frames.
> data.means = data.matrix(data.means, rownames.force = T)
> data.sems = data.matrix(data.sems, rownames.force = T)
Now to plot the means, SEM, and add a legend to the top right of the plot:
> means.bp = barplot(data.means, main = "FreeF By State", ylab = "FreeF", xlab = "State", ylim = range(0,120), beside = T, names.arg = colnames(data.means), axis.lty = 1,col = gray.colors(3))
> error.bar(means.bp,data.means,data.sems)
> legend("topright", rownames(data.means), fill=gray.colors(3))
I've added the argument beside = T to the bar plot so that R will plot the data grouped into the columns present in the data matrix. So the arguments col and fill  do practically the same thing in barplot() and legend(), respectively: designate the colour scheme of the bars. In this case, I'm using the preset grayscale colour scheme in R.
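Before moving on, a quick sanity check that avg() and sem() behave as expected, using made-up numbers (toy values, not my real dataset):

```r
avg = function(y, z) {
  var1 = aggregate(y, by = list(z), FUN = mean, na.rm = TRUE)
  var1$x  # vector of group means
}

sem = function(y, z) {
  var1 = aggregate(y, by = list(z), FUN = sd, na.rm = TRUE)
  var2 = aggregate(y, by = list(z), FUN = length)
  var1$x / sqrt(var2$x)  # SEM = SD / sqrt(n)
}

y = c(1, 3, 5, 7)          # toy measurements
z = c("A", "A", "B", "B")  # toy grouping factor

avg(y, z)  # 2 6
sem(y, z)  # 1 1
```

Group A has mean 2 and SD sqrt(2) over n = 2, so its SEM works out to exactly 1, and likewise for group B.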

Annotating Bar Plots with Statistical Significance

It only took me a few hours to figure this out. If I'm using 1-way ANOVAs as a statistical test to determine differences in the means of endogenous cortisol concentrations between animals of different sex classes, I can annotate the resulting bar plot with letters denoting significant differences. For example, I'll be using this data set (stored in the variable examp):
   Sample.ID       Sex.Class     EndoF
1          1            Male 134.70700
2          2            Male 112.79506
3          3            Male  83.42866
4          4            Male  66.25100
5          5            Male 111.49403
6          6            Male 105.33600
7          7            Male  68.47000
8          8            Male  68.32500
9          9            Male 100.12880
10        10            Male  90.18200
11        11 Solitary Female  62.52132
12        12 Solitary Female  66.99200
13        13 Solitary Female  77.37900
14        14 Solitary Female  40.61669
15        15 Solitary Female  74.61191
16        16 Solitary Female  29.36257
17        17 Solitary Female  93.74400
18        18 Solitary Female  52.07800
19        19 Solitary Female  50.51210
20        20 Solitary Female  56.12120
21        21    Adult Female  77.55365
22        22    Adult Female  86.24100
23        23    Adult Female 110.45902
24        24    Adult Female  97.06417
25        25    Adult Female 142.45900
26        26    Adult Female 169.85475
27        27    Adult Female 175.39755
28        28    Adult Female 145.52400
29        29    Adult Female  99.12721
30        30    Adult Female 110.12300
I'll calculate and store the means and SEM as per my earlier post. To determine if there are significant differences in the means of EndoF between Sex.Class, I'll punch in the command:
examp.anova = aov(EndoF ~ Sex.Class, data = examp)
This will store the ANOVA data in examp.anova. The summary of the analysis shows:
            Df Sum Sq Mean Sq F value    Pr(>F)  
Sex.Class    2  18666  9333.0  13.502 8.621e-05 ***
Residuals   27  18663   691.2                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
This suggests that there are significant differences! To determine which sex classes differ in endogenous F, I'll run a Tukey pairwise comparison using glht() from the multcomp package.
examp.pairs <- glht(examp.anova, linfct = mcp(Sex.Class = "Tukey"))
The summary of which shows:
> summary(examp.pairs)
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = EndoF ~ Sex.Class, data = examp)
Linear Hypotheses:
                                    Estimate Std. Error t value Pr(>|t|)  
Male - Adult Female == 0              -27.27      11.76  -2.319    0.070 .
Solitary Female - Adult Female == 0   -60.99      11.76  -5.187   <0.001 ***
Solitary Female - Male == 0           -33.72      11.76  -2.868    0.021 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
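As an aside, base R can run the same pairwise comparisons without multcomp, via TukeyHSD() on the aov fit. A self-contained sketch using the example data (values rounded to three decimals):

```r
# Example data: 10 animals per sex class
examp <- data.frame(
  Sex.Class = factor(rep(c("Male", "Solitary Female", "Adult Female"), each = 10)),
  EndoF = c(134.707, 112.795,  83.429,  66.251, 111.494, 105.336,  68.470,  68.325, 100.129,  90.182,
             62.521,  66.992,  77.379,  40.617,  74.612,  29.363,  93.744,  52.078,  50.512,  56.121,
             77.554,  86.241, 110.459,  97.064, 142.459, 169.855, 175.398, 145.524,  99.127, 110.123)
)

examp.anova <- aov(EndoF ~ Sex.Class, data = examp)
TukeyHSD(examp.anova)  # adjusted p-values for all three pairwise contrasts
```

The adjusted p-values differ slightly from glht()'s single-step method, but the conclusions are the same.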
So endogenous F in Solitary Females is significantly different from that in the other two classes. I'll plot these results in bar plot form with SEM.
> examp.bp = barplot(examp.means, main = "Endogenous Cortisol by Sex Class", ylab = "Serum Total Cortisol (ng Cortisol/mL Serum)", xlab = "Sex Class", ylim = range(0,200), names.arg = order, axis.lty = 1)
> error.bar(examp.bp,examp.means,examp.sems)
To add letters displaying the statistical significance between groups, I'll just type in the commands:
letters = c("a","a","b") 
text(x=examp.bp,y=examp.means+examp.sems+10,label=letters)
This will add the letters manually to the bar plot 10 units above the SEM whiskers. There's a way to extract the pairwise comparison significance letters automatically using the cld() function.
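Either way, lining the letters up with the bars boils down to indexing a named vector by the plotting order. A sketch with hypothetical group names and letters (adapt to your own factor levels):

```r
# Letters keyed by group name (cld() returns something similar, in alphabetical level order)
sig.letters <- c("Adult Female" = "a", "Male" = "a", "Solitary Female" = "b")

# The left-to-right order of the bars in the plot
bar.order <- c("Male", "Solitary Female", "Adult Female")

# Index by name to match letters to bars
sig.letters[bar.order]
```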
examp.cld <- cld(examp.pairs)
This will store the appropriate letters in examp.cld$mcletters$Letters. However, they're stored in a different order than in examp.means and examp.sems, so you just need to reorder one or the other. The final product:

Wednesday, 11 May 2011

Bar Plots with Error Bars in R

I've given up on SigmaPlot and dived back into plotting my results in R. This post documents my process for plotting means + standard error of the mean (SEM).

First off, here's an example data set (FreeF.csv) in .csv format:
Sample,FreeF,State
1,63.46,A
2,0.5,B
3,0.33,C
4,0.16,B
5,0.45,C
6,0.1,C
7,0.18,B
8,0.6,A
9,2.01,A
10,16.84,C
11,1.27,C
12,0.14,B
13,1.75,C
14,5.13,B
15,0.84,A
16,0.82,A
17,3.52,A
18,0.4,C
19,0.12,B
I'll read this file into R:
> freef = read.csv(file="FreeF.csv",head=T,sep=",")
Next, we'll enter a function that I've borrowed from Monkey's Uncle and altered so that it plots error bars for me. This version plots a positive error bar only, unlike the original function.
> error.bar = function(x, y, upper, length=0.1,...){
+ arrows(x,y+upper, x, y, angle=90, code=1, length=length, ...) 
+ }
For this example data set, I have 3 states: A, B, and C. We'll have to calculate and store the FreeF means and standard errors for each state. We'll use the aggregate() function to do so:
> freef.mean = aggregate(freef$FreeF,by=list(freef$State),FUN=mean)
> freef.sd = aggregate(freef$FreeF,by=list(freef$State),FUN=sd)
> freef.n = aggregate(freef$FreeF,by=list(freef$State),FUN=length)
> freef.sem = freef.sd$x / sqrt(freef.n$x)
 We can now plot the data. First, we plot the means:
> freef.bp = barplot(freef.mean$x, main="FreeF By State", ylab="FreeF", xlab="State", ylim=range(0,25),  names.arg=freef.mean$Group.1, axis.lty=1)
This will plot a vertical bar plot for the FreeF means aggregated by State. I'll explain the arguments in the command:
main: Sets the title of the plot
ylab: Sets the y-axis label
xlab: Sets the x-axis label
ylim: Sets the y-axis limits
names.arg: Sets the x-axis labels for each State (A, B, and C)
axis.lty=1: Displays the x-axis in the same style as the y-axis (otherwise, it won't show the x-axis at all)
 Now to add the error bars:
> error.bar(freef.bp,freef.mean$x,freef.sem)
 Those are some massive-looking error bars. Oh well, back to lab.

Sunday, 1 May 2011

Constructing Phylogenetic Trees

Phylogenetic trees show the evolutionary relationships between nucleotide or amino acid sequences of homologous genes, within or between species, by analyzing the differences in the sequences and inferring the series of changes that led to the input data. Phylogenetic trees are usually constructed after sequencing a gene as a form of error checking: how does the constructed tree compare to the established phylogenetic relationships between the compared species?

For this example, the lab tech had cloned a heat shock protein 60 (Hsp60) cDNA from a grizzly bear cDNA library using primers designed against human Hsp60 (HSPD1). This cDNA was sequenced, and I was assigned to construct a phylogenetic tree. I will be compiling the known homologs of Hsp60 from NCBI Homologene, looking for other homologs not in the curated Homologene database using BLAST, aligning these sequences using ClustalX, generating a maximum parsimony tree using PHYLIP, and finally, rendering the tree using Archaeopteryx.

Hsp60 (AKA chaperonin, HSPD1) in mammals is primarily known as a mitochondrial protein that is involved in the import and proper folding of other mitochondrial proteins (see any number of reviews here). It is a highly conserved protein, homologs of which can be found in almost all forms of life, including bacteria, fungi, and plants. Recent research has found that Hsp60 can be found also in the cytosolic compartment (ref), where it is involved in the regulation of apoptosis (ref). Hsp60 is also found in peripheral circulation (ref). The concentrations of circulating Hsp60 may be indicative of some disease states, including coronary heart disease (ref), osteoporosis (ref), prostate cancer (ref), and psychosocial stress (ref). I'm interested in Hsp60 in ursids because it may be a good indicator of chronic stress status, which is one reason why we cloned and sequenced this protein.

First, I need to find sequences with which to compare ursid Hsp60. I will be looking for related sequences with BLAST.


We'll be using nucleotide BLAST.


I'll paste my ursid Hsp60 sequence into the big box and initiate a search in the non-redundant nucleotide database.


And I get results! But these don't look like Hsp60 sequences.



It doesn't matter at this point, anyways. I'll just download all of these sequences by scrolling down and checking the "Select all sequences" box and clicking "Get Selected Sequences".


That'll bring me to this screen.


To grab the FASTA-formatted sequences, click on "Display Settings" and then "FASTA (text)". Copy and paste the sequences to a text document. At this point, if you don't like the FASTA headers (for example: >gi|10923782|ref|XM_1281221) that came with the sequences, you should change them now to avoid headaches later. Just replace everything after the angle bracket (">") with whatever you like. I prefer using the genus or species name and the gene abbreviation (for example: ">UrsusHsp60"). Avoid spaces because some programs don't like them. Additionally, I like to remove redundant cDNA sequences that are derived from different tissues in the same species if I know that the sequences aren't tissue specific or don't care about intra-species differences.
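Hand editing gets tedious with many sequences, so here's a little helper sketch in R that swaps in new headers from a list (the function and file names are made up for illustration):

```r
# Replace every FASTA header line (starting with ">") with a new name,
# matched up in order of appearance
rename.headers <- function(infile, outfile, new.names) {
  lines <- readLines(infile)
  header.idx <- grep("^>", lines)
  stopifnot(length(header.idx) == length(new.names))
  lines[header.idx] <- paste0(">", new.names)
  writeLines(lines, outfile)
}

# e.g. rename.headers("hsp60.fasta", "hsp60_renamed.fasta",
#                     c("UrsusHsp60", "HomoHsp60"))
```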



So I didn't actually find any Hsp60 nucleotide sequences using BLAST. This strongly suggests that there's something wrong with my sequence; perhaps we cloned the wrong gene! To confirm, I could get known Hsp60 sequences by looking in a curated database. In some cases, you might prefer to skip the BLAST step and begin in the curated database. I like to use NCBI Gene and Homologene.



Searching for HSPD1 in Gene gives us a list of Hsp60 genes in different species. Homo sapiens HSPD1 is a good place to start (human genes are generally better characterized than most other species). Clicking on the link brings us to the curated human HSPD1 page, which is full of useful information about the gene.


The Homologene link is near the bottom right of the window.


We can download the mRNA sequences of HSPD1 for all of these species by clicking the "Download" link on the right, and then selecting to download the mRNA sequences from the dropdown menu.


Copy and paste the resulting list to the text document we created earlier and edit the sequences as necessary. Now, the phylogenetics package PHYLIP requires input in the form of a file containing the multiple sequence alignment (MSA) of all the Hsp60 sequences. We'll do this in ClustalX 2.0.12. Load the file containing all the FASTA-formatted sequences.


The sequences will load up and look something like this. They're not aligned yet, so select "Do Complete Alignment" from the Alignment menu. You can choose where to save the output.




Here are the results. You can scroll to the right to see the rest of the alignment and/or open the .aln file.



If everything looks good, save the alignment as a PHYLIP file. It may give you a dialogue box warning you that the FASTA headers will be truncated. That's fine.




Here is what the PHYLIP-formatted MSA looks like.


We can also construct a phylogenetic tree in ClustalX, but the only supported methods are the distance-based neighbour-joining (N-J) and unweighted pair group method with arithmetic mean (UPGMA). These methods are fairly fast in comparison to character-based or Bayesian methods, but they're generally considered inferior in that they don't necessarily use all the information available in the sequences to construct their trees.

Additionally, ClustalX supports bootstrapping with N-J trees. Bootstrapping is a statistical method that re-runs the tree building over a user-specified number of iterations. With each iteration, the program randomly re-samples the columns of the MSA and rebuilds the tree from the re-sampled alignment. The final bootstrapped tree reports a bootstrap statistic for each branch, which indicates the proportion of bootstrap iterations that produced that branch. A higher proportion indicates higher confidence in the branch, although I don't think there's a standardized cut-off. Generally, most bootstrap simulations are run with 1000 or more iterations. In ClustalX, select "Bootstrap N-J Tree" from the Trees menu.
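To make the re-sampling concrete, here's a toy illustration in R of what one bootstrap iteration does to an alignment (assuming the MSA is stored as a character matrix, rows = sequences, columns = aligned positions; the sequences are made up):

```r
set.seed(42)  # analogous to the random number generator seed the programs ask for

# A tiny fake MSA: 3 sequences, 8 aligned positions
msa <- matrix(c("A","T","G","C","A","T","G","C",
                "A","T","G","C","A","-","G","C",
                "A","C","G","C","A","T","G","T"),
              nrow = 3, byrow = TRUE)

# One bootstrap replicate: draw alignment columns with replacement
boot.cols <- sample(ncol(msa), replace = TRUE)
msa.boot  <- msa[, boot.cols]

dim(msa.boot)  # same dimensions as the original; some columns repeated, some dropped
```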



The random number generator seed is used to seed the pseudo-random number generator that re-samples the MSA. I don't believe the chosen number really affects anything, so just pick any number. The output .phb file looks like this:


This can be interpreted into a tree in Archaeopteryx.


We can see from the tree that our ursid Hsp60 sequence is clearly not closely related to the other Hsp60 sequences. In fact, it looks like we sequenced HspA14, which is a poorly characterized member of the Hsp70 family. We can play with the tree a bit to make it prettier, but I would rather calculate a maximum parsimony tree in PHYLIP before I bother.

PHYLIP is a command-line program with no GUI. I'm going to use the dnapars (maximum parsimony for DNA) package to find the most parsimonious phylogenetic tree. In this workflow, I will take the ClustalX MSA and use the seqboot package to re-sample it a desired number of times (1000 bootstrap iterations, in this case); seqboot draws alignment columns at random, with replacement, to build 1000 new pseudo-alignments of the same size while keeping the alignment itself intact. I'll feed these re-sampled MSAs into dnapars, which will construct a tree for each alignment. I'll then take these 1000 trees and construct a consensus tree with the consense package. Finally, I'll use Archaeopteryx to visualize the tree. When I first start seqboot, it'll ask for the input file (the ClustalX MSA):


Then it'll give you a menu with a few options. The major ones to worry about are:

J – Bootstrap, Jackknife, Permute, Rewrite?  We want bootstrap.
R – How many replicates? 1000 or more iterations is good. You may want to run a trial with fewer than 100 iterations through dnapars and consense to make sure everything looks alright before committing the time for the full 1000 iterations.


seqboot will then ask for a random number seed and the output file name (I'm using blastseqboot). When it's done, we can move on to dnapars.




dnapars will ask for the input file (the pre-bootstrapped output file from seqboot).


Then it'll ask for an output file name.


Then it'll give you a menu full of options. The important ones are:

J – Randomize input order of sequences? This has already been done in seqboot, so select No.
O – Outgroup root? If you have an outgroup, select the species number (in order of the input file). The outgroup should ideally be a somewhat distantly related sequence, but not so distant as to have no similarity to the sequences being analyzed. In this example, I may use one of the bacterial Hsp60 sequences as the outgroup. There's also no problem with having an unrooted tree in most cases, but dnapars will use species 1 as the default if you don't select one yourself.
M – Analyze multiple data sets? Since we're doing 1000 bootstrap simulations, set this to 1000 data sets.


Maximum parsimony is quite a bit slower than N-J. Bootstrapping 1000 N-J simulations takes about a minute on my old early-2006 MacBook; bootstrapping 1000 MP simulations on the same machine takes about three and a half hours.


When it's finally done, it will spit out a file with all 1000 generated phylogenetic trees. We now need to find the consensus tree using the consense package. Again, it will ask for the input file (the very large file containing all the output trees) and give a menu of options. It should be self explanatory.


Now to visualize the output in Archaeopteryx:


The blue numbers indicate the bootstrap statistics. It looks like all the HspA14 genes (including our ursid sequence) are clustered together, away from the actual Hsp60 sequences. If you want to edit anything, you'll have to export the file and edit it in a text editor. I prefer editing the tree as PhyloXML because the markup is easier to read than the NEWICK or NEXUS formats. You'll generally want the bootstrap statistics as percentages, and we'll need to hand edit the clade names.



After some edits, it'll look like this:



And that's it!