Sunday, April 3, 2016

Clustering data with k-means and plotting for exploratory analysis

I clustered data from my paper "A computational method for the systematic screening of reaction barriers in enzymes: searching for Bacillus circulans xylanase mutants with greater activity towards a synthetic substrate" using k-means and plotted the results with the `lattice` package in R.
This analysis could be used to further automate the selection of reaction barriers, so that a barrier whose profile is not physically meaningful can be excluded from later analysis.
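
To make the clustering step concrete, here is a minimal sketch using base R's `kmeans()`. It does not use the data from the paper: the two profile shapes, the number of mutants and the choice of two clusters are all made up for illustration, and it assumes each barrier profile is treated as one observation with one value per point on the reaction coordinate.

```r
# Hypothetical stand-in data: one row per mutant, one column per point
# along the reaction coordinate (12 points assumed).
set.seed(42)
n_points <- 12
x <- seq_len(n_points)

# Two made-up profile shapes: a single-peaked barrier and a steady climb.
peaked   <- function() 20 * exp(-((x - 6)^2) / 8) + rnorm(n_points, sd = 1)
climbing <- function() 3 * x + rnorm(n_points, sd = 1)

profiles <- rbind(t(replicate(10, peaked())),
                  t(replicate(10, climbing())))

# Cluster whole profiles; each cluster centre is itself a 12-point profile.
km <- kmeans(profiles, centers = 2, nstart = 20)

table(km$cluster)   # number of profiles assigned to each cluster
dim(km$centers)     # one centre profile per cluster
```

A cluster that collects profiles like the steadily climbing ones (no clear maximum) would be a candidate for automatic exclusion.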

Raw data:
http://pastebin.com/41TrNNvH

A detailed report on the analysis is available here:
http://rpubs.com/mzh/167769

Github repository:
https://github.com/mzhKU/Enzyme-Barrier-Clustering

The critical part of the code is where the two plots (the k-means cluster centres, red lines, and the raw data of each cluster, blue lines) are overlaid using a lattice panel function (in 003_execute.r):

library(lattice)

# 'groups' is required to prevent a continuous line from being drawn between
# the last data point of one mutant and the first data point of the next.
xyplot(Barrier_Cl ~ x | Cluster_fac, data=kmdf, iedf_i=iedf,
       ylim=c(-20, 50), ylab="Reaction Energy [kcal/mol]",
       xlab="Reaction Coordinate", xlim=c(1, 12), strip=TRUE,
       # Note:
       # 'x' and 'y' are the data from the cluster data frame 'kmdf'.
       # 'iedf_i' is the 'iedf' data frame passed through to the inner panel function.
       panel = function(x, y, iedf_i)
       {
            # Rows of the individual-barrier data that belong to this panel's cluster.
            sel <- iedf_i[iedf_i$Cluster == panel.number(), ]
            # Raw barriers (blue), one line per mutant id. 'groups' and
            # 'subscripts' must index the same rows as x and y.
            panel.xyplot(x=sel$x_axis, y=sel$Barrier_i,
                         groups=sel$id, subscripts=seq_len(nrow(sel)),
                         type="o", col="blue")
            # Cluster centre (red), drawn last so it sits on top.
            panel.xyplot(x, y, type="l", col="red", lwd=2)
       }
)
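
For anyone who wants to try the overlay without downloading the data, here is a self-contained sketch. The column names (`Barrier_Cl`, `x`, `Cluster_fac`, `x_axis`, `Barrier_i`, `id`, `Cluster`) follow the call above, but the values are synthetic and the way `kmdf` and `iedf` are built here is my assumption for illustration, not necessarily how the repository builds them.

```r
library(lattice)

set.seed(1)
n_points <- 12
x <- seq_len(n_points)

# Hypothetical data: 6 profiles, 3 per made-up shape.
profiles <- rbind(
  t(replicate(3, 20 * exp(-((x - 6)^2) / 8) + rnorm(n_points))),
  t(replicate(3, 3 * x + rnorm(n_points)))
)
km <- kmeans(profiles, centers = 2, nstart = 20)

# Long format, individual profiles: one row per (mutant, point).
iedf <- data.frame(
  id        = rep(seq_len(nrow(profiles)), each = n_points),
  x_axis    = rep(x, times = nrow(profiles)),
  Barrier_i = as.vector(t(profiles)),
  Cluster   = rep(km$cluster, each = n_points)
)

# Long format, cluster centres: one row per (cluster, point).
kmdf <- data.frame(
  x           = rep(x, times = 2),
  Barrier_Cl  = as.vector(t(km$centers)),
  Cluster_fac = factor(rep(1:2, each = n_points))
)

p <- xyplot(Barrier_Cl ~ x | Cluster_fac, data = kmdf, iedf_i = iedf,
            ylim = range(iedf$Barrier_i),
            panel = function(x, y, iedf_i) {
              # Individual profiles of this panel's cluster, one line per id.
              sel <- iedf_i[iedf_i$Cluster == panel.number(), ]
              panel.xyplot(sel$x_axis, sel$Barrier_i, groups = sel$id,
                           subscripts = seq_len(nrow(sel)),
                           type = "o", col = "blue")
              # Cluster centre on top.
              panel.xyplot(x, y, type = "l", col = "red", lwd = 2)
            })
print(p)
```

Matching panels to clusters through `panel.number()` relies on the panels appearing in the order of the factor levels of `Cluster_fac`, which is the default when there is a single conditioning variable.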