The package hybridHclust
provides R functions for carrying out
hybrid hierarchical clustering using "mutual clusters".
Such clustering is useful when different levels of clustering resolution
are required.
For example, interest may focus both on tightly clustered small
sets of observations and larger, broad clusters.
The methodology is described in the paper by Chipman and Tibshirani, "Hybrid
hierarchical clustering with applications to microarray data" (Biostatistics, 2006).
There it is shown that bottom-up methods (such as hclust with single, average or
complete linkage) preserve mutual clusters, in the sense that each mutual
cluster is subdivided only after it is isolated as a cluster in the
dendrogram.
Bottom-up clustering, which successively joins objects, is good at identifying small clusters, but can provide sub-optimal performance for identifying a few large clusters. Conversely, top-down methods are good at identifying a few large clusters, but weaker at many small clusters. We seek to combine the strengths of both approaches, modifying top-down procedures with information gained from a preliminary bottom-up clustering.
The idea of a mutual cluster is used to produce a hybrid hierarchical clustering method. A top-down method is used, but with the constraint that mutual clusters must be preserved. This leads to a hierarchical clustering of the data in which the data are effectively split into a few large clusters (as is usually done by top-down methods) and detailed clustering structure is correctly identified (as is usually done by bottom-up methods).
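A mutual cluster is a group of points whose largest within-group distance is smaller than the smallest distance from any point in the group to any point outside it, so the group is unambiguous at every level of resolution. A tiny base-R checker for this property is sketched below; the function name is made up for illustration, and the package's mutualCluster function (used in the demo below) is what actually finds all such groups.

is.mutual.cluster <- function(idx, d){
  # idx: indices of the candidate group; d: full symmetric distance matrix
  # TRUE if every within-group distance is smaller than every distance
  # from a group member to a point outside the group
  max(d[idx, idx]) < min(d[idx, -idx])
}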
The package can be used to identify mutual clusters and to carry out hybrid
hierarchical clustering. It also provides a stand-alone top-down clustering
function based on tree-structured vector quantization (tsvq) and an
implementation of Eisen's clustering algorithm (eisenCluster).
To use hybridHclust you will first need R itself. Follow the instructions at
http://cran.r-project.org/mirrors.html under Download: pick a mirror closest
to you. You probably want a pre-compiled version; for Windows, choose the
Windows (95 or later) link, select base, and download SetupR.exe.
Installation takes 5-10 minutes.
R is a great package, and is worth knowing about!
Once R is installed, install the hybridHclust package. For Unix/Linux, download
the source package hybridHclust_1.0-0.tar.gz from the hybridHclust distribution
site and install it into a private library with

R CMD INSTALL -l mylib hybridHclust_1.0-0.tar.gz
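The -l mylib flag installs the package into a private library directory, which generally must already exist; if needed, it can be created first, for example from within R (a small sketch; "mylib" is just the directory name used in the command above):

dir.create("mylib", showWarnings = FALSE)   # create the private library directory if it is not already there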
For Windows, download the zip file from the hybridHclust distribution site,
start R, then pull down the Packages menu and select Install packages from
local zip file. A file chooser dialog will allow you to select the file
hybridHclust_0.1-0.zip that you just saved in the previous step.
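If a version of hybridHclust is available on CRAN for your R installation, it may also be possible to skip the manual download and install from within R (a hedged alternative; the instructions above only cover the manual route):

install.packages("hybridHclust", lib = "mylib")   # omit lib= to use the default library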
Now load hybridHclust. For Unix/Linux, start R and type

library(hybridHclust, lib.loc="mylib")

In Windows, while in R pull down the Packages menu and select Load package,
then select hybridHclust. In Windows you will also find it helpful to go to
the Misc menu and turn off buffered output; this forces R output to be
written to the console immediately.
You are now ready to use hybridHclust.
A demonstration of the analyses on this page can be run by typing

demo(hybridHclust)

in R once the library has been loaded.
First, load the package and data:
library(hybridHclust)
data(sorlie)
data(sorlielabels)

We will cluster 85 samples (columns, each with 456 genes), using a correlation distance measure defined as 1.0-correlation. This "correlation" distance, commonly used in clustering microarray data, is equivalent to squared Euclidean distance once each sample has been scaled to have mean 0 and sd 1.
x <- scale(sorlie)
dmat <- 1 - cor(x)
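As a quick check of the equivalence mentioned above (a small base-R sketch, not part of the original demo): after scaling, the squared Euclidean distance between two samples equals 2*(n-1) times their correlation distance, where n is the number of genes.

n <- nrow(x)                              # number of genes per sample
d.euclid2 <- as.matrix(dist(t(x)))^2      # squared Euclidean distances between samples
max(abs(d.euclid2 - 2*(n-1)*dmat))        # should be zero up to numerical error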
Currently the code relies on integer row and column names for the distance matrix:

dimnames(dmat) <- list(1:nrow(dmat), 1:nrow(dmat))

Next, identify the mutual clusters. The
plot=TRUE
option displays
a bottom-up dendrogram with the mutual clusters identified.
mc1 <- mutualCluster(distances=dmat, plot=TRUE)
print(mc1)

This generates the following plot (jpg), and output:
1 : 43 44
2 : 45 46
3 : 24 25
4 : 16 20
5 : 7 8
6 : 10 12
7 : 37 60
8 : 35 36
9 : 29 30
10 : 31 32 33 34
11 : 50 51
12 : 52 55
13 : 54 56
14 : 82 83
15 : 78 79
16 : 63 64
17 : 65 66

Note that the
mutualCluster
function can take as an input
either the original data matrix x
(in which case Euclidean
distance is used) or a distance matrix (as above). As indicated
previously, this demo uses a correlation-based distance, which is
equivalent to squared Euclidean distance after scaling each observation.
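The data-matrix form of the call might look like the sketch below; note that the argument name x and the convention that rows are the objects to be clustered are assumptions based on the wording above, not confirmed from the package documentation. Because mutual clusters depend only on the ordering of distances, and Euclidean distance on the scaled samples is a monotone function of the correlation distance used above, this call should recover the same mutual clusters.

# sketch only: argument name `x` and row-wise clustering are assumed
mc2 <- mutualCluster(x=t(x), plot=FALSE)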
Next, show distances between points belonging to mutual clusters:
get.distances(mc1)

The printed output below shows the distances between the points within each mutual cluster. In all cases except one, the mutual cluster is a pair of points.
[[1]]
          44
43 0.6573144

[[2]]
          46
45 0.6722525

[[3]]
          25
24 0.4642499

[[4]]
          20
16 0.5953205

[[5]]
          8
7 0.4090045

[[6]]
          12
10 0.4578701

[[7]]
          60
37 0.6275236

[[8]]
          36
35 0.4341527

[[9]]
          30
29 0.3366078

[[10]]
          31        32        33
32 0.2992227
33 0.1934151 0.2052296
34 0.2491986 0.2662368 0.1512860

[[11]]
          51
50 0.6134367

[[12]]
          55
52 0.7042852

[[13]]
          56
54 0.7609525

[[14]]
          83
82 0.5660623

[[15]]
          79
78 0.6416045

[[16]]
          64
63 0.5237544

[[17]]
          66
65 0.554218
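The output above lists the within-cluster distances. As a sanity check, the mutual-cluster property for the four-sample group {31, 32, 33, 34} can be confirmed directly from dmat: its largest within-group distance (0.2992227 above) must be smaller than its smallest distance to any other sample (a small base-R check, not part of the original demo):

idx <- c(31, 32, 33, 34)
max(dmat[idx, idx])    # largest within-group distance
min(dmat[idx, -idx])   # smallest distance from the group to any other sample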
Now carry out the hybrid clustering. Here x is transposed because we want to
cluster its columns, and hybridHclust clusters rows. The trace option
is used to provide reassurance that something is happening.
hyb1 <- hybridHclust(t(x), mc1, trace=TRUE)
plot(hyb1)
table(hyb=cutree(hyb1,5), sorlie=sorlielabels)

The
plot
command generates the following jpg plot. The last line above "cuts" the tree so
that 5 clusters are generated, and compares the resultant cluster labels
with the labels found in Sorlie et al. (2001). Note that some clusters
agree reasonably well (eg hyb2 <-> sorlie3; hyb3 <-> sorlie4) while others
disagree more (eg sorlie5 is mostly split across hyb4 and hyb5).
   sorlie
hyb  1  2  3  4  5
  1 14  9  0  1  0
  2  0  1 11  0  1
  3  0  1  1 13  2
  4  0  0  1  1 20
  5  0  0  0  0  9
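For a quick numerical summary of the agreement, one can ask which hybrid cluster holds the largest share of each Sorlie class (a small base-R sketch, not part of the original demo):

tab <- table(hyb=cutree(hyb1,5), sorlie=sorlielabels)
apply(tab, 2, which.max)                    # hybrid cluster capturing most of each Sorlie class
round(apply(tab, 2, max)/colSums(tab), 2)   # that share as a proportion of the class

For the table above this picks hybrid clusters 1, 1, 2, 3 and 4 for Sorlie classes 1-5, consistent with the correspondences noted above.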
For later comparison, also fit a standard bottom-up model:

h1 <- hclust(as.dist(dmat), method='average')

Next, fit a top-down model for the sake of comparison. We transpose x because
we want to cluster columns (ie samples) and tsvq will cluster rows by
default. Note that the trace=TRUE option traces out the steps as they are
executed. This can be reassuring when working on large problems.
t1 <- tsvq(t(x), K=nrow(t(x)), row.labs=1:nrow(t(x)), trace=TRUE)
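Since cutree works on the tsvq tree just as it does on the other trees in this demo, the top-down solution can also be compared directly with the published labels, mirroring the table shown above for the hybrid tree (a small sketch, not part of the original demo):

table(topdown=cutree(t1,5), sorlie=sorlielabels)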
We now carry out a comparative analysis of the three dendrograms produced above. First we put the trees into a single object, and then extract a set of nested partitions for each of the three trees. Comparisons will be made in terms of the similarity of partitions of equal size, and within-group sums of squares.

treelist <- vector('list',3)
treelist[[1]] <- list(tr=h1, name='Bottom-up')
treelist[[2]] <- list(tr=t1, name='Top-down')
treelist[[3]] <- list(tr=hyb1, name='Hybrid')

Four "helper functions" are defined:
make.tree.partitions <- function(treelist,prange){
  # use: make.tree.partitions(list(list(tr=hclust(dist(x)),name='hclust')),2:20)
  parts <- vector('list',length(treelist))
  thenames <- NULL
  for (i in 1:length(treelist)){
    parts[[i]] <- matrix(0,length(treelist[[i]]$tr$height)+1,max(prange))
    for (j in prange) {
      if (j==1) parts[[i]][,j] <- rep(1,nrow(parts[[i]]))
      else if (j==nrow(parts[[i]])) parts[[i]][,j] <- 1:nrow(parts[[i]])
      else parts[[i]][,j] <- cutree(treelist[[i]]$tr,j)
    }
    thenames <- c(thenames,treelist[[i]]$name)
  }
  list(parts,thenames,prange)
}

calc.partition.dists <- function(thelist,prange) {
  parts <- thelist[[1]]
  thenames <- thelist[[2]]
  #prange <- thelist[[3]]
  np <- length(parts)
  dtreemat <- matrix(0,np,np)
  dimnames(dtreemat) <- list(thenames,thenames)
  tdists <- NULL
  for (ii in prange){
    for (i in 1:(np-1)) for (j in (i+1):np)
      dtreemat[i,j] <- dtreemat[j,i] <- dpart(parts[[i]][,ii],parts[[j]][,ii])
    #print(as.dist(dtreemat))
    tdists <- cbind(tdists,as.dist(dtreemat))
  }
  tnames <- NULL
  for (i in 1:(np-1)) for (j in (i+1):np)
    tnames <- c(tnames,paste(thenames[i],thenames[j],sep=':'))
  rownames(tdists) <- tnames
  tdists
}

dpart <- function(p1,p2){
  # much more efficient version that uses cross-tab of node ids for two trees
  # p1 and p2 are vectors of cluster labels.
  # NOTE: the remainder of this function was lost when the page was converted
  # to text. The loop body below is a reconstruction, not the authors'
  # original code: it computes the proportion of pairs of observations on
  # which the two partitions disagree (one joins the pair, the other does
  # not), i.e. 1 minus the Rand index. Applied to the hybrid/Sorlie
  # cross-tabulation shown above it gives 0.1703081, matching the value
  # reported at the end of this demo.
  tt <- table(p1,p2)
  n <- length(p1)
  d <- 0
  for (i in 1:nrow(tt)) for (j in 1:ncol(tt)){
    nij <- tt[i,j]
    # pairs joined by one partition via cell (i,j) but split by the other
    d <- d + nij*(sum(tt[i,])-nij)/2 + nij*(sum(tt[,j])-nij)/2
  }
  d/choose(n,2)
}

Now use these helper functions to compare the resultant trees, via comparison of partitions and within-group sums of distances:

partitions <- make.tree.partitions(treelist,1:85)

Plot the Hamming distances between the partitions of the 3 trees, for various numbers of clusters:

whichplot <- 2:84
junk2 <- calc.partition.dists(partitions,whichplot)
par(mfrow=c(1,1))
matplot(whichplot,t(junk2),type='l',col=1,lwd=2,xlab='Number of clusters',
  ylab='dissimilarity between clusterings',ylim=c(0,max(junk2)),xlim=c(1,85),log='x')
matpoints(whichplot[1:9],t(junk2[,1:9]),pch=19,col=1)
legend(mean(whichplot)/10,max(junk2),dimnames(junk2)[[1]],
  lty=1:nrow(junk2),lwd=2)

The plot generated looks like this (jpg). Now plot within group distances vs number of clusters.
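The code below also calls calc.sum.dist, presumably the fourth helper function; its definition appears to have been lost when this page was converted to text. A minimal sketch consistent with how it is used below and with the plot's axis label ("within cluster sum of distances") might look like this; the exact form of the authors' original function is an assumption:

calc.sum.dist <- function(dmat, partsmat){
  # reconstruction (not the original code): column j of partsmat holds the
  # labels of a partition into j clusters; return, for each column, the total
  # of all pairwise distances between observations in the same cluster
  out <- rep(0, ncol(partsmat))
  for (j in 1:ncol(partsmat)){
    labs <- partsmat[, j]
    for (k in unique(labs)){
      idx <- which(labs == k)
      if (length(idx) > 1) out[j] <- out[j] + sum(as.dist(dmat[idx, idx]))
    }
  }
  out
}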
tot.dist.from.mean <- matrix(0, length(partitions[[2]]), 85)
for (i in 1:length(partitions[[2]])){
  tot.dist.from.mean[i,] <- calc.sum.dist(dmat, partitions[[1]][[i]])
}
par(mfrow=c(1,2))
mylty <- 1:6
mycol <- c('black','red','blue')
matplot(t(tot.dist.from.mean), type='l', xlab='number of clusters',
  ylab='within cluster sum of distances', col=mycol, lty=mylty)
legend(20, max(tot.dist.from.mean), partitions[[2]], lty=mylty[1:3], col=mycol[1:3])
relative <- tot.dist.from.mean
for (i in 1:ncol(relative)) relative[,i] <- relative[,i] - min(relative[,i])
matplot(t(relative), type='l', col=mycol, xlab='number of clusters',
  ylab='within cluster sum of distances (relative to min)', lty=mylty, lwd=1.5)
legend(20, max(relative), partitions[[2]], lty=mylty[1:3], col=mycol[1:3], lwd=1.5)

The plot generated looks like this (jpg). Comparison with the Sorlie labels:
for (i in 1:3) cat(treelist[[i]]$name,
  dpart(sorlielabels, cutree(treelist[[i]]$tr, 5)), '\n')

This produces the following text output:

Bottom-up 0.1327731
Top-down 0.1686275
Hybrid 0.1703081