Tutorial for hybridHclust

Hugh Chipman, Rob Tibshirani

1. Introduction

The package hybridHclust provides R functions for carrying out hybrid hierarchical clustering using "mutual clusters". Such clustering is useful when different levels of clustering resolution are required. For example, interest may focus both on tightly clustered small sets of observations and on larger, broad clusters.

The methodology is described in the paper: Chipman, H. and Tibshirani, R. (2006), "Hybrid hierarchical clustering with applications to microarray data", Biostatistics 7(2), 286-301.

2. The Basic idea

A mutual cluster is a set of points whose largest within-group distance is smaller than the distance to the nearest point outside the set. Many bottom-up hierarchical clustering methods (e.g. hclust with single, average or complete linkage) preserve mutual clusters, in the sense that each mutual cluster is subdivided only after it is isolated as a cluster in the dendrogram.
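
For example, the mutual-cluster condition can be checked directly from a distance matrix. The function below is an illustration only (it is not part of the package):

is.mutual.cluster <- function(idx,d){
  # do the points indexed by 'idx' form a mutual cluster under distance matrix 'd'?
  # i.e. is their largest within-group distance smaller than the smallest
  # distance from any point in the group to any point outside it?
  max(d[idx,idx]) < min(d[idx,-idx,drop=FALSE])
}
# toy example: four points on a line at 0, 1, 10, 11
d <- as.matrix(dist(c(0,1,10,11)))
is.mutual.cluster(c(1,2),d)  # TRUE:  diameter 1 < distance 9 to the nearest outside point
is.mutual.cluster(c(2,3),d)  # FALSE: diameter 9 > distance 1 from point 2 to point 1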

Bottom-up clustering, which successively joins objects, is good at identifying small clusters, but can provide sub-optimal performance for identifying a few large clusters. Conversely, top-down methods are good at identifying a few large clusters, but weaker at many small clusters. We seek to combine the strengths of both approaches, modifying top-down procedures with information gained from a preliminary bottom-up clustering.

The idea of a mutual cluster is used to produce a hybrid hierarchical clustering method. A top-down method is used, but with the constraint that mutual clusters must be preserved. This leads to a hierarchical clustering of the data in which the data are effectively split into a few large clusters (as is usually done by top-down methods) and detailed clustering structure is correctly identified (as is usually done by bottom-up methods).

The package can be used for several tasks: identifying mutual clusters (mutualCluster), examining the distances within mutual clusters (get.distances), carrying out hybrid hierarchical clustering (hybridHclust), and top-down clustering by tree-structured vector quantization (tsvq). These are illustrated in the demo below.

3. Installing hybridHclust

Users familiar with R can skip this section and simply install the package from CRAN.
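
From within R, installing from CRAN takes a single command:

  install.packages("hybridHclust")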

  1. You first need to install a recent version of the R statistical package. This is free, and can be found at The R Project.

    Follow the instructions: under Download, click on http://cran.r-project.org/mirrors.html and pick a mirror close to you. You probably want a pre-compiled version. For Windows, choose the Windows (95 or later) link, select base, and download SetupR.exe. Installation takes 5-10 minutes.

    R is a great package, and is worth knowing about!

  2. hybridHclust can be installed either directly from CRAN or can be downloaded and installed as described below.
  3. Download the appropriate Unix/Linux version or Windows version for your platform from the hybridHclust distribution site. Often, browsers offer you a choice of opening the file or saving it. Elect to save it and remember the name of the folder where you saved it.
  4. For Unix/Linux, type
    R CMD INSTALL -l mylib hybridHclust_1.0-0.tar.gz

  5. For Windows, start R, then pull down the Packages menu and select Install packages from local zip file. A file chooser dialog will allow you to select the zip file (hybridHclust_0.1-0.zip) that you saved in step 3.

The above concludes the installation process and needs to be done just once.

4. Using hybridHclust

For Unix/Linux, start R and type

library(hybridHclust,lib.loc="mylib")

In Windows, while in R, pull down the Packages menu item and select Load package, then select hybridHclust. In Windows you will also find it helpful to go to the Misc menu and turn off buffered output; this forces R output to be written to the console immediately.

You are now ready to use hybridHclust.

5. Demo of hybridHclust

Note: The R code in the demo below can be executed by typing demo(hybridHclust) in R once the library has been loaded.

First, load the package and data:

library(hybridHclust)
data(sorlie)
data(sorlielabels)
We will cluster 85 samples (columns, each with 456 genes), using a correlation distance defined as 1 - correlation. This "correlation" distance, commonly used in clustering microarray data, is proportional to squared Euclidean distance once each sample has been scaled to have mean 0 and sd 1.
x <- scale(sorlie) 
dmat <- 1-cor(x)
Currently the code relies on integer row and column names for the distance matrix.
dimnames(dmat) <- list(1:nrow(dmat),1:nrow(dmat))
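As a quick sanity check of the relationship noted above (an illustration only, not part of the demo), one can verify that 1 minus the correlation of two scaled samples equals their squared Euclidean distance divided by 2(n-1), where n is the number of genes:
i <- 1; j <- 2   # any two samples
d2 <- sum((x[,i]-x[,j])^2)                            # squared Euclidean distance
all.equal(1 - cor(x[,i],x[,j]), d2/(2*(nrow(x)-1)))   # TRUE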
Identify the mutual clusters. The plot=TRUE option displays a bottom-up dendrogram with the mutual clusters identified.
mc1 <- mutualCluster(distances=dmat,plot=TRUE)
print(mc1)
This displays the bottom-up dendrogram with the mutual clusters marked, and prints the following output:
1 : 43 44 
2 : 45 46 
3 : 24 25 
4 : 16 20 
5 : 7 8 
6 : 10 12 
7 : 37 60 
8 : 35 36 
9 : 29 30 
10 : 31 32 33 34 
11 : 50 51 
12 : 52 55 
13 : 54 56 
14 : 82 83 
15 : 78 79 
16 : 63 64 
17 : 65 66 
Note that the mutualCluster function can take as input either the original data matrix x (in which case Euclidean distance is used) or a distance matrix (as above). As indicated previously, this demo uses a correlation-based distance, which is proportional to squared Euclidean distance after each observation has been scaled.

Next, show distances between points belonging to mutual clusters:

get.distances(mc1)
The printed output below shows the pairwise distances between the points in each mutual cluster. In all cases except one, the mutual cluster is a pair of points.
[[1]]
          44
43 0.6573144

[[2]]
          46
45 0.6722525

[[3]]
          25
24 0.4642499

[[4]]
          20
16 0.5953205

[[5]]
          8
7 0.4090045

[[6]]
          12
10 0.4578701

[[7]]
          60
37 0.6275236

[[8]]
          36
35 0.4341527

[[9]]
          30
29 0.3366078

[[10]]
          31        32        33
32 0.2992227                    
33 0.1934151 0.2052296          
34 0.2491986 0.2662368 0.1512860

[[11]]
          51
50 0.6134367

[[12]]
          55
52 0.7042852

[[13]]
          56
54 0.7609525

[[14]]
          83
82 0.5660623

[[15]]
          79
78 0.6416045

[[16]]
          64
63 0.5237544

[[17]]
         66
65 0.554218

Hybrid hierarchical clustering:

Fit the hybrid hierarchical model. The matrix x is transposed because we want to cluster its columns, and hybridHclust clusters rows. The trace option is used to provide reassurance that something is happening.
hyb1 <- hybridHclust(t(x),mc1,trace=TRUE)
plot(hyb1)
table(hyb=cutree(hyb1,5),sorlie=sorlielabels)
The plot command displays the hybrid dendrogram. The last line above "cuts" the tree so that 5 clusters are produced, and compares the resulting cluster labels with the labels found in Sorlie et al. (2001). Note that some clusters agree reasonably well (e.g. hyb 2 <-> sorlie 3; hyb 3 <-> sorlie 4) while others disagree more (e.g. sorlie 5 is mostly split across hyb 4 and hyb 5).
   sorlie
hyb  1  2  3  4  5
  1 14  9  0  1  0
  2  0  1 11  0  1
  3  0  1  1 13  2
  4  0  0  1  1 20
  5  0  0  0  0  9

Comparison of different hierarchical clustering methods

Calculate and store the bottom-up clustering object using distances found in "dmat". We'll use this later for a comparison of top-down, bottom-up, and hybrid methods.
h1 <- hclust(as.dist(dmat),method='average')
Fit a top-down model for the sake of comparison. We transpose x because we want to cluster columns (i.e. samples) and tsvq clusters rows by default. Note that the trace=TRUE option traces out the steps as they are executed; this can be reassuring when working on large problems.
t1 <- tsvq(t(x),K=nrow(t(x)),row.labs=1:nrow(t(x)),trace=TRUE)
Comparative analysis of the three dendrograms produced above. First we put the trees into a single object, and then extract a set of nested partitions for each of the three trees. Comparisons will be made in terms of the similarity of partitions of equal size, and within-group sums of distances.
treelist <- vector('list',3)
treelist[[1]] <- list(tr=h1,name='Bottom-up')
treelist[[2]] <- list(tr=t1,name='Top-down')
treelist[[3]] <- list(tr=hyb1,name='Hybrid')
Four "helper functions" are defined:
make.tree.partitions <- function(treelist,prange){
# use: make.tree.partitions(list(list(tr=hclust(dist(x)),name='hclust')),2:20)
	parts <- vector('list',length(treelist))
	thenames <- NULL
	for (i in 1:length(treelist)){
  	parts[[i]] <- matrix(0,length(treelist[[i]]$tr$height)+1,max(prange))
  	for (j in prange) {
			if (j==1) parts[[i]][,j] <- rep(1,nrow(parts[[i]]))
			else if (j==nrow(parts[[i]])) parts[[i]][,j] <- 1:nrow(parts[[i]])
			else parts[[i]][,j] <- cutree(treelist[[i]]$tr,j)
		}
  	thenames <- c(thenames,treelist[[i]]$name)
	}
	list(parts,thenames,prange)
}
 
calc.partition.dists <- function(thelist,prange)
{
	parts <- thelist[[1]]
	thenames <- thelist[[2]]
	#prange <- thelist[[3]]
	np <- length(parts)
	dtreemat <- matrix(0,np,np)
	dimnames(dtreemat) <- list(thenames,thenames)
	tdists <- NULL
	for (ii in prange){
	for (i in 1:(np-1)) for (j in (i+1):np)
  	dtreemat[i,j] <- dtreemat[j,i] <-
    	dpart(parts[[i]][,ii],parts[[j]][,ii])
  	#print(as.dist(dtreemat))
  	tdists <- cbind(tdists,as.dist(dtreemat))
	}
	tnames <- NULL
	for (i in 1:(np-1)) for (j in (i+1):np)
  	tnames <- c(tnames,paste(thenames[i],thenames[j],sep=':'))
	rownames(tdists) <- tnames
	tdists
}

dpart <- function(p1,p2){
	# much more efficient version that uses cross-tab of node ids for two trees
	# p1 and p2 are vectors of cluster labels.
	# The body below is one reasonable implementation: the proportion of point
	# pairs that are grouped together in one partition but apart in the other.
	tt <- table(p1,p2)
	n <- sum(tt)
	same1 <- sum(rowSums(tt)^2)   # ordered pairs (incl. self) together under p1
	same2 <- sum(colSums(tt)^2)   # ordered pairs (incl. self) together under p2
	both <- sum(tt^2)             # ordered pairs (incl. self) together under both
	(same1 + same2 - 2*both)/(n*(n-1))
}
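
The fourth helper, calc.sum.dist, computes the within-cluster sum of distances for each partition size. A sketch consistent with how it is called below (the exact definition may differ) is:

calc.sum.dist <- function(dmat,partmat){
	# for each column of 'partmat' (a partition of the rows of 'dmat'), sum over
	# clusters the within-cluster pairwise distances divided by the cluster size;
	# for squared Euclidean distances this equals the sum of distances from each
	# point to its cluster mean
	out <- numeric(ncol(partmat))
	for (k in 1:ncol(partmat)){
		labs <- partmat[,k]
		tot <- 0
		for (g in unique(labs)){
			idx <- which(labs==g)
			if (length(idx)>1) tot <- tot + sum(dmat[idx,idx])/(2*length(idx))
		}
		out[k] <- tot
	}
	out
}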

Now use these helper functions to compare the resulting trees, via comparison of partitions and via within-group sums of distances.
partitions <- make.tree.partitions(treelist,1:85)
Plot the Hamming distances between the partitions of the 3 trees, for various numbers of clusters:
whichplot <- 2:84
junk2 <- calc.partition.dists(partitions,whichplot)
par(mfrow=c(1,1))
matplot(whichplot,t(junk2),type='l',col=1,lwd=2,xlab='Number of clusters',
  ylab='dissimilarity between clusterings',ylim=c(0,max(junk2)),xlim=c(1,85)
  ,log='x')
matpoints(whichplot[1:9],t(junk2[,1:9]),pch=19,col=1)
legend(mean(whichplot)/10,max(junk2),dimnames(junk2)[[1]],
  lty=1:nrow(junk2),lwd=2)
The resulting plot shows the dissimilarity between each pair of clusterings as a function of the number of clusters.

Now plot within-group distances versus the number of clusters.

tot.dist.from.mean <- matrix(0,length(partitions[[2]]),85)
for (i in 1:length(partitions[[2]])){
    tot.dist.from.mean[i,] <- calc.sum.dist(dmat,partitions[[1]][[i]])
}
par(mfrow=c(1,2))
mylty<-1:6
mycol <- c('black','red','blue')
matplot(t(tot.dist.from.mean),type='l',xlab='number of clusters',
  ylab='within cluster sum of distances',col=mycol,lty=mylty)
legend(20,max(tot.dist.from.mean),partitions[[2]],lty=mylty[1:3],col=mycol[1:3])
relative <- tot.dist.from.mean
for (i in 1:ncol(relative)) 
  relative[,i] <- relative[,i] - min(relative[,i])
matplot(t(relative),type='l',col=mycol,xlab='number of clusters',
  ylab='within cluster sum of distances (relative to min)',lty=mylty,lwd=1.5)
legend(20,max(relative),partitions[[2]],lty=mylty[1:3],col=mycol[1:3],lwd=1.5)

The resulting plots show the within-cluster sum of distances, and the same quantity relative to the minimum across methods, as a function of the number of clusters.

Comparison with the Sorlie labels:

for (i in 1:3) 
  cat(treelist[[i]]$name, dpart(sorlielabels,cutree(treelist[[i]]$tr,5)),'\n')
This produces the following text output:
Bottom-up 0.1327731 
Top-down 0.1686275 
Hybrid 0.1703081