1 mtcars - Motor Trend Car Road Tests
2 iris - Edgar Anderson’s Iris Data
3 New York Air Quality Measurements (airquality)
4 ALL - Gentleman et al. 2004
5 votes.repub - Votes for Republican Candidate in Presidential Elections
- 5.0.1 Background
- 5.0.2 Heatmap
6 animals - Attributes of Animals
- 6.0.1 Background
- 6.0.2 Heatmap
7 sessionInfo

library(heatmaply)
library(dendextend)

1 mtcars - Motor Trend Car Road Tests

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

mtcars2 <- datasets::mtcars
mtcars2$am <- factor(mtcars2$am)
mtcars2$gear <- factor(mtcars2$gear)
mtcars2$vs <- factor(mtcars2$vs)

library(heatmaply)
heatmaply(percentize(mtcars2), 
          xlab = "Features", ylab = "Cars", 
          main = "Motor Trend Car Road Tests",
          k_col = 2, k_row = NA,
          margins = c(60,100,40,20) )

For visualizing the correlation matrix we wish to use divergent color palette as well as set the limits.

library(heatmaply)

heatmaply(cor(mtcars), margins = c(40, 40, 0, 0),
          k_col = 2, k_row = 2,
          colors = BrBG,
          limits = c(-1,1))

2 iris - Edgar Anderson’s Iris Data

The famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. (from ?iris)

The Iris flower data set is fun for learning supervised classification algorithms, and is known as a difficult case for unsupervised learning. Since the values represent length it makes sense to have them start at 0 and end a bit above the highest value in the data-set (which is 7.9).

Notice the use of find_dend and seriate_dendrogram to find the “best” linkage function. Looking at the performance of dend_expend shows that complete gets a cophenetic correlation of 0.72 while the best option (average) gets 0.876.

iris <- datasets::iris

library(heatmaply)
library(dendextend)

iris2 <- iris[,-5]
rownames(iris2) <- 1:150
iris_dist <- iris2 %>% dist 
dend <- iris_dist %>% find_dend %>% seriate_dendrogram(., iris_dist)
dend_expend(iris_dist)$performance

#>   dist_methods hclust_methods     optim
#> 1      unknown         ward.D 0.8638236
#> 2      unknown        ward.D2 0.8728283
#> 3      unknown         single 0.8638787
#> 4      unknown       complete 0.7269857
#> 5      unknown        average 0.8769561
#> 6      unknown       mcquitty 0.8679766
#> 7      unknown         median 0.8622006
#> 8      unknown       centroid 0.8746715

heatmaply(iris, limits = c(0,8),
            xlab = "Lengths", ylab = "Flowers", 
            main = "Edgar Anderson's Iris Data",
          Rowv = dend,
          margins = c(85, 40),
          grid_gap = 0.2, k_row = 3)

#> Warning: 'scatter' objects don't have these attributes: 'xgap', 'ygap'
#> Valid attributes include:
#> 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'hoverinfo', 'hoverlabel', 'stream', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'hovertext', 'mode', 'hoveron', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'textposition', 'textfont', 'r', 't', 'error_y', 'error_x', 'xaxis', 'yaxis', 'xcalendar', 'ycalendar', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule'

3 New York Air Quality Measurements (airquality)

Daily air quality measurements in New York, May to September 1973.

The plot shows us that most missing values are in the Ozone variable, and the month distribution (available in the row-side notation) shows that most missing values are from June. Notice the use of grid_gap and grey colors in order to aid in the visualization.

library(heatmaply)


airquality2 <- datasets::airquality
airquality2[,c(1:4,6)] <- is.na10(airquality2[,c(1:4,6)])
airquality2[,5] <- factor(airquality2[,5])

heatmaply(airquality2, grid_gap = 1,
            xlab = "Features", ylab = "Days", 
            main = "Missing values in 'New York Air Quality Measurements'",
            k_col =3, k_row = 3,
            margins = c(55, 30),
            colors = c("grey80", "grey20"))

#> Warning: 'scatter' objects don't have these attributes: 'xgap', 'ygap'
#> Valid attributes include:
#> 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'hoverinfo', 'hoverlabel', 'stream', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'hovertext', 'mode', 'hoveron', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'textposition', 'textfont', 'r', 't', 'error_y', 'error_x', 'xaxis', 'yaxis', 'xcalendar', 'ycalendar', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule'

# warning - using grid_color cannot handle a large matrix!
# airquality[1:10,] %>% is.na10 %>% 
#   heatmaply(color = c("white","black"), grid_color = "grey",
#             k_col =3, k_row = 3,
#             margins = c(40, 50)) 
# airquality %>% is.na10 %>% 
#   heatmaply(color = c("grey80", "grey20"), # grid_color = "grey",
#             k_col =3, k_row = 3,
#             margins = c(40, 50)) 
#

4 ALL - Gentleman et al. 2004

4.1 Background

This document uses R to analyse an Acute lymphocytic leukemia (ALL) microarray data-set, producing a heatmap (with dendrograms) of genes deferentially expressed between two types of leukemia. The creation of the data and code for static figures is based on the code available from here.

The original citation for the raw data is “Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival” by Chiaretti et al. Blood 2004. (PMID: 14684422). This document demonstrates the recreation of Figure 2 Heatmap from the paper Bioconductor: open software development for computational biology and bioinformatics, Gentleman et al. 2004..

4.2 Static cluster heatmap

4.2.1 Getting the data

# Get needed packages:
if(!require("ALL")) {
  source("http://www.bioconductor.org/biocLite.R")
  biocLite("ALL")
}
if(!require("limma")) {
  source("http://www.bioconductor.org/biocLite.R")
  biocLite("limma")
}



library("ALL")
data("ALL")
eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")]
library("limma")
f <- factor(as.character(eset$mol.biol))
design <- model.matrix(~f)
fit <- eBayes(lmFit(eset,design))
selected  <- p.adjust(fit$p.value[, 2]) <0.05
esetSel <- eset [selected, ]
color.map <- function(mol.biol) { if (mol.biol=="ALL1/AF4") "#FF0000" else "#0000FF" }
patientcolors <- unlist(lapply(esetSel$mol.bio, color.map))
hm_data <- exprs(esetSel)

4.2.2 stats::heatmap

The colors are a bit off compared with the original plot, but they are pretty close.

heatmap(hm_data, col=topo.colors(100), ColSideColors=patientcolors)

4.2.3 gplots::heatmap.2

Here also, the colors are a bit off compared with the original plot, but they are pretty close.

library("gplots")
heatmap.2(hm_data, col=topo.colors(100), scale="row", ColSideColors=patientcolors,
          key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.5)

4.3 Interactive cluster heatmap

4.3.1 heatmaply - replicating gplots::heatmap.2

Several slight changes need to be made. We should use color instead of col, also the seriate parameter should use “mean” and the margin parameter needs to be set. But once done, the results are very similar.

library(heatmaply)


heatmaply(hm_data, color=topo.colors(100), ColSideColors=patientcolors, 
          seriate = "mean", scale="row", margin = c(65,120,10,10))

#> Warning in heatmaply.heatmapr(hm, colors = colors, limits = limits,
#> scale_fill_gradient_fun = scale_fill_gradient_fun, : The hover text for
#> col_side_colors is currently not implemented (due to an issue in plotly).
#> We hope this would get resolved in future releases.

        # %>% layout(autosize = F, width = 500, height = 500)

4.3.2 Using heatmaply’s defaults

The heatmaply package tries to offer better defaults.

Instead of topo.colors (or the default “heat.colors” in heatmap.2), heatmaply uses the superior viridis color palette:

library(heatmaply)

heatmaply(hm_data, ColSideColors=patientcolors, 
          seriate = "mean", scale="row", margin = c(65,120,10,10))

#> Warning in heatmaply.heatmapr(hm, colors = colors, limits = limits,
#> scale_fill_gradient_fun = scale_fill_gradient_fun, : The hover text for
#> col_side_colors is currently not implemented (due to an issue in plotly).
#> We hope this would get resolved in future releases.

Instead of ordering based on the mean, heatmaply uses optimal-leaf-order, and the branches of the dendrograms can be colors to highlight a pre-defined number of assumed clusters:

library(heatmaply)

heatmaply(hm_data, ColSideColors=patientcolors, 
          fontsize_row = 5,
          scale="row", margin = c(65,120,10,10), 
          k_col = 2, k_row = 5)

#> Warning in heatmaply.heatmapr(hm, colors = colors, limits = limits,
#> scale_fill_gradient_fun = scale_fill_gradient_fun, : The hover text for
#> col_side_colors is currently not implemented (due to an issue in plotly).
#> We hope this would get resolved in future releases.

heatmaply(hm_data, ColSideColors=patientcolors, 
          fontsize_row = 5,
          scale="row", margin = c(50,50,10,10), 
          row_dend_left = TRUE, plot_method = "plotly",
          k_col = 2, k_row = 5)

5 votes.repub - Votes for Republican Candidate in Presidential Elections

5.0.1 Background

This is a data frame with the percentage of votes given to the republican candidate in presidential elections from 1856 to 1976. Rows represent the 50 states, and columns the 31 elections.

Source: S. Peterson (1973): A Statistical History of the American Presidential Elections. New York: Frederick Ungar Publishing Co. Data from 1964 to 1976 is from R. M. Scammon, American Votes 12, Congressional Quarterly.

Define variables:

votes.repub <- cluster::votes.repub

These data can be visualized using a (costumed made) parallel coordinates plot:

years <- as.numeric(gsub("X", "", colnames(votes.repub)))

par(las = 2, mar = c(4.5, 3, 3, 2) + 0.1, cex = .8)
# MASS::parcoord(votes.repub, var.label = FALSE, lwd = 1)
matplot(1L:ncol(votes.repub), t(votes.repub), type = "l", col = 1, lty = 1,
        axes = F, xlab = "", ylab = "")
axis(1, at = seq_along(years), labels = years)
axis(2)
# Add Title
title("Votes for Republican Candidate\n in Presidential Elections \n (each line is a country - over the years)")

5.0.2 Heatmap

This is a nice example when the parallel coordinates plot has some serious limitations: it does not help us detect the states, we fail to see the missing value patterns, and it is tricky to see clusters in general (due to the large number of threads).

For these data, it can be quite helpful to see a heatmap of the votes across the years. The ordering of the rows is tricky. First, the distance of the vectors (later used for the clustering) should be done after transformation (since we are dealing with proportion of votes). In this case, I used the arcsin transformation (a logit transformation could also work, but the arcsin is safer for dealing with 0/1 observations). But given the clusters, we wish to order the leaves (as much as possible), in order to take into account the missing value clusterings. So we, in fact, have two clusters, one for the raw values, and another for the “shadow matrix” (i.e.: the matrix with 0/1, indicating if a value was missing or not).

# votes.repub[is.na(votes.repub)] <- 50

library(heatmaply)

heatmaply(votes.repub, 
          margins = c(60,150,110,10),
          k_row = NA,
          limits = c(0,100),
          main = "Votes for\n Republican Presidential Candidate\n (clustered using complete)",
          srtCol = 60,
          dendrogram = "row",
          ylab = "% Votes for Republican\n Presidential Candidate",
          colors = colorspace::diverge_hcl
         )

          # RowSideColors = rev(labels_colors(dend)), # to add nice colored strips

6 animals - Attributes of Animals

6.0.1 Background

This data set considers 6 binary attributes for 20 animals.

see Struyf, Hubert & Rousseeuw (1996), in agnes.