library(heatmaply)
library(dendextend)
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
mtcars2 <- datasets::mtcars
mtcars2$am <- factor(mtcars2$am)
mtcars2$gear <- factor(mtcars2$gear)
mtcars2$vs <- factor(mtcars2$vs)
library(heatmaply)
heatmaply(percentize(mtcars2),
          xlab = "Features", ylab = "Cars",
          main = "Motor Trend Car Road Tests",
          k_col = 2, k_row = NA,
          margins = c(60, 100, 40, 20))
To visualize a correlation matrix, we want a divergent color palette and limits fixed at the range of a correlation, from -1 to 1.
library(heatmaply)
heatmaply(cor(mtcars), margins = c(40, 40, 0, 0),
          k_col = 2, k_row = 2,
          colors = BrBG,
          limits = c(-1, 1))
The famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. (from ?iris)
The Iris flower data set is fun for learning supervised classification algorithms, and is known as a difficult case for unsupervised learning. Since the values represent lengths, it makes sense to have the limits start at 0 and end a bit above the highest value in the data set (which is 7.9).
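The choice of limits described above can be checked with a few lines of base R. This is only a minimal sketch: the variable names (`measurements`, `heat_limits`) are illustrative and not part of heatmaply, and rounding the maximum up to the next whole number is one assumed way to get a value "a bit above" the largest measurement.

```r
# Base R only; uses the built-in iris data.
measurements <- as.matrix(datasets::iris[, -5])  # drop the Species factor

# Smallest and largest measurement across all four columns.
value_range <- range(measurements)
value_range

# Assumption: start at 0 and round the maximum up to the next whole number.
heat_limits <- c(0, ceiling(value_range[2]))
heat_limits
```

The resulting vector has the form expected by the `limits` argument used in the heatmaply call for this data set.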
Notice the use of find_dend and seriate_dendrogram to find the “best” linkage function. The performance table returned by dend_expend shows that complete linkage gets a cophenetic correlation of 0.727, while the best option (average) gets 0.877.
iris <- datasets::iris
library(heatmaply)
library(dendextend)
iris2 <- iris[,-5]
rownames(iris2) <- 1:150
iris_dist <- iris2 %>% dist
dend <- iris_dist %>% find_dend %>% seriate_dendrogram(., iris_dist)
dend_expend(iris_dist)$performance
#>   dist_methods hclust_methods     optim
#> 1      unknown         ward.D 0.8638236
#> 2      unknown        ward.D2 0.8728283
#> 3      unknown         single 0.8638787
#> 4      unknown       complete 0.7269857
#> 5      unknown        average 0.8769561
#> 6      unknown       mcquitty 0.8679766
#> 7      unknown         median 0.8622006
#> 8      unknown       centroid 0.8746715
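To make the comparison of linkage functions more concrete, the following sketch computes a cophenetic correlation for each hclust method using only base R. It assumes that the optim column reported by dend_expend is the correlation between the original distances and the cophenetic distances of each tree, as the text describes; the names `linkage_methods` and `coph_cor` are illustrative and do not come from dendextend.

```r
# Base R (stats) only; no heatmaply or dendextend needed for this sketch.
iris_dist <- dist(datasets::iris[, -5])

linkage_methods <- c("ward.D", "ward.D2", "single", "complete",
                     "average", "mcquitty", "median", "centroid")

# For each linkage method, correlate the original distances with the
# cophenetic distances implied by the resulting tree.
coph_cor <- sapply(linkage_methods, function(m) {
  hc <- hclust(iris_dist, method = m)
  cor(iris_dist, cophenetic(hc))
})

# Methods ordered from highest to lowest cophenetic correlation.
round(sort(coph_cor, decreasing = TRUE), 3)
```

A higher value means the tree preserves the pairwise distances more faithfully, which is the sense in which one linkage function is called “best” here.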
heatmaply(iris, limits = c(0, 8),
          xlab = "Lengths", ylab = "Flowers",
          main = "Edgar Anderson's Iris Data",
          Rowv = dend,
          margins = c(85, 40),
          grid_gap = 0.2, k_row = 3)
#> Warning: 'scatter' objects don't have these attributes: 'xgap', 'ygap'
#> Valid attributes include:
#> 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'hoverinfo', 'hoverlabel', 'stream', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'hovertext', 'mode', 'hoveron', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'textposition', 'textfont', 'r', 't', 'error_y', 'error_x', 'xaxis', 'yaxis', 'xcalendar', 'ycalendar', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule'
Daily air quality measurements in New York, May to September 1973.
The plot shows us that most missing values are in the Ozone variable, and the month distribution (available in the row-side annotation) shows that most missing values are from June. Notice the use of grid_gap and grey colors to aid the visualization.
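The two observations above can also be checked numerically. The sketch below builds a 1/0 missingness indicator with base R, which is assumed to be similar in spirit to what is.na10 produces, and then counts missing values per variable and incomplete rows per month. All variable names are illustrative.

```r
# Base R only; uses the built-in airquality data.
aq <- datasets::airquality

# 1 where a value is missing, 0 where it is observed.
na_indicator <- is.na(aq) * 1

# Number of missing values in each variable.
na_per_column <- colSums(na_indicator)
na_per_column

# Number of rows with at least one missing value, split by month.
rows_with_na <- rowSums(na_indicator) > 0
na_rows_per_month <- table(aq$Month[rows_with_na])
na_rows_per_month
```

Counting rows with at least one missing value is one reasonable reading of "missing values per month"; counting individual missing cells instead would be an equally valid variant.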
library(heatmaply)
airquality2 <- datasets::airquality
airquality2[,c(1:4,6)] <- is.na10(airquality2[,c(1:4,6)])
airquality2[,5] <- factor(airquality2[,5])
heatmaply(airquality2, grid_gap = 1,
          xlab = "Features", ylab = "Days",
          main = "Missing values in 'New York Air Quality Measurements'",
          k_col = 3, k_row = 3,
          margins = c(55, 30),
          colors = c("grey80", "grey20"))
#> Warning: 'scatter' objects don't have these attributes: 'xgap', 'ygap'
#> Valid attributes include:
#> 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'hoverinfo', 'hoverlabel', 'stream', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'hovertext', 'mode', 'hoveron', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'textposition', 'textfont', 'r', 't', 'error_y', 'error_x', 'xaxis', 'yaxis', 'xcalendar', 'ycalendar', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule'
# warning - using grid_color cannot handle a large matrix!
# airquality[1:10,] %>% is.na10 %>%
# heatmaply(color = c("white","black"), grid_color = "grey",
# k_col =3, k_row = 3,
# margins = c(40, 50))
# airquality %>% is.na10 %>%
# heatmaply(color = c("grey80", "grey20"), # grid_color = "grey",
# k_col =3, k_row = 3,
# margins = c(40, 50))
This document uses R to analyse an acute lymphocytic leukemia (ALL) microarray data set, producing a heatmap (with dendrograms) of genes differentially expressed between two types of leukemia. The creation of the data and code for static figures is based on the code available from here.
The original citation for the raw data is “Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival” by Chiaretti et al., Blood 2004 (PMID: 14684422). This document demonstrates the recreation of the heatmap in Figure 2 of the paper “Bioconductor: open software development for computational biology and bioinformatics” by Gentleman et al. 2004.
# Get needed packages (installed through BiocManager, which replaced the retired biocLite script):
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!require("ALL")) {
  BiocManager::install("ALL")
}
if (!require("limma")) {
  BiocManager::install("limma")
}
library("ALL")
data("ALL")
eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")]
library("limma")
f <- factor(as.character(eset$mol.biol))
design <- model.matrix(~f)
fit <- eBayes(lmFit(eset,design))
selected <- p.adjust(fit$p.value[, 2]) < 0.05
esetSel <- eset[selected, ]
color.map <- function(mol.biol) { if (mol.biol == "ALL1/AF4") "#FF0000" else "#0000FF" }
patientcolors <- unlist(lapply(esetSel$mol.biol, color.map))
hm_data <- exprs(esetSel)
The colors are a bit off compared with the original plot, but they are pretty close.
heatmap(hm_data, col = topo.colors(100), ColSideColors = patientcolors)
Here also, the colors are a bit off compared with the original plot, but they are pretty close.
library("gplots")
heatmap.2(hm_data, col = topo.colors(100), scale = "row", ColSideColors = patientcolors,
          key = TRUE, symkey = FALSE, density.info = "none", trace = "none", cexRow = 0.5)
Several slight changes need to be made: we should use color instead of col, the seriate parameter should be set to “mean”, and the margin parameter needs to be set. Once that is done, the results are very similar.
library(heatmaply)
heatmaply(hm_data, color = topo.colors(100), ColSideColors = patientcolors,
          seriate = "mean", scale = "row", margin = c(65, 120, 10, 10))
#> Warning in heatmaply.heatmapr(hm, colors = colors, limits = limits,
#> scale_fill_gradient_fun = scale_fill_gradient_fun, : The hover text for
#> col_side_colors is currently not implemented (due to an issue in plotly).
#> We hope this would get resolved in future releases.