Last updated: 2018-07-11

workflowr checks: (Click a bullet for more information)

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180618)

The command set.seed(20180618) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 822cb72
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
```
Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/tutorial_cache/
    Ignored:    data/.DS_Store
```
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
Rmd	822cb72	tk382	2018-07-11	wflow_publish(c(“analysis/tutorial.Rmd”, “analysis/index.Rmd”))
html	681825a	tk382	2018-07-11	Build site.
Rmd	8c55246	tk382	2018-07-11	add tutorial section
html	8c55246	tk382	2018-07-11	add tutorial section

Load the SLSL library.

#library(SLSL)
library(inline)
library(matrixStats)
library(quadprog)
library(irlba)
library(ggplot2)
library(dplyr)
library(reshape)
library(caret)
library(fossil)
library(pracma)
library(igraph)
library(Rtsne)
library(gplots)
library(broom)
library(abind)
library(stargazer)
library(scatterplot3d)
library(diceR)
library(parallel)

set.seed(1)
R.utils::sourceDirectory("~/Dropbox/TaeProject/VariableError/SCNoisyClustering/R/", modifiedOnly=FALSE)
Rcpp::sourceCpp('~/Dropbox/TaeProject/VariableError/SCNoisyClustering/src/SCNoisyClustering.cpp')
  
heat = function(S){
  ggplot(melt(S), aes(x=X1, y=X2, fill=value)) + geom_tile() +
    scale_fill_gradient2() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("") + ylab("")
}

Small, good data sets

First, prepare your data set. It should be a normalized count matrix in units such as TPM, CPM, RPKM, or FPKM. True label data is available for the sample data Yan et al. (L. Yan et al. 2013), and we will evaluate the performance at the end as an example, but the label information is not expected for most cases.

Read Data

load('../data/Yan.rda')
X         = as.matrix(yan)
truelabel = as.character(ann$cell_type1)
numClust  = 6
rm(ann, yan)

The data should look something like this.

X[1:4,1:4]

         Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM. Zygote..1.RPKM.
C9orf152             0.0             0.0             0.0             0.0
RPS11             1219.9          1021.1           931.6           875.4
ELMO2                7.0            12.2             9.3            52.1
CREB3L1              1.0             1.4             1.9             1.8

\(X\) must be a matrix with dimension \(p \times n\) where \(p\) is the gene count and \(n\) is the cell count. Each cell is located in each column, and each gene is located in each row. The sample data has the gene names as the row names, but this is not required unless you plan to use the reference panel. The column names are also not required because the package assumes the cell names contain no information about the cell subtypes.

Parameter Set-Up

The SLSL wrapper function can be used like below. Below are the explanations of the parameters available for you to fine-tune.

First, you can specify the number of clusters. The default is “NA”, and SLSL will decide this for you based on the eigenvalues of the learned similarity matrix.

numClust = length(unique(truelabel)) #4

A knn parameter \(k\) determines how many neighbors the algorithm uses to account for the local structure. The default is k = NA and algorithm will automatically assign it as max(10, number of cells / 20). This parameter is used in building the kernel matrices and in network diffusion.

k = NA

Then decide if you’d like to filter the genes. For each gene, SLSL counts the number of cells that have zero count, and if the proportion of such cells is higher than filter_p1 or lower than filter_p2, such gene is removed. This does not influence the clustering result much, unless filter_p1 is too low or filter_p2 is too high, and reduces the computational burden of the algorithm. The default choice of 0.9 and 0 reasonably selects the genes that do not hold much information for clustering and remove them.

filter = TRUE #or FALSE
filter_p1 = 0.9 #any value 0 and 1, but something between 0.85 - 0.95 is recommended
filter_p2 = 0 #any value between 0 and 1, but something less than 0.1 is recommended

Hicks 2017 showed that a large amount of cell-to-cell variation comes from the detection rate, or the proportion of genes that have greater than zero count. For example, first principal component of the data is often correlated with the zero counts of each cell.

Hicks explains that log-transforming the datawith pseudo-counts introduces the bias because the mean counts are no longer the same across the cells.

The first singular vector is plotted below against the detection rate. It is shown that after centering each column of the log transformed matrix removes part of the correlation between the detection rate and the first singular vector. This correction influences the Euclidean distance matrix, and therefore f

library(irlba)
v = irlba(log(X+1), 1)$v[,1]
det = 1-colSums(X==0)/nrow(X)
plot(v ~ det)

Expand here to see past versions of pca-1.png:

Version	Author	Date
8c55246	tk382	2018-07-11

newX = scale(log(X+1), scale=F)
newv = irlba(newX, 1)$v[,1]
plot(newv~det)

Expand here to see past versions of pca-2.png:

Version	Author	Date
8c55246	tk382	2018-07-11

One way to correct this is to regress out the zero counts of each cell. However, as shown above, some data sets have non-linear relationship with the zero counts, and sometimes it’s better to transform the data accordingly.

When correct_detection_rate parameter is set to TRUE, the residuals after regressing out the zero counts are used instead of the log of counts.

correct_detection_rate = TRUE #TRUE or FALSE

Next argument is kernel_type. This determines the method to meaure the cell to cell distance. SLSL algorithm combines information from many distance matrices computed from different kernel parameters, where the default setting uses 18 different combinations of the kernel parameters. When kernel_type is specified as one of “spearman”, “pearson”, or “euclidean”, SLSL will learn similarity matrix based on those 18 kernel matrices using the specified distance masure. When kernel_type is specified as “combined” (default setting), SLSL will compute three distance matrices using all three distance measures and \(18 \times 3\) kernel matrices to learn the similarity matrix. In other words, kernel_type = “combined” requires more memory and computational power, but it will reach higher accuracy by accounting for complex cell to cell variability.

kernel_type = "combined" #or spearman or pearson or euclidean

The user can also specify the kernel parameters. Denoting the distance from cell \(i\) and cell \(j\) as \(D_{ij}\), \(K_{ij}\) given kernel parameters \(k\) and \(\sigma\) is \[K_{ij} = \frac{1}{\sqrt{2\pi}\epsilon_{ij}} exp \left(-\frac{D_{ij}}{2\epsilon^2_{ij}}\right)\] where \[\epsilon_{ij} = \frac{(\mu_i + \mu_j)\sigma}{2}\]

\[\mu_{i} = \frac{\sum_{j \in KNN(i, k)} \|x_i - x_j\|^2_2}{k}\] \(KNN(x_i, k)\) is the \(k\) nearest neighbors of cell \(i\), and \(x_i\) is the gene expression level profile of cell \(i\). Therefore, the kernel parameters \(k\) determines how many neighbors of each cell to use to account for the local structure, and another parameter \(\sigma\) determines how to scale the distance to account for the non-linearity.

klist = seq(15, 25, by=5)
sigmalist = seq(1, 2, by=0.2)

\(\tau\) and \(\gamma\) are the penalty parameters for similarity learning. SLSL combines multiple kernel matrices by giving each of them different weights to reach maximum sparsity (\(\tau\)). Some past works showed that imposing penalty \(gamma\) on the Frobenius norm of the learned similarity matrix. This prevents the similarity matrix from being too close to the identity matrix (assigning each cell to a separate cluster). However, empirically, removing this penalty and setting \(\gamma = 0\) performed well. \(\tau\) is set to 5 as a default, and as long as it’s not too small (we recommend \(\tau > 1\)), the algorithm works well.

tau = 5
gamma = 0

Lastly, verbose should be set to TRUE if the user would like to see the progress of the algorithm. measuretime should be set to TRUE if the user would like to see how long each step of the algorithm takes. The default settings are FALSE for both parameters, but here, we set them TRUE to see what they do.

verbose = TRUE
measuretime = TRUE

Run the Function

Using the prespecified parameters above, run the SLSL function.

out = SLSL(X                      = X,
           numClust               = numClust,
           k                      = k,
           filter                 = filter,
           filter_p1              = filter_p1,
           filter_p2              = filter_p2,
           correct_detection_rate = correct_detection_rate,
           kernel_type            = kernel_type,
           klist                  = klist,
           sigmalist              = sigmalist,
           tau                    = tau,
           gamma                  = gamma,
           verbose                = verbose,
           measuretime            = measuretime
           )

[1] "constructing kernel.."
[1] "optimizing.."
[1] "network diffusion.."
[1] "dimension reduction.."
gene filter & numClust 0.34 
constructing kernel 2.39 
get S 0.03 
network diffusion 0.1 
tsne 0.17

Or, you can also use default parameters, and this will work as well.

out = SLSL(X)

Analyze the Result

First, let’s evaluate the clustering result. This is not applicable when true labels are not available.

table(out$result, as.factor(truelabel))

   
    16cell 2cell 4cell 8cell blast zygote
  1      2     0     0     4     0      0
  2      0     0     0     0    30      0
  3     14     0     0     0     0      0
  4      0     0     0    16     0      0
  5      0     0    12     0     0      0
  6      0     6     0     0     0      6

adj.rand.index(out$result, as.numeric(as.factor(truelabel)))

[1] 0.8954618

Let’s take a look at the heatmap of the similarity matrix. Note that the cells have been arranged according to the true groupings. The result seems pretty good, except the two cells around 50 that have clearly been misclassified.

heat(out$S)

Expand here to see past versions of similarity_matrix-1.png:

Version	Author	Date
8c55246	tk382	2018-07-11

Now let’s visualize the result through the dimension reduction result. The true label information allows us to analyze the errors from the algorithm. For example, the zygotes and 2-cells were clustered into one group, while there are two outliers in the 16 cells that are close to 8 cells.

tsne = out$tsne
df = as.data.frame(tsne)
df$truelabel = as.factor(truelabel)
df$result = as.factor(out$result)

ggplot(df, aes(x=V3, y=V2, col=truelabel)) + geom_point()

Expand here to see past versions of dimred-1.png:

Version	Author	Date
8c55246	tk382	2018-07-11

ggplot(df, aes(x=V3, y=V2, col=result)) + geom_point()

Expand here to see past versions of dimred-2.png:

Version	Author	Date
8c55246	tk382	2018-07-11

scatterplot3d(x=df$V2, y=df$V3, z=df$V4, color = as.numeric(as.factor(truelabel)))

Expand here to see past versions of dimred-3.png:

Version	Author	Date
8c55246	tk382	2018-07-11

scatterplot3d(x=df$V2, y=df$V3, z=df$V4, color = df$result)

Expand here to see past versions of dimred-4.png:

Version	Author	Date
8c55246	tk382	2018-07-11

Using Reference Panel

The SLSL function also has a hidden argument “ref”. This allows the users to utilize the reference panel data sets (H. Li et al. 2017). Two panels are available: cell types and tissue types. One can simply add this parameter in SLSL function.

Read Data

To demonstrate the use of the reference panel, we use the cell types of human embryonic cells published in (Chu et al. 2016).

load('../data/Chu_celltype.Rdata')
X = Chu_celltype$X
out_cell_ref = SLSL(X, 
                    ref="cell",
                    dir = "..", 
                    numClust = 7)

heat(out_cell_ref$projection)

Expand here to see past versions of ref_analyze-1.png:

Version	Author	Date
8c55246	tk382	2018-07-11

One can observe that in the cell atlas panel, ESC (embryonic stem cells) tend to show high correlation, which is as expected. Now, take a look at the heatmap of the similarity matrix of the projection matrix, and evaluate the clustering result.

truelabel = as.factor(Chu_celltype$label)
heat(out_cell_ref$SLSL_output$S)

Expand here to see past versions of ref_analyze2-1.png:

Version	Author	Date
8c55246	tk382	2018-07-11

table(out_cell_ref$SLSL_output$result, truelabel)

   truelabel
    DEC  EC  H1  H9 HFF NPC  TB
  1   0   0   0   0   0 173   0
  2   0   0   0   0 159   0   0
  3   1 104   0   0   0   0   0
  4 137   1   0   0   0   0   0
  5   0   0  97  77   0   0   0
  6   0   0 115  85   0   0   0
  7   0   0   0   0   0   0  69

adj.rand.index(out_cell_ref$SLSL_output$result, 
               as.numeric(truelabel))

[1] 0.7377856

Now look at the visualization from the dimension reduction result.

out = out_cell_ref$SLSL_output
tsne = out$tsne
df = as.data.frame(tsne)
df$truelabel = as.factor(truelabel)
df$result = as.factor(out$result)
ggplot(df, aes(x=V3, y=V2, col=truelabel)) + geom_point()

Expand here to see past versions of ref_analyze3-1.png:

Version	Author	Date
8c55246	tk382	2018-07-11

ggplot(df, aes(x=V3, y=V2, col=result)) + geom_point()

Expand here to see past versions of ref_analyze3-2.png:

Version	Author	Date
8c55246	tk382	2018-07-11

Similarly, we can use the tissue reference panel.

out_tissue_ref = SLSL(X, ref="tissue", dir = "..", numClust=7)

We can analyze the result like below.

heat(out_tissue_ref$projection)

Expand here to see past versions of tissueref_analyze-1.png:

Version	Author	Date
8c55246	tk382	2018-07-11

heat(out_tissue_ref$SLSL_output$S)

Expand here to see past versions of tissueref_analyze-2.png:

Version	Author	Date
8c55246	tk382	2018-07-11

table(out_tissue_ref$SLSL_output$result, truelabel)

   truelabel
    DEC  EC  H1  H9 HFF NPC  TB
  1   0   0   0   0   0 172   0
  2   1  91   0   0   0   0   0
  3 134  14   1   0   0   1  15
  4   0   0   0   0 159   0   0
  5   3   0   0   0   0   0  54
  6   0   0  96  65   0   0   0
  7   0   0 115  97   0   0   0

adj.rand.index(out_tissue_ref$SLSL_output$result,
               as.numeric(truelabel))

[1] 0.6894567

out = out_tissue_ref$SLSL_output
tsne = out$tsne
df = as.data.frame(tsne)
df$truelabel = as.factor(truelabel)
df$result = as.factor(out$result)
ggplot(df, aes(x=V3, y=V2, col=truelabel)) + geom_point()

Expand here to see past versions of tissueref_analyze-3.png:

Version	Author	Date
8c55246	tk382	2018-07-11

ggplot(df, aes(x=V3, y=V2, col=result)) + geom_point()

Expand here to see past versions of tissueref_analyze-4.png:

Version	Author	Date
8c55246	tk382	2018-07-11

Since using reference panels often throws away a lot of information, users should be careful when to use panels and which panel to use. The data sets should be from human cells, and they should have enough cellular heterogeneity. For example, Yan data are from different stages of human embryonic cells, and once projected on to the reference panels, all the cells belonged to one type : oocytes. The Chu data above also shows a much better performance when clustered without the reference panel.

Large Sets (cells > 1000)

There is a separate function for data sets that have more than 1,000 cells : LSLSL(Large Similarity Learning with Scaled Lasso).

The input data matrix is the same as regular SLSL: a count matrix with cells in columns and genes in rows. LSLSL assumes that the number of columns exceed 1000.

Read Data

load('../data/Zeisel.Rdata')
X = Zeisel$X
truelabel = Zeisel$label

Note that Zeisel is a Drop-seq data, and coverage is very low. The histogram of the detection rate is below.

Parameter Set-Up

detection_rate = 1-colSums(X==0)/nrow(X)
hist(detection_rate, breaks = 20)

Expand here to see past versions of zeisel_coverage-1.png:

Version	Author	Date
8c55246	tk382	2018-07-11

Since we have the true label data, we will use the true number of clusters,

numClust = length(unique(truelabel)) #default is NA

For the kernel_type, we will use “combined” as before.

kernel_type = "combined" #one of "spearman", "pearson", or "euclidean"

LSLSL uses “parallel” package in CRAN for parallel computing. Users can specify the number of cores to use. The default is NA, and LSLSL will detect the number of cores of your laptop and use everything except 1. Here, we specify it as 2.

core = 3 #default is NA

The shuffle argument decides if LSLSL should shuffle the order of the cells. LSLSL divides the cells into groups of approximately the same size, and it works the best if each group contains all subtypes of the cells. Unless the user carefully designed the ordering of the cells so that they are well mixed, we recommend to set shuffle as TRUE.

shuffle = TRUE #default is TRUE

Now run the algorithm with different methods.

res = LSLSL(X,
            numClust      = numClust,
            kernel_type   = kernel_type,
            core          = core,
            shuffle       = shuffle,
            cluster_method = "CSPA")

Analyze the Result

adj.rand.index(truelabel, res$result)

[1] 0.6628879

Let’s try all the available consensus clustering methods in LSLSL function. The result is stored, so there is no need to run it again.

# construct an array called "out" for specification in diceR package.
final = res$array
out = array(0, dim=c(nrow(final), ncol(final), 1, 1))
out[,,1,1] = final
k = length(unique(truelabel))

#dimension names
dimnames(out)[[1]] = paste0('R',1:nrow(final)) #random row names
dimnames(out)[[2]] = paste0('C',1:ncol(final)) #random column names
dimnames(out)[[3]] = paste0("k")
dimnames(out)[[4]] = as.character(k)
  
#imputation
newX = log(genefilter(X)+1)
E = impute_missing(out, t(newX), 9)
kmodes = k_modes(E, 9)
majvote = majority_voting(E, 9)
lce = LCE(E, 9)
adj.rand.index(kmodes, truelabel)

[1] 0.2859902

adj.rand.index(majvote, truelabel)

[1] 0.2931143

adj.rand.index(lce, truelabel)

[1] 0.2931143

Session information

sessionInfo()

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] diceR_0.5.0          scatterplot3d_0.3-41 stargazer_5.2.2     
 [4] abind_1.4-5          broom_0.4.4          gplots_3.0.1        
 [7] Rtsne_0.13           igraph_1.2.1         pracma_2.1.4        
[10] fossil_0.3.7         shapefiles_0.7       foreign_0.8-70      
[13] maps_3.3.0           sp_1.2-7             caret_6.0-79        
[16] lattice_0.20-35      reshape_0.8.7        dplyr_0.7.4         
[19] ggplot2_2.2.1        irlba_2.3.2          Matrix_1.2-14       
[22] quadprog_1.5-5       matrixStats_0.53.1   inline_0.3.14       

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2          class_7.3-14             
 [3] mclust_5.4                rprojroot_1.3-2          
 [5] RcppArmadillo_0.8.400.0.0 rstudioapi_0.7           
 [7] clue_0.3-55               DRR_0.0.3                
 [9] prodlim_2018.04.18        lubridate_1.7.4          
[11] codetools_0.2-15          splines_3.5.0            
[13] R.methodsS3_1.7.1         mnormt_1.5-5             
[15] robustbase_0.93-0         knitr_1.20               
[17] RcppRoll_0.2.2            workflowr_1.0.1          
[19] ddalpha_1.3.2             cluster_2.0.7-1          
[21] kernlab_0.9-25            R.oo_1.22.0              
[23] sfsmisc_1.1-2             shiny_1.0.5              
[25] compiler_3.5.0            backports_1.1.2          
[27] assertthat_0.2.0          lazyeval_0.2.1           
[29] later_0.7.1               htmltools_0.3.6          
[31] tools_3.5.0               bindrcpp_0.2.2           
[33] gtable_0.2.0              glue_1.2.0               
[35] reshape2_1.4.3            Rcpp_0.12.16             
[37] gdata_2.18.0              nlme_3.1-137             
[39] iterators_1.0.9           psych_1.8.3.3            
[41] timeDate_3043.102         gower_0.1.2              
[43] stringr_1.3.0             mime_0.5                 
[45] miniUI_0.1.1.1            gtools_3.5.0             
[47] DEoptimR_1.0-8            MASS_7.3-49              
[49] scales_0.5.0              ipred_0.9-6              
[51] promises_1.0.1            yaml_2.1.19              
[53] rpart_4.1-13              stringi_1.1.7            
[55] highr_0.6                 klaR_0.6-14              
[57] foreach_1.4.4             caTools_1.17.1           
[59] lava_1.6.1                geometry_0.3-6           
[61] rlang_0.2.0               pkgconfig_2.0.1          
[63] bitops_1.0-6              evaluate_0.10.1          
[65] purrr_0.2.4               bindr_0.1.1              
[67] recipes_0.1.2             labeling_0.3             
[69] CVST_0.2-1                tidyselect_0.2.4         
[71] plyr_1.8.4                magrittr_1.5             
[73] R6_2.2.2                  combinat_0.0-8           
[75] dimRed_0.1.0              pillar_1.2.2             
[77] whisker_0.3-2             withr_2.1.2              
[79] survival_2.41-3           nnet_7.3-12              
[81] tibble_1.4.2              questionr_0.6.2          
[83] KernSmooth_2.23-15        rmarkdown_1.9            
[85] grid_3.5.0                git2r_0.21.0             
[87] ModelMetrics_1.1.0        digest_0.6.15            
[89] xtable_1.8-2              tidyr_0.8.0              
[91] httpuv_1.4.1              R.utils_2.6.0            
[93] stats4_3.5.0              munsell_0.4.3            
[95] magic_1.5-8

Chu, Li-Fang, Ning Leng, Jue Zhang, Zhonggang Hou, Daniel Mamott, David T Vereide, Jeea Choi, Christina Kendziorski, Ron Stewart, and James A Thomson. 2016. “Single-Cell Rna-Seq Reveals Novel Regulators of Human Embryonic Stem Cell Differentiation to Definitive Endoderm.” Genome Biology 17 (1). BioMed Central: 173.

Li, Huipeng, Elise T Courtois, Debarka Sengupta, Yuliana Tan, Kok Hao Chen, Jolene Jie Lin Goh, Say Li Kong, et al. 2017. “Reference Component Analysis of Single-Cell Transcriptomes Elucidates Cellular Heterogeneity in Human Colorectal Tumors.” Nature Genetics 49 (5). Nature Publishing Group: 708.

Yan, Liying, Mingyu Yang, Hongshan Guo, Lu Yang, Jun Wu, Rong Li, Ping Liu, et al. 2013. “Single-Cell Rna-Seq Profiling of Human Preimplantation Embryos and Embryonic Stem Cells.” Nature Structural and Molecular Biology 20 (9). Nature Publishing Group: 1131.

This reproducible R Markdown analysis was created with workflowr 1.0.1

How to Use SLSL, Case by Case

Load the SLSL library.

Small, good data sets

Read Data

Parameter Set-Up

Run the Function

Analyze the Result

Using Reference Panel

Read Data

Large Sets (cells > 1000)

Read Data

Parameter Set-Up

Analyze the Result

Session information