Last updated: 2018-02-27
Code version: be7f846
Generate data with missing value:
Some rows have very sparse observations.
library(mashr)
Loading required package: ashr
set.seed(2018)
data = simple_sims(500, ncond = 60, err_sd = 0.5)
data.miss = data
missing.row = sample(1:nrow(data$Bhat), nrow(data$Bhat)/4)
# choose number of missing for each row
missing.num = sample(50:59, length(missing.row), replace = TRUE)
missing.ind = matrix(0, nrow = nrow(data$B), ncol=ncol(data$B))
for(i in 1: length(missing.row)){
col = sample(1:60, missing.num[i])
missing.ind[missing.row[i], col] = 1
}
missing = which(missing.ind == 1)
data.miss$Bhat[missing] = NA
data.miss$Shat[missing] = NA
missing1 = is.na(data.miss$Bhat[,1])
Set the missing value with large standard deviation
data.large.dev = data.miss
data.large.dev$Bhat[is.na(data.miss$Bhat)] = 0
data.large.dev$Shat[is.na(data.miss$Shat)] = 1000
ash
ash_res = ash(data.large.dev$Bhat[,1], data.large.dev$Shat[,1], mixcompdist = "normal", alpha = 0)
which(ash_res$result$lfsr[missing1] < 0.05)
integer(0)
ash_res_orig = ash(data.large.dev$Bhat[!missing1,1], data.large.dev$Shat[!missing1,1], mixcompdist = "normal", alpha = 0)
sum((ash_res$result$PosteriorMean[!missing1] - ash_res_orig$result$PosteriorMean)^2)
[1] 1.09632e-11
The missing data are estimated as 0 and non-significant using ash
. Since the missing data was assigned with large error, the analysis ignore those missing values. The estimates for the non-missing ones are similar.
mash
mash.data.missing = mash_set_data(Bhat=data.large.dev$Bhat, Shat=data.large.dev$Shat)
U.c = cov_canonical(mash.data.missing)
mash.model.missing = mash(mash.data.missing, U.c)
- Computing 2000 x 1106 likelihood matrix.
- Likelihood calculations took 41.36 seconds.
- Fitting model with 1106 mixture components.
- Model fitting took 19.91 seconds.
- Computing posterior matrices.
- Computation allocated took 21.42 seconds.
barplot(get_estimated_pi(mash.model.missing), las=2, cex.names = 0.7)
The weights learned from the data is correct, even there are many missing values. There are 888 significant samples, 2071 significent effects in missing data.
Some of the missing value is estimated with significant posterior mean. This is caused by the information borrowing from other observations. Based on the covariance structure learned from the whole dataset, the missing value are imputed.
We can check the RMSE, comparing the posterior mean for the missing data with the true value.
# RMSE
sqrt(sum(((mash.model.missing$result$PosteriorMean[missing] - data$B[missing])^2))/length(missing))
[1] 0.5383296
mash.data.missing.1 = mash_set_data(Bhat=data.large.dev$Bhat, Shat=data.large.dev$Shat, alpha = 1)
U.c = cov_canonical(mash.data.missing)
mash.model.missing.1 = mash(mash.data.missing.1, U.c)
- Computing 2000 x 1106 likelihood matrix.
- Likelihood calculations took 6.58 seconds.
- Fitting model with 1106 mixture components.
Warning in REBayes::KWDual(A, rep(1, k), normalize(w), control = control): estimated mixing distribution has some negative values:
consider reducing rtol
Warning in mixIP(matrix_lik = structure(c(0.239881584685685,
0.339137348690074, : Optimization step yields mixture weights that are
either too small, or negative; weights have been corrected and renormalized
after the optimization.
- Model fitting took 24.21 seconds.
- Computing posterior matrices.
- Computation allocated took 9.73 seconds.
FIXME: 'compute_posterior_matrices' in Rcpp does not transfer EZ to EE
barplot(get_estimated_pi(mash.model.missing.1), las=2, cex.names = 0.7)
The weights learned from the data are very weird and incorrect. There are 763 significant samples, 0 significent effects in missing data. The maximum of the estimated posterior mean is 240.0490705 with estimated sd 737.2168498. This result is not what we expected. In the EZ model, multiplying back the standard errors causes the problem, since the missing data are set with large standard error.
So, one lesson is that we should use EE model when there are missing values in the data.
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mashr_0.2-4 ashr_2.2-4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.15 knitr_1.17 magrittr_1.5
[4] REBayes_1.2 MASS_7.3-47 doParallel_1.0.11
[7] pscl_1.5.2 SQUAREM_2017.10-1 lattice_0.20-35
[10] foreach_1.4.4 plyr_1.8.4 stringr_1.2.0
[13] tools_3.4.3 parallel_3.4.3 grid_3.4.3
[16] rmeta_2.16 git2r_0.20.0 htmltools_0.3.6
[19] iterators_1.0.9 assertthat_0.2.0 yaml_2.1.16
[22] rprojroot_1.2 digest_0.6.13 Matrix_1.2-12
[25] codetools_0.2-15 evaluate_0.10.1 rmarkdown_1.8
[28] stringi_1.1.6 compiler_3.4.3 Rmosek_8.0.69
[31] backports_1.1.2 mvtnorm_1.0-7 truncnorm_1.0-7
This R Markdown site was created with workflowr