Last updated: 2018-06-16
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180609)
The command set.seed(20180609)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: d8b6331
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: docs/.DS_Store
Ignored: docs/images/.DS_Store
Untracked files:
Untracked: output/gtexfllfsr.rds
Unstaged changes:
Modified: analysis/index.Rmd
Modified: code/gtex.R
Modified: output/gtexmfit.rds
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | d8b6331 | Jason Willwerscheid | 2018-06-16 | analysis/index.Rmd |
Here I analyze some GTEx data. The dataset can be found at https://stephenslab.github.io/gtexresults/.
I used the same methods to fit the data that I used in my simulation study. These methods assume that noise is independent among conditions. It is not, but it is still useful to see how the methods compare when applied to a real dataset.
The simulation study suggested that the “one-hots last” method produces a much better fit than the “one-hots first” method, even though it can take quite a bit longer. Here I enter into some more detail.
First I load the data and the fits.
gtex <- readRDS("./data/MatrixEQTLSumStats.Portable.Z.rds")
data <- gtex$test.z
data <- t(data)
fl_data <- flash_set_data(data, S = 1)
gtex_mfit <- readRDS("./output/gtexmfit.rds")
gtex_flfit <- readRDS("./output/gtexflfit.rds")
gtex_flfit2 <- readRDS("./output/gtexflfit2.rds")
gtex_flfit3 <- readRDS("./output/gtexflfit3.rds")
The OHL fit was produced by greedily adding a total of 31 factors, then adding 44 fixed one-hot factors (one per condition), then backfitting the whole thing. The objective attained was -1259284.
The OHF fit added the 44 fixed one-hot factors, then backfit them, then added only 2 (!) more factors greedily. The resulting objective was much worse than that of the OHL fit, at -1635669.
Finally, I tried applying an additional backfitting step to the OHF fit to see how much the objective improved (I call this method “FLASH-OHF+” in the plot below). The final objective was -1306242: better, but still not as good as the OHL fit.
It seems clear that the OHL method is the way to go. However, it does take a long time (over twice as long as MASH):
data <- c(gtex_mfit$timing$ed, gtex_mfit$timing$mash,
gtex_flfit$timing$greedy, gtex_flfit$timing$backfit,
gtex_flfit2$timing$greedy, gtex_flfit2$timing$backfit,
gtex_flfit3$timing$greedy, gtex_flfit3$timing$backfit)
time_units <- units(data)
data <- matrix(as.numeric(data), 2, 4)
barplot(data, axes=T,
main=paste("Average time to fit in", time_units),
names.arg = c("MASH", "FL-OHL", "FL-OHF", "FL-OHF+"),
legend.text = c("ED/Greedy", "MASH/Backfit"),
ylim = c(0, max(colSums(data))*1.5))
The posterior means are quite similar (correlation coefficient = 0.98):
Next I look at confusion matrices for gene-condition pairs that are declared significant at a given LFSR threshold. For each threshold, MASH and FLASH agree in approximately 85-87% of cases. (But is it really true that )
m_lfsr <- t(get_lfsr(gtex_mfit$m))
fl_lfsr <- readRDS("./output/gtexfllfsr.rds")
confusion_matrix <- function(t) {
mash_signif <- m_lfsr <= t
flash_signif <- fl_lfsr <= t
round(table(mash_signif, flash_signif)
/ length(mash_signif), digits=3)
}
At 5%:
confusion_matrix(.05)
flash_signif
mash_signif FALSE TRUE
FALSE 0.407 0.116
TRUE 0.036 0.441
At 1%:
confusion_matrix(.01)
flash_signif
mash_signif FALSE TRUE
FALSE 0.523 0.076
TRUE 0.056 0.346
At 0.1%:
confusion_matrix(.001)
flash_signif
mash_signif FALSE TRUE
FALSE 0.585 0.088
TRUE 0.048 0.279
Click the button to view the code used to obtain the above results.
devtools::load_all("/Users/willwerscheid/GitHub/flashr2/")
library(mashr)
gtex <- readRDS("./data/MatrixEQTLSumStats.Portable.Z.rds")
data <- gtex$test.z
data <- t(data)
fl_data <- flash_set_data(data, S = 1)
source("./code/fits.R")
source("./code/sims.R")
gtex_mfit <- fit_mash(data)
saveRDS(gtex_mfit, "./output/gtexmfit.rds")
gtex_flfit <- fit_flash(data, Kmax = 40, add_onehots_first = FALSE)
saveRDS(gtex_flfit, "./output/gtexflfit.rds")
obj1 <- flash_get_objective(fl_data, gtex_flfit$fl) # -1259284
# Try OHF method of fitting FLASH object and compare likelihoods
gtex_flfit2 <- fit_flash(data, Kmax = 40, add_onehots_first = TRUE)
saveRDS(gtex_flfit2, "./output/gtexflfit2.rds")
obj2 <- flash_get_objective(fl_data, gtex_flfit2$fl) # -1635669
# Now do an additional backfit on the OHF fit
gtex_flfit3 <- list()
t0 <- Sys.time()
gtex_flfit3$fl <- flash_backfit(fl_data, gtex_flfit2$fl, var_type = "zero",
nullcheck = F, verbose = T)
t <- Sys.time() - t0
gtex_flfit3$timing <- gtex_flfit2$timing
gtex_flfit3$timing$backfit <- gtex_flfit3$timing$backfit + t
gtex_flfit3$timing$total <- gtex_flfit3$timing$total + t
saveRDS(gtex_flfit3, "./output/gtexflfit3.rds")
obj3 <- flash_get_objective(fl_data, gtex_flfit3$fl) # -1306242
# Use PM from each method as "true Y" and do diagnostics
# fl_pm <- flash_get_lf(gtex_flfit$fl)
# gtex_mres <- mash_diagnostics(gtex_mfit$m, fl_pm)
# saveRDS(gtex_mres, "./output/gtexmres.rds")
#
# m_pm <- t(get_pm(gtex_mfit$m))
# gtex_flres <- flash_diagnostics(gtex_flfit$fl, data, m_pm, nsamp = 200)
# saveRDS(gtex_flres, "./output/gtexflres.rds")
# Plot FLASH PM vs. MASH PM
fl_pm <- flash_get_lf(gtex_flfit$fl)
m_pm <- t(get_pm(gtex_mfit$m))
png("./output/gtexcompare.png")
plot(as.vector(fl_pm), as.vector(m_pm), xlab="FLASH PM", ylab="MASH PM",
main="Posterior means on GTEx data", pch='.')
dev.off()
corr <- cor(as.vector(fl_pm), as.vector(m_pm))
# Use LFSR to get "significant" effects and get confusion matrices
m_lfsr <- t(get_lfsr(gtex_mfit$m))
fl_sampler <- flash_lf_sampler(data, gtex_flfit$fl, ebnm_fn=ebnm_pn, fixed="loadings")
fl_lfsr <- flash_lfsr(fl_sampler(200))
saveRDS(fl_lfsr, "./output/gtexfllfsr.rds")
confusion_matrix <- function(t) {
mash_signif <- m_lfsr <= t
flash_signif <- fl_lfsr <= t
round(table(mash_signif, flash_signif)
/ length(mash_signif), digits=3)
}
confusion_matrix(.05)
confusion_matrix(.01)
confusion_matrix(.001)
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mashr_0.2-7 ashr_2.2-7 flashr_0.5-8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 pillar_1.2.1 plyr_1.8.4
[4] compiler_3.4.3 git2r_0.21.0 workflowr_1.0.1
[7] R.methodsS3_1.7.1 R.utils_2.6.0 iterators_1.0.9
[10] tools_3.4.3 testthat_2.0.0 digest_0.6.15
[13] tibble_1.4.2 evaluate_0.10.1 memoise_1.1.0
[16] gtable_0.2.0 lattice_0.20-35 rlang_0.2.0
[19] Matrix_1.2-12 foreach_1.4.4 commonmark_1.4
[22] yaml_2.1.17 parallel_3.4.3 mvtnorm_1.0-7
[25] ebnm_0.1-11 withr_2.1.1.9000 stringr_1.3.0
[28] roxygen2_6.0.1.9000 xml2_1.2.0 knitr_1.20
[31] devtools_1.13.4 rprojroot_1.3-2 grid_3.4.3
[34] R6_2.2.2 rmarkdown_1.8 rmeta_3.0
[37] ggplot2_2.2.1 magrittr_1.5 whisker_0.3-2
[40] backports_1.1.2 scales_0.5.0 codetools_0.2-15
[43] htmltools_0.3.6 MASS_7.3-48 assertthat_0.2.0
[46] softImpute_1.4 colorspace_1.3-2 stringi_1.1.6
[49] lazyeval_0.2.1 munsell_0.4.3 doParallel_1.0.11
[52] pscl_1.5.2 truncnorm_1.0-8 SQUAREM_2017.10-1
[55] R.oo_1.21.0
This reproducible R Markdown analysis was created with workflowr 1.0.1