Last updated: 2018-06-20
This vignette shows how flashr can be used to learn something about the covariance structure of a dataset. Recall that flashr fits the model \[\begin{aligned} Y &= X + E \\ X &= LF' \end{aligned}\] where \(Y\) (the "observed" data) and \(X\) (the "true" effects) are \(n \times p\) matrices, \(L\) is an \(n \times k\) matrix of loadings, \(F\) is a \(p \times k\) matrix of factors, and \(E\) is an \(n \times p\) matrix of residuals. Denote the columns of \(L\) as \(\ell_1, \ldots, \ell_k\) and the columns of \(F\) as \(f_1, \ldots, f_k\).
Notice that the fitted model tells us not only how the elements of \(L\) and \(F\) are distributed, but also something about how the elements of \(X\) covary.
In particular, if \(L\) is regarded as fixed (by, for example, fixing each \(L_{ij}\) at its posterior mean), then one may write \[ X_{:, j} = F_{j 1} \ell_1 + \ldots + F_{j k} \ell_k, \] so that, since the elements of \(F\) have mean zero and are a priori independent across factors, \[ \text{Cov} (X_{:, j}) = (\mathbb{E} F_{j1}^2) \ell_1 \ell_1' + \ldots + (\mathbb{E} F_{jk}^2) \ell_k \ell_k'. \]
In other words, the covariance of the columns of \(X\) will be a linear combination (or, more precisely, a conical combination) of the rank-one covariance matrices \(\ell_i \ell_i'\).
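As a concrete illustration, here is a minimal sketch that builds such a conical combination directly. The values of \(L\) and the second moments \(\mathbb{E} F_{jm}^2\) are made up for illustration; they do not come from a real fit.
# Illustrative only: L and EF2 are made-up values, not fitted quantities.
set.seed(1)
n <- 4
k <- 2
L <- matrix(rnorm(n * k), nrow = n, ncol = k)
EF2 <- c(1.5, 0.5) # assumed values of E(F_j1^2) and E(F_j2^2)
# Cov(X[, j]) as a weighted sum of the rank-one matrices l_i l_i':
cov_Xj <- matrix(0, n, n)
for (i in 1:k) {
  cov_Xj <- cov_Xj + EF2[i] * tcrossprod(L[, i]) # tcrossprod gives l_i l_i'
}
cov_Xj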
One can take this idea a bit further by recalling that the priors on factors \(f_1, \ldots, f_k\) are mixture distributions. For simplicity, assume that the priors are point-normal distributions \[f_m \sim (1 - \pi_m) \delta_0 + \pi_m N(0, \tau_m^2)\] with \(0 \le \pi_m \le 1\).
By augmenting the data with hidden variables \(I_{jm}\) that indicate the mixture components from which the elements \(F_{jm}\) are drawn, one may equivalently write \[\begin{aligned} F_{jm} \mid I_{jm} = 0 &\sim \delta_0 \\ F_{jm} \mid I_{jm} = 1 &\sim N(0, \tau_m^2) \\ I_{jm} &\sim \text{Bernoulli}(\pi_m) \end{aligned}\] so that \(I_{jm} = 1\) if gene \(j\) is “active” for factor \(m\) and \(I_{jm} = 0\) otherwise.
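To make the augmented representation concrete, here is a minimal sketch that first draws the indicators and then the factor values. The values of \(\pi_m\) and \(\tau_m\) are illustrative, not fitted.
# Illustrative values, not fitted quantities:
set.seed(1)
p <- 6
pi_m <- 0.3 # prior probability that a gene is "active"
tau_m <- 2 # standard deviation of the normal component
I_jm <- rbinom(p, size = 1, prob = pi_m) # activity indicators
F_jm <- I_jm * rnorm(p, 0, tau_m) # delta_0 when inactive, N(0, tau_m^2) when active
cbind(I_jm, F_jm)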
Now, if one regards the variables \(I_{jm}\) (as well as \(L\)) as fixed, then one obtains \[ \text{Cov} (X_{:, j}) = I_{j 1} \tau_1^2 \ell_1 \ell_1' + \ldots + I_{j k} \tau_k^2 \ell_k \ell_k'. \]
In other words, depending on the values of \(I_{j1}, \ldots, I_{jk}\), the elements of column \(X_{:, j}\) will covary in one of \(2^k\) possible ways.
In fact, if the priors on the factors \(f_1, \ldots, f_k\) are arbitrary mixtures of normals (including the point-normal priors discussed above), then fixing \(L\) results in the columns of \(X\) being distributed (exchangeably) as a mixture of multivariate normals. In the point-normal case, \[ X_{:, j} \sim \sum_{r = 1}^{2^k} w_r N_n(0, \Sigma_r), \] where the mixture weights \(w_r\) sum to one and each \(\Sigma_r\) is a conical combination of the rank-one covariance matrices \(\ell_1 \ell_1', \ldots, \ell_k \ell_k'\).
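For small \(k\), one can enumerate all \(2^k\) matrices \(\Sigma_r\) by looping over the possible activity patterns. A minimal sketch, again with made-up values of \(L\) and \(\tau_m^2\):
# Illustrative only: L and tau2 are made-up values, not fitted quantities.
set.seed(1)
n <- 4
k <- 2
L <- matrix(rnorm(n * k), nrow = n, ncol = k)
tau2 <- c(1.5, 0.5)
patterns <- as.matrix(expand.grid(rep(list(0:1), k))) # all 2^k activity patterns
Sigma <- lapply(seq_len(nrow(patterns)), function(r) {
  # Sum the rank-one matrices that are "on" in pattern r:
  Reduce(`+`, lapply(1:k, function(m) patterns[r, m] * tau2[m] * tcrossprod(L[, m])))
})
length(Sigma) # 2^k covariance matrices, one per pattern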
mashr (see here) similarly models \(X\) as a mixture of multivariate normals, so it makes sense to attempt to use flashr (which is, on its face, much simpler than mashr) to similar ends.
In addition to a set of covariance matrices that are fit to the data, mashr includes a set of "canonical" covariance matrices. For example, it is reasonable to expect that some effects will be unique to a single condition \(i\). For the corresponding columns of \(X\), the covariance matrix has a single non-zero entry (namely, the \(i\)th diagonal entry, \(\Sigma_{ii}\)).
One can accommodate such effects in flashr by adding \(n\) fixed one-hot vectors as loadings (one for each condition). In other words, we can add "canonical" covariance structures corresponding to effects that are unique to a single condition by fitting the model \[ X = \begin{pmatrix} L & I_n \end{pmatrix} F'. \]
By the same logic as above, this model should be able to accommodate conical combinations of the matrices \(e_1 e_1', \ldots, e_n e_n'\) (where \(e_i\) is the \(i\)th canonical basis vector). In particular, it should be able to model effects that are independent across conditions, or unique to a few conditions, as well as effects that are unique to a single condition.
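A minimal sketch of the augmented loadings matrix (again with a made-up \(L\)):
# Illustrative only: L is a made-up loadings matrix.
set.seed(1)
n <- 4
k <- 2
L <- matrix(rnorm(n * k), nrow = n, ncol = k)
L_aug <- cbind(L, diag(n)) # append the n one-hot "canonical" loadings
L_aug
# Any conical combination of the matrices e_i e_i' is diagonal, so the extra
# loadings can absorb independent or condition-specific effects.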
Now the full model is \[ Y = \begin{pmatrix} L & I_n \end{pmatrix} F' + E. \]
Notice that this approach only makes sense if the standard errors for \(E\) are considered fixed. Writing \[ F' = \begin{pmatrix} F_1' \\ F_2' \end{pmatrix} \] gives \[ Y = LF_1' + F_2' + E, \] so that, for example, putting independent \(N(0, 1)\) priors on the elements of \(F_2\) and \(E\) is equivalent to putting a \(\delta_0\) prior on the elements of \(F_2\) and independent \(N(0, 2)\) priors on the residuals.
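This equivalence is easy to check numerically: the sum of two independent \(N(0, 1)\) draws behaves like a single \(N(0, 2)\) draw.
set.seed(1)
x <- rnorm(1e5) + rnorm(1e5) # an element of F_2 plus a residual, both N(0, 1)
var(x) # close to 2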
To fix the standard errors in flashr, one must pass the data to flash_set_data, using the parameter S to specify the standard errors. Further, calls to flash (and to flash_add_greedy and flash_backfit) must set var_type = "zero", which indicates that the standard errors should not be estimated but instead treated as fixed.
Finally, the recommended procedure is as follows:
1. Create a flash data object, using the parameter S to specify the standard errors.
2. Greedily add the "data-driven" loadings to the flash fit object, setting var_type = "zero" to indicate that the standard errors should be considered fixed.
3. Add \(n\) fixed one-hot loading vectors to the flash fit object (one per condition).
4. Refine the flash fit object via backfitting, again with var_type = "zero". The option nullcheck = FALSE should also be used so that the canonical covariance structures are retained.
The first code chunk simulates a \(5 \times 400\) data matrix in which a quarter of the columns \(X_{:, j}\) are null across all conditions, a quarter are nonnull in the first condition only, a quarter are nonnull in all conditions with effect sizes that are independent across conditions, and a quarter are nonnull in all conditions with an effect size that is identical across conditions.
set.seed(1)
n <- 5    # number of conditions
p <- 400  # number of samples
ncond <- n
nsamp <- as.integer(p / 4)
# Null effects:
X.zero <- matrix(0, nrow = ncond, ncol = nsamp)
# Effects that occur only in condition 1:
X.one <- X.zero
b2 <- 5 * rnorm(nsamp)
X.one[1, ] <- b2
# Effects that are independent across conditions:
X.ind <- matrix(5 * rnorm(ncond * nsamp), nrow = ncond, ncol = nsamp)
# Effects that are identical across conditions:
b <- 5 * rnorm(nsamp)
X.ident <- matrix(rep(b, ncond), nrow = ncond, ncol = nsamp, byrow = TRUE)
X <- cbind(X.zero, X.one, X.ind, X.ident)
E <- matrix(rnorm(4 * ncond * nsamp), nrow = ncond, ncol = 4 * nsamp)
Y <- X + E
The next code chunk fits a flash object using the procedure described in the previous section.
library(flashr)
# Set up the flash data object, fixing the standard errors at S = 1:
data <- flash_set_data(Y, S = 1)
# Greedily add up to 10 data-driven loadings, keeping the standard errors fixed:
fl_greedy <- flash_add_greedy(data, Kmax = 10, var_type = "zero")
# Add the n fixed one-hot loading vectors, one per condition:
I_n <- diag(rep(1, n))
fl_fixedl <- flash_add_fixed_l(data, I_n, fl_greedy)
# Backfit with fixed standard errors; skip the nullcheck so that the
# canonical loadings are retained:
fl_backfit <- flash_backfit(data, fl_fixedl, var_type = "zero", nullcheck = FALSE)
Finally, the following code chunk calculates the mean squared error of the posterior means from the flash fit, relative to the mean squared error that would be obtained by simply estimating \(X\) with the data matrix \(Y\).
baseline_mse <- sum((Y - X)^2)/(n * p)
fl_preds <- flash_get_lf(fl_backfit)
flash_mse <- sum((fl_preds - X)^2)/(n * p)
flash_mse / baseline_mse
[1] 0.5224411
To verify that flashr has in fact learned something about how the data covary, one can collapse the data into a vector and use ashr to fit a prior that is a univariate mixture of normals.
ash_fit <- ashr::ash(betahat = as.vector(Y), sebetahat = 1)
ash_pm <- ashr::get_pm(ash_fit)
ash_preds <- matrix(ash_pm, nrow=n, ncol=p)
ash_mse <- sum((ash_preds - X)^2)/(n * p)
ash_mse / baseline_mse
[1] 0.7572497
So, the flashr estimates are much better, even though flashr uses point-normal priors rather than the more flexible class of normal mixtures used by ashr.
For this particular simulation, mashr outperforms flashr in terms of MSE:
library(mashr)
# mashr expects effects in rows and conditions in columns, so transpose Y:
data <- mash_set_data(t(Y))
# Canonical covariance matrices:
U.c <- cov_canonical(data)
# Data-driven covariance matrices, initialized via PCA and refined with
# extreme deconvolution on the "strong" effects:
m.1by1 <- mash_1by1(data)
strong <- get_significant_results(m.1by1, 0.05)
U.pca <- cov_pca(data, 5, strong)
U.ed <- cov_ed(data, U.pca, strong)
# Fit the mash model with both sets of covariance matrices:
U <- c(U.c, U.ed)
m <- mash(data, U)
mash_mse <- sum((t(get_pm(m)) - X)^2)/(n * p)
mash_mse / baseline_mse
[1] 0.4161114
This is as expected, since the data were after all generated from the mash model. More generally, however, the two methods often perform quite similarly. See the simulation study here and a comparison using GTEx data here.
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mashr_0.2-7 ashr_2.2-7 flashr_0.5-8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 pillar_1.2.1
[3] plyr_1.8.4 compiler_3.4.3
[5] git2r_0.21.0 workflowr_1.0.1
[7] R.methodsS3_1.7.1 R.utils_2.6.0
[9] iterators_1.0.9 tools_3.4.3
[11] testthat_2.0.0 digest_0.6.15
[13] tibble_1.4.2 evaluate_0.10.1
[15] memoise_1.1.0 gtable_0.2.0
[17] lattice_0.20-35 rlang_0.2.0
[19] Matrix_1.2-12 foreach_1.4.4
[21] commonmark_1.4 yaml_2.1.17
[23] parallel_3.4.3 mvtnorm_1.0-7
[25] ebnm_0.1-11 withr_2.1.1.9000
[27] stringr_1.3.0 roxygen2_6.0.1.9000
[29] xml2_1.2.0 knitr_1.20
[31] REBayes_1.2 devtools_1.13.4
[33] rprojroot_1.3-2 grid_3.4.3
[35] R6_2.2.2 rmarkdown_1.8
[37] rmeta_3.0 ggplot2_2.2.1
[39] magrittr_1.5 whisker_0.3-2
[41] backports_1.1.2 scales_0.5.0
[43] codetools_0.2-15 htmltools_0.3.6
[45] MASS_7.3-48 assertthat_0.2.0
[47] softImpute_1.4 colorspace_1.3-2
[49] stringi_1.1.6 Rmosek_7.1.3
[51] lazyeval_0.2.1 munsell_0.4.3
[53] doParallel_1.0.11 pscl_1.5.2
[55] truncnorm_1.0-8 SQUAREM_2017.10-1
[57] ExtremeDeconvolution_1.3 R.oo_1.21.0
This reproducible R Markdown analysis was created with workflowr 1.0.1