Last updated: 2018-06-20
This vignette shows how flashr can be used to learn something about the covariance structure of a dataset. Recall that flashr fits the model \[\begin{aligned} Y &= X + E \\ X &= LF' \end{aligned}\] where \(Y\) (the "observed" data) and \(X\) (the "true" effects) are \(n \times p\) matrices, \(L\) is an \(n \times k\) matrix of loadings, \(F\) is a \(p \times k\) matrix of factors, and \(E\) is an \(n \times p\) matrix of residuals. Denote the columns of \(L\) as \(\ell_1, \ldots, \ell_k\) and the columns of \(F\) as \(f_1, \ldots, f_k\).
Notice that the fitted model tells us not only how the elements of \(L\) and \(F\) are distributed, but also something about how the elements of \(X\) covary.
In particular, if \(L\) is regarded as fixed (by, for example, fixing each \(L_{ij}\) at its posterior mean), then one may write \[ X_{:, j} = F_{j 1} \ell_1 + \ldots + F_{j k} \ell_k, \] so that, since the elements of \(F\) have mean zero and are a priori independent across factors, \[ \text{Cov} (X_{:, j}) = (\mathbb{E} F_{j1}^2) \ell_1 \ell_1' + \ldots + (\mathbb{E} F_{jk}^2) \ell_k \ell_k'. \]
In other words, the covariance of the columns of \(X\) will be a linear combination (or, more precisely, a conical combination) of the rank-one covariance matrices \(\ell_i \ell_i'\).
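As a concrete illustration, here is a minimal sketch that builds such a conical combination directly. The values of \(L\) and the second moments \(\mathbb{E} F_{jm}^2\) are made up for illustration; they do not come from a real fit.
# Illustrative only: L and EF2 are made-up values, not fitted quantities.
set.seed(1)
n <- 4
k <- 2
L <- matrix(rnorm(n * k), nrow = n, ncol = k)
EF2 <- c(1.5, 0.5) # assumed values of E(F_j1^2) and E(F_j2^2)
# Cov(X[, j]) as a weighted sum of the rank-one matrices l_i l_i':
cov_Xj <- matrix(0, n, n)
for (i in 1:k) {
  cov_Xj <- cov_Xj + EF2[i] * tcrossprod(L[, i]) # tcrossprod gives l_i l_i'
}
cov_Xj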
One can take this idea a bit further by recalling that the priors on factors \(f_1, \ldots, f_k\) are mixture distributions. For simplicity, assume that the priors are point-normal distributions \[f_m \sim (1 - \pi_m) \delta_0 + \pi_m N(0, \tau_m^2)\] with \(0 \le \pi_m \le 1\).
By augmenting the data with hidden variables \(I_{jm}\) that indicate the mixture components from which the elements \(F_{jm}\) are drawn, one may equivalently write \[\begin{aligned} F_{jm} \mid I_{jm} = 0 &\sim \delta_0 \\ F_{jm} \mid I_{jm} = 1 &\sim N(0, \tau_m^2) \\ I_{jm} &\sim \text{Bernoulli}(\pi_m) \end{aligned}\] so that \(I_{jm} = 1\) if gene \(j\) is “active” for factor \(m\) and \(I_{jm} = 0\) otherwise.
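To make the augmented representation concrete, here is a minimal sketch that first draws the indicators and then the factor values. The values of \(\pi_m\) and \(\tau_m\) are illustrative, not fitted.
# Illustrative values, not fitted quantities:
set.seed(1)
p <- 6
pi_m <- 0.3 # prior probability that a gene is "active"
tau_m <- 2 # standard deviation of the normal component
I_jm <- rbinom(p, size = 1, prob = pi_m) # activity indicators
F_jm <- I_jm * rnorm(p, 0, tau_m) # delta_0 when inactive, N(0, tau_m^2) when active
cbind(I_jm, F_jm)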
Now, if one regards the variables \(I_{jm}\) (as well as \(L\)) as fixed, then one obtains \[ \text{Cov} (X_{:, j}) = I_{j 1} \tau_1^2 \ell_1 \ell_1' + \ldots + I_{j k} \tau_k^2 \ell_k \ell_k'. \]
In other words, depending on the values of \(I_{j1}, \ldots, I_{jk}\), the elements of column \(X_{:, j}\) will covary in one of \(2^k\) possible ways.
In fact, if the priors on the factors \(f_1, \ldots, f_k\) are arbitrary mixtures of normals (including the point-normal priors discussed above), then fixing \(L\) results in the columns of \(X\) being distributed (exchangeably) as a mixture of multivariate normals. In the point-normal case, \[ X_{:, j} \sim \sum_{r = 1}^{2^k} w_r N_n(0, \Sigma_r), \] where the mixture weights \(w_r\) sum to one and each \(\Sigma_r\) is a conical combination of the rank-one covariance matrices \(\ell_1 \ell_1', \ldots, \ell_k \ell_k'\).
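For small \(k\), one can enumerate all \(2^k\) matrices \(\Sigma_r\) by looping over the possible activity patterns. A minimal sketch, again with made-up values of \(L\) and \(\tau_m^2\):
# Illustrative only: L and tau2 are made-up values, not fitted quantities.
set.seed(1)
n <- 4
k <- 2
L <- matrix(rnorm(n * k), nrow = n, ncol = k)
tau2 <- c(1.5, 0.5)
patterns <- as.matrix(expand.grid(rep(list(0:1), k))) # all 2^k activity patterns
Sigma <- lapply(seq_len(nrow(patterns)), function(r) {
  # Sum the rank-one matrices that are "on" in pattern r:
  Reduce(`+`, lapply(1:k, function(m) patterns[r, m] * tau2[m] * tcrossprod(L[, m])))
})
length(Sigma) # 2^k covariance matrices, one per pattern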
mashr (see here) similarly models \(X\) as a mixture of multivariate normals, so it makes sense to attempt to use flashr (which is, on its face, much simpler than mashr) to similar ends.
In addition to a set of covariance matrices that are fit to the data, mashr includes a set of "canonical" covariance matrices. For example, it is reasonable to expect that some effects will be unique to a single condition \(i\). For the corresponding columns of \(X\), the covariance matrix has a single non-zero entry (namely, the \(i\)th diagonal entry, \(\Sigma_{ii}\)).
One can accommodate such effects in flashr by adding \(n\) fixed one-hot vectors as loadings (one for each condition). In other words, we can add "canonical" covariance structures corresponding to effects that are unique to a single condition by fitting the model \[ X = \begin{pmatrix} L & I_n \end{pmatrix} F'. \]
By the same logic as above, this model should be able to accommodate conical combinations of the matrices \(e_1 e_1', \ldots, e_n e_n'\) (where \(e_i\) is the \(i\)th canonical basis vector). In particular, it should be able to model effects that are independent across conditions, or unique to a few conditions, as well as effects that are unique to a single condition.
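A minimal sketch of the augmented loadings matrix (again with a made-up \(L\)):
# Illustrative only: L is a made-up loadings matrix.
set.seed(1)
n <- 4
k <- 2
L <- matrix(rnorm(n * k), nrow = n, ncol = k)
L_aug <- cbind(L, diag(n)) # append the n one-hot "canonical" loadings
L_aug
# Any conical combination of the matrices e_i e_i' is diagonal, so the extra
# loadings can absorb independent or condition-specific effects.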
Now the full model is \[ Y = \begin{pmatrix} L & I_n \end{pmatrix} F' + E. \]
Notice that this approach only makes sense if the standard errors for \(E\) are considered fixed. Writing \[ F' = \begin{pmatrix} F_1' \\ F_2' \end{pmatrix} \] gives \[ Y = LF_1' + F_2' + E, \] so that, for example, putting independent \(N(0, 1)\) priors on the elements of \(F_2\) and \(E\) is equivalent to putting a \(\delta_0\) prior on the elements of \(F_2\) and independent \(N(0, 2)\) priors on the residuals.
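This equivalence is easy to check numerically: the sum of two independent \(N(0, 1)\) draws behaves like a single \(N(0, 2)\) draw.
set.seed(1)
x <- rnorm(1e5) + rnorm(1e5) # an element of F_2 plus a residual, both N(0, 1)
var(x) # close to 2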
To fix the standard errors in flashr, one must pass the data to flash_set_data, using the parameter S to specify the standard errors. Further, calls to flash (and to flash_add_greedy and flash_backfit) must set var_type = "zero", which indicates that the standard errors should not be estimated but instead treated as fixed.
Finally, the recommended procedure is as follows:
1. Create a flash data object, using the parameter S to specify the standard errors.
2. Greedily add the "data-driven" loadings to the flash fit object, setting var_type = "zero" to indicate that the standard errors should be considered fixed.
3. Add \(n\) fixed one-hot loading vectors to the flash fit object (one per condition).
4. Refine the flash fit object via backfitting, again with var_type = "zero". The option nullcheck = FALSE should also be used so that the canonical covariance structures are retained.
The first code chunk simulates a \(5 \times 400\) data matrix in which a quarter of the columns \(X_{:, j}\) are null across all conditions, a quarter are nonnull in the first condition only, a quarter are nonnull in all conditions with effect sizes that are independent across conditions, and a quarter are nonnull in all conditions with an effect size that is identical across conditions.
set.seed(1)
n <- 5    # number of conditions
p <- 400  # number of samples
ncond <- n
nsamp <- as.integer(p / 4)
# Null effects:
X.zero <- matrix(0, nrow = ncond, ncol = nsamp)
# Effects that occur only in condition 1:
X.one <- X.zero
b2 <- 5 * rnorm(nsamp)
X.one[1, ] <- b2
# Effects that are independent across conditions:
X.ind <- matrix(5 * rnorm(ncond * nsamp), nrow = ncond, ncol = nsamp)
# Effects that are identical across conditions:
b <- 5 * rnorm(nsamp)
X.ident <- matrix(rep(b, ncond), nrow = ncond, ncol = nsamp, byrow = TRUE)
X <- cbind(X.zero, X.one, X.ind, X.ident)
E <- matrix(rnorm(4 * ncond * nsamp), nrow = ncond, ncol = 4 * nsamp)
Y <- X + E
The next code chunk fits a flash object using the procedure described in the previous section.
library(flashr)
# Set up the flash data object, fixing the standard errors at S = 1:
data <- flash_set_data(Y, S = 1)
# Greedily add up to 10 data-driven loadings, keeping the standard errors fixed:
fl_greedy <- flash_add_greedy(data, Kmax = 10, var_type = "zero")
# Add the n fixed one-hot loading vectors, one per condition:
I_n <- diag(rep(1, n))
fl_fixedl <- flash_add_fixed_l(data, I_n, fl_greedy)
# Backfit with fixed standard errors; skip the nullcheck so that the
# canonical loadings are retained:
fl_backfit <- flash_backfit(data, fl_fixedl, var_type = "zero", nullcheck = FALSE)
Finally, the following code chunk calculates the mean squared error of the posterior means from the flash fit, relative to the mean squared error that would be obtained by simply estimating \(X\) with the data matrix \(Y\).
baseline_mse <- sum((Y - X)^2)/(n * p)
fl_preds <- flash_get_lf(fl_backfit)
flash_mse <- sum((fl_preds - X)^2)/(n * p)
flash_mse / baseline_mse
[1] 0.5224411
To verify that flashr has in fact learned something about how the data covary, one can collapse the data into a vector and use ashr to fit a prior that is a univariate mixture of normals.
ash_fit <- ashr::ash(betahat = as.vector(Y), sebetahat = 1)
ash_pm <- ashr::get_pm(ash_fit)
ash_preds <- matrix(ash_pm, nrow=n, ncol=p)
ash_mse <- sum((ash_preds - X)^2)/(n * p)
ash_mse / baseline_mse
[1] 0.7572497
So, the flashr estimates are much better, even though flashr uses point-normal priors rather than the more flexible class of normal mixtures used by ashr.
For this particular simulation, mashr outperforms flashr in terms of MSE:
library(mashr)
# mashr expects effects in rows and conditions in columns, so transpose Y:
data <- mash_set_data(t(Y))
# Canonical covariance matrices:
U.c <- cov_canonical(data)
# Data-driven covariance matrices, initialized via PCA and refined with
# extreme deconvolution on the "strong" effects:
m.1by1 <- mash_1by1(data)
strong <- get_significant_results(m.1by1, 0.05)
U.pca <- cov_pca(data, 5, strong)
U.ed <- cov_ed(data, U.pca, strong)
# Fit the mash model with both sets of covariance matrices:
U <- c(U.c, U.ed)
m <- mash(data, U)
mash_mse <- sum((t(get_pm(m)) - X)^2)/(n * p)
mash_mse / baseline_mse
[1] 0.4161114
This is as expected, since the data were after all generated from the mash model. More generally, however, the two methods often perform quite similarly. See the simulation study here and a comparison using GTEx data here.
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mashr_0.2-7 ashr_2.2-7 flashr_0.5-8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 pillar_1.2.1
[3] plyr_1.8.4 compiler_3.4.3
[5] git2r_0.21.0 workflowr_1.0.1
[7] R.methodsS3_1.7.1 R.utils_2.6.0
[9] iterators_1.0.9 tools_3.4.3
[11] testthat_2.0.0 digest_0.6.15
[13] tibble_1.4.2 evaluate_0.10.1
[15] memoise_1.1.0 gtable_0.2.0
[17] lattice_0.20-35 rlang_0.2.0
[19] Matrix_1.2-12 foreach_1.4.4
[21] commonmark_1.4 yaml_2.1.17
[23] parallel_3.4.3 mvtnorm_1.0-7
[25] ebnm_0.1-11 withr_2.1.1.9000
[27] stringr_1.3.0 roxygen2_6.0.1.9000
[29] xml2_1.2.0 knitr_1.20
[31] REBayes_1.2 devtools_1.13.4
[33] rprojroot_1.3-2 grid_3.4.3
[35] R6_2.2.2 rmarkdown_1.8
[37] rmeta_3.0 ggplot2_2.2.1
[39] magrittr_1.5 whisker_0.3-2
[41] backports_1.1.2 scales_0.5.0
[43] codetools_0.2-15 htmltools_0.3.6
[45] MASS_7.3-48 assertthat_0.2.0
[47] softImpute_1.4 colorspace_1.3-2
[49] stringi_1.1.6 Rmosek_7.1.3
[51] lazyeval_0.2.1 munsell_0.4.3
[53] doParallel_1.0.11 pscl_1.5.2
[55] truncnorm_1.0-8 SQUAREM_2017.10-1
[57] ExtremeDeconvolution_1.3 R.oo_1.21.0
This reproducible R Markdown analysis was created with workflowr 1.0.1