Last updated: 2018-08-23
| File | Version | Author | Date | Message | 
|---|---|---|---|---|
| Rmd | 9778769 | Jason Willwerscheid | 2018-08-23 | wflow_publish(“analysis/arbitraryV.Rmd”) | 
Here I examine whether it is possible to fit a FLASH model with an arbitrary error covariance matrix using an idea suggested by Matthew Stephens.
That is, I want to fit the model \[ Y = LF' + E, \] where the columns of \(E\) are distributed i.i.d. \[ E_{\bullet j} \sim N(0, V). \]
Equivalently, letting \(\lambda_{min}\) be the smallest eigenvalue of \(V\) and letting \(W = V - \lambda_{min} I_n\) (so that, in particular, \(W\) is positive semi-definite), write \[ Y = LF' + E^{(1)} + E^{(2)}, \] with the columns of \(E^{(1)}\) distributed i.i.d. \[ E^{(1)}_{\bullet j} \sim N(0, W) \] and the elements of \(E^{(2)}\) distributed i.i.d. \[ E^{(2)}_{i j} \sim N(0, \lambda_{min}). \]
Notice that by taking the eigendecomposition of \(W\) \[ W = \sum_{k = 1}^n \lambda_k w_k w_k' \] (where \(\lambda_1, \ldots, \lambda_n\) denote the eigenvalues of \(W\), not of \(V\)) and letting \(f_1, \ldots, f_n\) be independent vectors of length \(p\) (the number of columns of \(Y\)) with \[ f_k \sim N(0, \lambda_k I_p), \] one can write \[ E^{(1)} = w_1 f_1' + \ldots + w_n f_n'. \]
Thus, one should be able to fit the desired model by adding fixed loadings \(w_1, \ldots, w_n\), by fixing the priors on the corresponding factors at \(N(0, \lambda_1), \ldots, N(0, \lambda_n)\), and by taking \(\tau = 1 / \lambda_{min}\) (with var_type = "zero").
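The decomposition can be checked numerically without FLASH. The following is a minimal base-R sketch, not part of the original analysis: the dimensions, the seed, and all variable names are chosen purely for illustration. It builds \(E^{(1)} + E^{(2)}\) from the eigendecomposition of \(W\) and compares the empirical column covariance with \(V\).

```r
# Base-R sketch (no flashr needed); all names and sizes are illustrative.
set.seed(1)
n <- 5
p <- 20000

# An arbitrary positive definite covariance matrix V.
A <- matrix(rnorm(n^2), nrow = n)
V <- A %*% t(A) + diag(n)

# Split V into W + lambda.min * I and take the eigendecomposition of W.
lambda.min <- min(eigen(V, symmetric = TRUE, only.values = TRUE)$values)
W.eigen <- eigen(V - lambda.min * diag(n), symmetric = TRUE)
# Guard against tiny negative eigenvalues caused by rounding.
lambda <- pmax(W.eigen$values, 0)
w <- W.eigen$vectors

# Row k of Fmat holds f_k', with p entries drawn i.i.d. from N(0, lambda_k).
Fmat <- matrix(rnorm(n * p), nrow = n) * sqrt(lambda)
E1 <- w %*% Fmat   # equals w_1 f_1' + ... + w_n f_n'
E2 <- matrix(rnorm(n * p, sd = sqrt(lambda.min)), nrow = n)
E <- E1 + E2

# The matrix identity V = W + lambda.min * I holds up to rounding ...
recon.err <- max(abs(w %*% diag(lambda) %*% t(w) + lambda.min * diag(n) - V))
# ... and the empirical column covariance of E should be close to V.
V.hat <- E %*% t(E) / p
cov.err <- max(abs(V.hat - V)) / max(abs(V))
print(c(recon.err = recon.err, cov.err = cov.err))
```

The first error is a rounding-level check of the algebra; the second is only a Monte Carlo check, so it shrinks as \(p\) grows but is never exactly zero.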
First I need a function that will generate random covariance matrices. I normalize the matrices so that the largest eigenvalue is equal to one. Further, I ensure that the smallest eigenvalue is bounded below by some constant. (If the covariance matrix is poorly conditioned, then the final backfit can be very slow, and in practice, we would not expect these eigenvalues to be terribly small.)
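One way to obtain both properties is to add a multiple of the identity before normalizing, and the resulting bound follows from a short calculation. As a sketch, write \(B = AA'\) for a random positive semi-definite matrix, \(\mu_1 \ge \ldots \ge \mu_n \ge 0\) for its eigenvalues, and \(c \in (0, 1)\) for the desired lower bound (these symbols are introduced here only for illustration). Adding \(d I_n\) with \(d = \mu_1 c / (1 - c)\) and dividing by \(\mu_1 + d\) gives a matrix with eigenvalues \[ \frac{\mu_k + d}{\mu_1 + d}, \qquad k = 1, \ldots, n, \] so that the largest eigenvalue is equal to one and the smallest satisfies \[ \frac{\mu_n + d}{\mu_1 + d} \ge \frac{d}{\mu_1 + d} = \frac{\mu_1 c / (1 - c)}{\mu_1 / (1 - c)} = c. \]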
rand.V <- function(n, lambda.min=0.25) {
  A <- matrix(rnorm(n^2), nrow=n, ncol=n)
  V <- A %*% t(A)
  max.eigen <- max(eigen(V, symmetric=TRUE, only.values=TRUE)$values)
  d <- max.eigen * lambda.min / (1 - lambda.min)
  # Add diagonal matrix to improve conditioning and then normalize:
  V <- (V + diag(rep(d, n))) / (max.eigen + d)
  return(V)
}
The next function simulates data from the rank-zero FLASH model \(Y = E\), with \(E_{\bullet j} \sim^{i.i.d.} N(0, V)\).
sim.E <- function(V, p) {
  n <- nrow(V)
  return(t(MASS::mvrnorm(p, rep(0, n), V)))
}
The following function fits a FLASH model using the approach outlined above.
fit.fixed.V <- function(Y, V, verbose=TRUE, backfit=FALSE, tol=1e-2) {
  n <- nrow(V)
  lambda.min <- min(eigen(V, symmetric=TRUE, only.values=TRUE)$values)
  
  data <- flash_set_data(Y, S = sqrt(lambda.min))
  
  W.eigen <- eigen(V - diag(rep(lambda.min, n)), symmetric=TRUE)
  # The rank of W is at most n - 1, so we can drop the last eigenval/vec:
  W.eigen$values <- W.eigen$values[-n]
  W.eigen$vectors <- W.eigen$vectors[, -n, drop=FALSE]
  
  fl <- flash_add_fixed_loadings(data, LL=W.eigen$vectors, init_fn="udv_svd")
  
  ebnm_param_f <- lapply(as.list(W.eigen$values), 
                         function(eigenval) {
                           list(g = list(a=1/eigenval, pi0=0), fixg = TRUE)
                         })
  ebnm_param_l <- lapply(vector("list", n - 1), 
                         function(k) {list()})
  fl <- flash_backfit(data, fl, var_type="zero", ebnm_fn="ebnm_pn",
                      ebnm_param=(list(f = ebnm_param_f, l = ebnm_param_l)),
                      nullcheck=FALSE, verbose=verbose, tol=tol)
  
  fl <- flash_add_greedy(data, Kmax=50, f_init=fl, var_type="zero",
                         init_fn="udv_svd", ebnm_fn="ebnm_pn", 
                         verbose=verbose, tol=tol)
  
  if (backfit) {
    n.added <- flash_get_k(fl) - (n - 1)
    
    ebnm_param_f <- c(ebnm_param_f, 
                      lapply(vector("list", n.added), 
                             function(k) {list(warmstart=TRUE)}))
    ebnm_param_l <- c(ebnm_param_l, 
                      lapply(vector("list", n.added), 
                             function(k) {list(warmstart=TRUE)}))
    fl <- flash_backfit(data, fl, var_type="zero", ebnm_fn="ebnm_pn",
                      ebnm_param=(list(f = ebnm_param_f, l = ebnm_param_l)),
                      nullcheck=FALSE, verbose=verbose, tol=tol)
  }
  
  return(fl)
}
devtools::load_all("/Users/willwerscheid/GitHub/flashr/")
Loading flashr
devtools::load_all("/Users/willwerscheid/GitHub/ebnm/")
Loading ebnm
n <- 20
p <- 500
set.seed(666)
V = rand.V(n=n)
Y <- sim.E(V, p=p)
fl <- fit.fixed.V(Y, V)
Backfitting 19 factor/loading(s) (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1       -9917.69        Inf
          2       -9917.69   0.00e+00
Fitting factor/loading 20 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1       -9948.16        Inf
          2       -9932.50   1.57e+01
          3       -9917.69   1.48e+01
          4       -9917.69   0.00e+00
Performing nullcheck...
  Deleting factor 20 increases objective by 4.66e-03. Factor zeroed out.
  Nullcheck complete. Objective: -9917.69
Here, after backfitting the fixed loadings corresponding to the eigenvectors of \(W\), FLASH (correctly) fails to find any additional structure in the data. In contrast, fitting FLASH without paying attention to the fact that \(V \ne I\) gives misleading results:
bad.fl <- flash_add_greedy(Y, Kmax=50)
Fitting factor/loading 1 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10046.63        Inf
          2      -10034.53   1.21e+01
          3      -10033.07   1.46e+00
          4      -10032.20   8.68e-01
          5      -10031.47   7.23e-01
          6      -10030.91   5.61e-01
          7      -10030.58   3.38e-01
          8      -10030.30   2.79e-01
          9      -10029.90   4.00e-01
         10      -10029.34   5.58e-01
         11      -10028.99   3.52e-01
         12      -10028.89   9.62e-02
         13      -10028.86   2.89e-02
         14      -10028.85   1.47e-02
         15      -10028.84   9.14e-03
Performing nullcheck...
  Deleting factor 1 decreases objective by 3.89e+01. Factor retained.
  Nullcheck complete. Objective: -10028.84
Fitting factor/loading 2 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10000.11        Inf
          2       -9988.22   1.19e+01
          3       -9986.88   1.33e+00
          4       -9986.23   6.52e-01
          5       -9985.81   4.21e-01
          6       -9985.49   3.16e-01
          7       -9985.23   2.58e-01
          8       -9985.01   2.21e-01
          9       -9984.82   1.93e-01
         10       -9984.65   1.68e-01
         11       -9984.51   1.47e-01
         12       -9984.37   1.32e-01
         13       -9984.25   1.23e-01
         14       -9984.13   1.21e-01
         15       -9984.01   1.20e-01
         16       -9983.89   1.16e-01
         17       -9983.79   1.00e-01
         18       -9983.72   7.57e-02
         19       -9983.67   5.04e-02
         20       -9983.64   3.13e-02
         21       -9983.62   1.89e-02
         22       -9983.61   1.14e-02
         23       -9983.60   6.88e-03
Performing nullcheck...
  Deleting factor 2 decreases objective by 4.52e+01. Factor retained.
  Nullcheck complete. Objective: -9983.6
Fitting factor/loading 3 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1       -9976.87        Inf
          2       -9963.19   1.37e+01
          3       -9961.15   2.03e+00
          4       -9960.17   9.81e-01
          5       -9959.64   5.28e-01
          6       -9959.14   5.03e-01
          7       -9958.78   3.65e-01
          8       -9958.62   1.57e-01
          9       -9958.55   6.56e-02
         10       -9958.52   3.34e-02
         11       -9958.50   1.97e-02
         12       -9958.49   1.26e-02
         13       -9958.48   8.11e-03
Performing nullcheck...
  Deleting factor 3 decreases objective by 2.51e+01. Factor retained.
  Nullcheck complete. Objective: -9958.48
Fitting factor/loading 4 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1       -9981.09        Inf
          2       -9965.90   1.52e+01
          3       -9964.03   1.87e+00
          4       -9963.31   7.24e-01
          5       -9962.84   4.67e-01
          6       -9962.46   3.87e-01
          7       -9962.12   3.38e-01
          8       -9961.82   2.97e-01
          9       -9961.56   2.59e-01
         10       -9961.34   2.24e-01
         11       -9961.15   1.91e-01
         12       -9960.99   1.60e-01
         13       -9960.86   1.32e-01
         14       -9960.75   1.08e-01
         15       -9960.66   8.79e-02
         16       -9960.59   7.14e-02
         17       -9960.53   5.80e-02
         18       -9960.48   4.72e-02
         19       -9960.44   3.86e-02
         20       -9960.41   3.16e-02
         21       -9960.39   2.59e-02
         22       -9960.37   2.13e-02
         23       -9960.35   1.76e-02
         24       -9960.33   1.46e-02
         25       -9960.32   1.21e-02
         26       -9960.31   1.00e-02
         27       -9960.30   8.33e-03
Performing nullcheck...
  Deleting factor 4 increases objective by 1.82e+00. Factor zeroed out.
  Nullcheck complete. Objective: -9958.48
The following function simulates data from the rank-one FLASH model \(Y = \ell d f' + E\).
pi0.l and pi0.f give the expected proportion of null entries in \(\ell\) and \(f\). Since \(\ell\) and \(f\) are normalized to have length one, \(d\) measures how large the factor/loading pair is, and thus, how easy it is to find (recall that \(V\) is normalized so that its largest eigenvalue is equal to one).
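To make the role of \(d\) concrete, here is a small base-R sketch (not part of the original analysis). It uses a fixed sparsity pattern, chosen only so that the example is deterministic, rather than the random pattern described above. With \(\ell\) and \(f\) of unit length, the rank-one term \(\ell d f'\) has a single nonzero singular value, and that value is \(d\).

```r
# Illustrative only: fixed sparsity patterns stand in for random nulls.
set.seed(1)
n <- 20
p <- 500
d <- 5^2

# Every second entry of l and every fifth entry of f is nonnull.
l <- rnorm(n) * rep(c(1, 0), length.out = n)
f <- rnorm(p) * rep(c(1, 0, 0, 0, 0), length.out = p)

# Normalize both vectors to unit Euclidean length.
l <- l / sqrt(sum(l^2))
f <- f / sqrt(sum(f^2))

LF <- outer(l, f) * d

# Singular values and Frobenius norm of the rank-one term.
sv <- svd(LF)$d
frob <- sqrt(sum(LF^2))
print(round(c(first = sv[1], second = sv[2], frobenius = frob), 6))
```

Since the largest eigenvalue of \(V\) is normalized to one, \(d\) is expressed on a scale that can be compared across simulated covariance matrices.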
sim.rank1 <- function(V, p, pi0.l=0.5, pi0.f=0.8, d=5^2) {
  E <- sim.E(V, p)
  
  n <- nrow(V)
  # Nonnull entries of l are normally distributed:
  l <- rnorm(n) * rbinom(n, 1, 1 - pi0.l)
  # Nonnull entries of f are also normally distributed:
  f <- rnorm(p) * rbinom(p, 1, 1 - pi0.f)
  # Normalize l and f:
  l <- l / sqrt(sum(l^2))
  f <- f / sqrt(sum(f^2))
  
  LF <- outer(l, f) * d
  
  return(list(Y = LF + E, l = l, f = f))
}
Here, the procedure outlined above correctly finds the additional rank-one structure. Running FLASH as is, however, yields structure of higher rank:
set.seed(999)
V = rand.V(n=n)
data <- sim.rank1(V, p=p)
fl <- fit.fixed.V(data$Y, V)
Backfitting 19 factor/loading(s) (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10626.81        Inf
          2      -10626.81   0.00e+00
Fitting factor/loading 20 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10296.26        Inf
          2      -10294.19   2.07e+00
          3      -10294.15   4.51e-02
          4      -10294.14   1.80e-03
Performing nullcheck...
  Deleting factor 20 decreases objective by 3.33e+02. Factor retained.
  Nullcheck complete. Objective: -10294.14
Fitting factor/loading 21 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10326.73        Inf
          2      -10299.97   2.68e+01
          3      -10294.15   5.82e+00
          4      -10294.15   0.00e+00
Performing nullcheck...
  Deleting factor 21 increases objective by 2.53e-03. Factor zeroed out.
  Nullcheck complete. Objective: -10294.14
bad.fl <- flash_add_greedy(data$Y, Kmax=50)
Fitting factor/loading 1 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10290.91        Inf
          2      -10276.06   1.48e+01
          3      -10274.56   1.49e+00
          4      -10274.09   4.75e-01
          5      -10273.93   1.55e-01
          6      -10273.88   4.93e-02
          7      -10273.87   1.58e-02
          8      -10273.86   5.21e-03
Performing nullcheck...
  Deleting factor 1 decreases objective by 1.71e+02. Factor retained.
  Nullcheck complete. Objective: -10273.86
Fitting factor/loading 2 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10229.69        Inf
          2      -10218.30   1.14e+01
          3      -10217.19   1.11e+00
          4      -10216.49   6.96e-01
          5      -10215.86   6.26e-01
          6      -10215.26   6.00e-01
          7      -10214.68   5.81e-01
          8      -10214.12   5.61e-01
          9      -10213.59   5.35e-01
         10      -10213.08   5.03e-01
         11      -10212.62   4.66e-01
         12      -10212.19   4.26e-01
         13      -10211.81   3.84e-01
         14      -10211.47   3.41e-01
         15      -10211.17   3.01e-01
         16      -10210.90   2.63e-01
         17      -10210.68   2.28e-01
         18      -10210.48   1.97e-01
         19      -10210.31   1.69e-01
         20      -10210.16   1.46e-01
         21      -10210.04   1.25e-01
         22      -10209.93   1.08e-01
         23      -10209.84   9.37e-02
         24      -10209.75   8.13e-02
         25      -10209.68   7.09e-02
         26      -10209.62   6.21e-02
         27      -10209.57   5.47e-02
         28      -10209.52   4.84e-02
         29      -10209.48   4.30e-02
         30      -10209.44   3.83e-02
         31      -10209.40   3.43e-02
         32      -10209.37   3.09e-02
         33      -10209.34   2.78e-02
         34      -10209.32   2.51e-02
         35      -10209.30   2.28e-02
         36      -10209.28   2.06e-02
         37      -10209.26   1.87e-02
         38      -10209.24   1.70e-02
         39      -10209.22   1.55e-02
         40      -10209.21   1.41e-02
         41      -10209.20   1.28e-02
         42      -10209.19   1.16e-02
         43      -10209.18   1.06e-02
         44      -10209.17   9.59e-03
Performing nullcheck...
  Deleting factor 2 decreases objective by 6.47e+01. Factor retained.
  Nullcheck complete. Objective: -10209.17
Fitting factor/loading 3 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10173.09        Inf
          2      -10161.29   1.18e+01
          3      -10160.15   1.15e+00
          4      -10159.47   6.78e-01
          5      -10158.92   5.47e-01
          6      -10158.46   4.61e-01
          7      -10158.07   3.89e-01
          8      -10157.75   3.25e-01
          9      -10157.48   2.68e-01
         10      -10157.26   2.17e-01
         11      -10157.09   1.75e-01
         12      -10156.95   1.39e-01
         13      -10156.84   1.10e-01
         14      -10156.75   8.74e-02
         15      -10156.68   6.92e-02
         16      -10156.63   5.47e-02
         17      -10156.58   4.32e-02
         18      -10156.55   3.50e-02
         19      -10156.52   2.99e-02
         20      -10156.49   2.66e-02
         21      -10156.47   2.43e-02
         22      -10156.44   2.25e-02
         23      -10156.42   2.08e-02
         24      -10156.40   1.89e-02
         25      -10156.39   1.68e-02
         26      -10156.37   1.43e-02
         27      -10156.36   1.17e-02
         28      -10156.35   9.28e-03
Performing nullcheck...
  Deleting factor 3 decreases objective by 5.28e+01. Factor retained.
  Nullcheck complete. Objective: -10156.35
Fitting factor/loading 4 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10146.50        Inf
          2      -10134.17   1.23e+01
          3      -10133.44   7.32e-01
          4      -10133.29   1.47e-01
          5      -10133.24   5.09e-02
          6      -10133.22   2.32e-02
          7      -10133.21   1.20e-02
          8      -10133.20   6.59e-03
Performing nullcheck...
  Deleting factor 4 decreases objective by 2.31e+01. Factor retained.
  Nullcheck complete. Objective: -10133.2
Fitting factor/loading 5 (stop when difference in obj. is < 1.00e-02):
  Iteration      Objective   Obj Diff
          1      -10156.31        Inf
          2      -10142.29   1.40e+01
          3      -10141.41   8.88e-01
          4      -10141.26   1.48e-01
          5      -10141.22   4.25e-02
          6      -10141.20   1.89e-02
          7      -10141.19   1.09e-02
          8      -10141.18   7.15e-03
Performing nullcheck...
  Deleting factor 5 increases objective by 7.98e+00. Factor zeroed out.
  Nullcheck complete. Objective: -10133.2
To check that the new approach gives reasonable results, one can calculate the angle between the estimated \(l\) and the true \(l\) (and likewise for \(f\)):
ldf <- flash_get_ldf(fl, drop_zero_factors=FALSE)
l.angle <- acos(abs(sum(ldf$l[, n] * data$l)))
f.angle <- acos(abs(sum(ldf$f[, n] * data$f)))
round(c(l.angle, f.angle), digits=2)
[1] 0.43 0.37
These results are not terrible, but an additional backfit can improve upon them:
fl.b <- fit.fixed.V(data$Y, V, verbose=FALSE, backfit=TRUE)
ldf <- flash_get_ldf(fl.b, drop_zero_factors=FALSE)
l.angle <- acos(abs(sum(ldf$l[, n] * data$l)))
f.angle <- acos(abs(sum(ldf$f[, n] * data$f)))
round(c(l.angle, f.angle), digits=2)
[1] 0.16 0.33
I include code below that can be used to verify that the above results are typical. Since the experiments can take a long time to run, I do not run them here.
rank0.experiment <- function(ntests, n, p, lambda.min=0.25, seeds=1:ntests) {
  est.rank <- bad.rank <- rep(NA, ntests)
  
  for (i in seq_along(seeds)) {
    # Use the supplied seed (not the loop index) so that custom seeds work:
    set.seed(seeds[i])
    V <- rand.V(n, lambda.min)
    Y <- sim.E(V, p)
    fl <- fit.fixed.V(Y, V, verbose=FALSE)
    
    k <- flash_get_k(fl)
    est.rank[i] <- k - (n - 1)
    
    bad.fl <- flash_add_greedy(Y, Kmax=50, verbose=FALSE)
    bad.rank[i] <- flash_get_nfactors(bad.fl)
  }
  
  return(list(est.rank = est.rank, bad.rank = bad.rank))
}
rank1.experiment <- function(ntests, n, p, lambda.min=0.25, d=5^2, 
                             seeds=1:ntests) {
  est.rank <- bad.rank <- rep(NA, ntests)
  l.angle <- f.angle <- rep(NA, ntests)
  
  for (i in seq_along(seeds)) {
    # Use the supplied seed (not the loop index) so that custom seeds work:
    set.seed(seeds[i])
    V <- rand.V(n, lambda.min)
    data <- sim.rank1(V, p, d=d)
    fl <- fit.fixed.V(data$Y, V, verbose=FALSE, backfit=TRUE)
    k <- flash_get_k(fl)
    est.rank[i] <- k - (n - 1)
    
    ldf <- flash_get_ldf(fl, drop_zero_factors=FALSE)
    if (est.rank[i] >= 1) {
      l.angle[i] <- acos(abs(sum(ldf$l[, n] * data$l)))
      f.angle[i] <- acos(abs(sum(ldf$f[, n] * data$f)))
    }
    
    bad.fl <- flash_add_greedy(data$Y, Kmax=50, verbose=FALSE)
    bad.rank[i] <- flash_get_nfactors(bad.fl)
  }
  return(list(est.rank = est.rank, bad.rank = bad.rank, 
              l.angle = l.angle, f.angle = f.angle))
}
I have set parameters lambda.min and d favorably for this investigation. If lambda.min is closer to 1, then errors will be more nearly independent, and the usual FLASH model will not fare so poorly. It would be worthwhile to investigate whether the approach detailed here beats the usual FLASH fit in such cases.
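The effect of lambda.min on the dependence structure can be seen directly from rand.V. The sketch below is an illustration rather than one of the proposed experiments: rand.V is repeated from above so that the block runs on its own, max.abs.cor is a hypothetical helper introduced here, and the grid of values is arbitrary. It reports the largest absolute correlation implied by \(V\) as lambda.min approaches 1.

```r
# rand.V is repeated from above so that this block is self-contained.
rand.V <- function(n, lambda.min=0.25) {
  A <- matrix(rnorm(n^2), nrow=n, ncol=n)
  V <- A %*% t(A)
  max.eigen <- max(eigen(V, symmetric=TRUE, only.values=TRUE)$values)
  d <- max.eigen * lambda.min / (1 - lambda.min)
  V <- (V + diag(rep(d, n))) / (max.eigen + d)
  return(V)
}

# Hypothetical helper: largest absolute off-diagonal correlation implied by V.
max.abs.cor <- function(V) {
  R <- cov2cor(V)
  max(abs(R[upper.tri(R)]))
}

lambda.grid <- c(0.25, 0.5, 0.75, 0.9, 0.99)
max.cor <- sapply(lambda.grid, function(lambda.min) {
  set.seed(666)  # reuse the same random matrix A for every lambda.min
  max.abs.cor(rand.V(20, lambda.min))
})
print(round(rbind(lambda.min = lambda.grid, max.cor = max.cor), 3))
```

Because the same random matrix is reused for every value of lambda.min, the only thing that changes is the size of the diagonal shift, so the correlations shrink monotonically toward zero.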
Further, I have set d to be quite large. In the above simulations, the true loading and factor are each five times larger (in terms of Euclidean length) than the largest eigenvalue of the error covariance matrix. It would be interesting to see what the detection threshold is as a function of n, p, and lambda.min.
Finally, notice that when \(\lambda_{min} = 1\), the approach detailed above is just the usual FLASH fit, so both of these proposed investigations would help to establish some continuity between the two.
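The boundary case is easy to check in base R. In this small sketch (the dimension is chosen arbitrarily for illustration), \(V = I_n\), so \(\lambda_{min} = 1\), \(W\) is the zero matrix, and every fixed factor would receive a prior with variance zero.

```r
# Illustrative check of the boundary case V = I.
n <- 4
V <- diag(n)

lambda.min <- min(eigen(V, symmetric=TRUE, only.values=TRUE)$values)
W <- V - lambda.min * diag(n)
W.values <- eigen(W, symmetric=TRUE, only.values=TRUE)$values

print(lambda.min)  # smallest eigenvalue of V
print(W.values)    # eigenvalues of W
```

Note that rand.V cannot be called with lambda.min = 1, since its diagonal shift divides by 1 - lambda.min, so the identity matrix is constructed directly here.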
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] ebnm_0.1-13   flashr_0.5-14
loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17        pillar_1.2.1        plyr_1.8.4         
 [4] compiler_3.4.3      git2r_0.21.0        workflowr_1.0.1    
 [7] R.methodsS3_1.7.1   R.utils_2.6.0       iterators_1.0.9    
[10] tools_3.4.3         testthat_2.0.0      digest_0.6.15      
[13] tibble_1.4.2        evaluate_0.10.1     memoise_1.1.0      
[16] gtable_0.2.0        lattice_0.20-35     rlang_0.2.0        
[19] Matrix_1.2-12       foreach_1.4.4       commonmark_1.4     
[22] yaml_2.1.17         parallel_3.4.3      withr_2.1.1.9000   
[25] stringr_1.3.0       roxygen2_6.0.1.9000 xml2_1.2.0         
[28] knitr_1.20          devtools_1.13.4     rprojroot_1.3-2    
[31] grid_3.4.3          R6_2.2.2            rmarkdown_1.8      
[34] ggplot2_2.2.1       ashr_2.2-10         magrittr_1.5       
[37] whisker_0.3-2       backports_1.1.2     scales_0.5.0       
[40] codetools_0.2-15    htmltools_0.3.6     MASS_7.3-48        
[43] assertthat_0.2.0    softImpute_1.4      colorspace_1.3-2   
[46] stringi_1.1.6       lazyeval_0.2.1      munsell_0.4.3      
[49] doParallel_1.0.11   pscl_1.5.2          truncnorm_1.0-8    
[52] SQUAREM_2017.10-1   R.oo_1.21.0        
This reproducible R Markdown analysis was created with workflowr 1.0.1