Last updated: 2018-11-05

workflowr checks: (Click a bullet for more information)
  • R Markdown file: up-to-date

    Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

  • Environment: empty

    Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

  • Seed: set.seed(20180714)

    The command set.seed(20180714) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

  • Session information: recorded

    Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

  • Repository version: 1eda9af

    Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

    Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
    
    Ignored files:
        Ignored:    .DS_Store
        Ignored:    .Rhistory
        Ignored:    .Rproj.user/
        Ignored:    docs/.DS_Store
        Ignored:    docs/figure/.DS_Store
    
    Untracked files:
        Untracked:  analysis/gd_notes.Rmd
    
    
    Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
Expand here to see past versions:
    File Version Author Date Message
    Rmd 1eda9af Jason Willwerscheid 2018-11-05 wflow_publish(“analysis/matrix_ops.Rmd”)


Updating \(\sigma^2\)

Let \(Y\) have zeroes where data is missing and let \(Z\) be an indicator matrix with ones where data is non-missing and zeroes where data is missing. Write

\[ \begin{aligned} R^2 = Y^2 &- 2 Y \odot (EL) (EF)' + Z \odot ((EL) (EF)')^2 \\ &+ Z \odot (EL^2) (EF^2)' - Z \odot (EL)^2 ((EF)^2)' \end{aligned} \]

var_type = constant

Let \(m\) be the number of non-missing entries in \(Y\). We can write

\[\begin{aligned} \sigma^2 = \frac{1}{m} ( \text{sum}(Y^2) &- 2 \cdot \text{sum}((EL) \odot Y (EF)) + \text{sum}((EL) \odot Z(EF))^2 \\ &+ \text{sum}((EL^2) \odot Z (EF^2)) - \text{sum}((EL)^2 \odot Z (EF)^2) ) \end{aligned}\]

The most expensive operations are the four multiplications of an \(n \times p\) matrix by a \(p \times k\) matrix. The three involving \(Z\) are very cheap if there is no missing data. The old implementation required three multiplications of a \(n \times k\) by a \(k \times p\) matrix, so there is not likely to be any speedup unless there is missing data. Much more importantly, though, the new implementation does not require the formation of any new \(n \times p\) matrices.

var_type = by_row

Now let \(m\) be an \(n\)-vector, where \(m_i\) denotes the number of non-missing entries in row \(Y_{i \bullet}\). \(\sigma^2\) is now an \(n\)-vector:

\[\begin{aligned} \sigma^2 = (1 / m) \odot (\text{rowSums}(Y^2) &- 2 \cdot \text{rowSums}((EL) \odot Y (EF)) + \text{rowSums}((EL) \odot Z(EF))^2 \\ &+ \text{rowSums}((EL^2) \odot Z (EF^2)) - \text{rowSums}((EL)^2 \odot Z (EF)^2)) \end{aligned}\]

var_type = by_column

With \(m\) a \(p\)-vector, where \(m_j\) denotes the number of non-missing entries in column \(Y_{\bullet j}\):

\[\begin{aligned} \sigma^2 = (1 / m) \odot (\text{colSums}(Y^2) &- 2 \cdot \text{rowSums}(t(EF) \odot (t(EL))Y) + \text{rowSums}(t(EF) \odot (t(EL))Z)^2 \\ &+ \text{rowSums}(t(EF^2) \odot (t(EL^2))Z)) - \text{rowSums}(t(EF)^2 \odot t(EL)^2 Z)) \end{aligned}\]

(I write it in this form to avoid transposing \(Y\).)

Greedy updates

In the course of optimizing a single factor/loading pair, one can exploit the fact that only one column in each of \(EL\), \(EF\), \(EL^2\), and \(EF^2\) change.

Updating loadings

Instead of updating loading 1, then factor 1, then loading 2, etc., one can much more efficiently update all loadings (potentially in parallel), then all factors.

Calculating the EBNM s2s

The old algorithm calculates s2 = 1/(tau %*% f$EF2[, k]), where the entries of tau corresponding to missing data have been set to zero.

Let \(S\) be the matrix of s2s, with the \(i\)th column giving the s2s for the \(i\)th loading. Then for var_type = constant or var_type = by_row, \(S\) may be calculated as \[ S = \sigma^2 / Z(EF^2), \] where “\(/\)” is elementwise for var_type = by_row. Again the calculation is simplified if there is no missing data: here, \(Z(EF^2)\) is simply a matrix where each row is the \(k\)-vector \(\text{colSums}(EF^2)\).

For var_type = by_column, \[ S = 1 / Z(\tau \odot_b EF^2), \] where \(\odot_b\) denotes that broadcasting is to be performed (in the sense that R uses it).

This is essentially the same as before, but tau does not need to be stored as a matrix.

Calculating the EBNM xs

The old algorithm calculates x = ((Rk * tau) %*% f$EF[, k]) * s2, which requires storing and updating Rk. For var_type = constant and var_type = by_row, one may write:

\[ X_{\bullet k} = \tau \odot (Y - \sum_{i: i \ne k} Z \odot (EL)_{\bullet i}(EF)_{\bullet i}') (EF)_{\bullet k} \odot S_{\bullet k}. \]

Note that \((Z \odot (EL)_{\bullet i} (EF)_{\bullet i}') (EF)_{\bullet k}\) can be written as \[ (EL)_{\bullet i} \odot Z ((EF)_{\bullet i} \odot (EF)_{\bullet k}) \] so \[\left(\sum_{i: i \ne k} Z \odot (EL)_{\bullet i}(EF)_{\bullet i}'\right) (EF)_{\bullet k} = \text{rowSums} \left((EL)_{\bullet -k} \odot Z((EF)_{\bullet k} \odot_b (EF)_{\bullet -k}) \right).\]

Again, the performance savings are not likely to be substantial unless there is no missing data, but this method does not require the formation of any new \(n \times p\) matrices.

For var_type = by_column, write \[ X_{\bullet k} = (Y - \sum_{i: i \ne k} Z \odot (EL)_{\bullet i}(EF)_{\bullet i}') (\tau \odot (EF)_{\bullet k}) \odot S_{\bullet k}. \]

An idea for parallelizing the EBNM problems

The above could also be implemented as follows:

  1. Calculate the \(n \times k\) matrix \(W = (Y - Z \odot (EL) (EF)')(EF)\). When there is missing data, this does require the temporary formation of a new \(n \times p\) matrix, but it does not need to be stored.

  2. \((Z \odot (EL)_{\bullet k} (EF)_{\bullet k}') (EF)_{\bullet k}\) can be written as \[ (EL)_{\bullet k} \odot Z ((EF)_{\bullet k}^2). \] Form the \(n \times k\) matrix \(U = (EL) \odot Z(EF)^2\) in one go.

  3. For \(k = 1\), calculate \[ X_{\bullet 1} = \tau \odot (W_{\bullet 1} + U_{\bullet 1}) \odot S_{\bullet 1}\] and solve the EBNM problem. This changes \((EL)_{\bullet k}\), so \(W\) needs to be updated:

\[\begin{aligned} W^{\text{new}} &= W^{\text{old}} + \left( Z \odot ((EL)_{\bullet k}^{\text{old}} - (EL)_{\bullet k}^{\text{new}}) (EF)_{\bullet k}' \right) (EF) \\ &= W^{\text{old}} + ((EL)_{\bullet k}^{\text{old}} - (EL)_{\bullet k}^{\text{new}}) \odot Z((EF)_{\bullet k} \odot_b (EF)), \end{aligned}\]

Then repeat for \(k = 2, 3, \ldots\)

But if one simply omits the update of of \(W\), parallelization is simple. If the updates of \((EL)\) are small, then this omission might be justified.

Updating factors

Factor updates can easily be obtained by reversing the roles of, on the one hand, \(EL\) and \(EF\) and, on the other, var_type = by_row and var_type = by_column.

Summary

With this implementation, the only \(n \times p\) matrices that ever need to be handled are \(Y\) and, if data is missing, \(Z\) (and \(Z\) can be stored as a matrix of integers). So it should be possible to fit a flash object using only slightly more memory than that required to store the data itself.

Session information

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.0.1   Rcpp_0.12.19      digest_0.6.15    
 [4] rprojroot_1.3-2   R.methodsS3_1.7.1 backports_1.1.2  
 [7] git2r_0.21.0      magrittr_1.5      evaluate_0.10.1  
[10] stringi_1.2.4     whisker_0.3-2     R.oo_1.21.0      
[13] R.utils_2.6.0     rmarkdown_1.8     tools_3.4.3      
[16] stringr_1.3.0     yaml_2.1.17       compiler_3.4.3   
[19] htmltools_0.3.6   knitr_1.20       

This reproducible R Markdown analysis was created with workflowr 1.0.1