Last updated: 2018-05-12

workflowr checks:
  • R Markdown file: up-to-date

    Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

  • Environment: empty

    Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

  • Seed: set.seed(12345)

    The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

  • Session information: recorded

    Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

  • Repository version: ddf9062

    Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

    Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
    
    Ignored files:
        Ignored:    .DS_Store
        Ignored:    .Rhistory
        Ignored:    .Rproj.user/
        Ignored:    analysis/.DS_Store
        Ignored:    analysis/BH_robustness_cache/
        Ignored:    analysis/FDR_Null_cache/
        Ignored:    analysis/FDR_null_betahat_cache/
        Ignored:    analysis/Rmosek_cache/
        Ignored:    analysis/StepDown_cache/
        Ignored:    analysis/alternative2_cache/
        Ignored:    analysis/alternative_cache/
        Ignored:    analysis/ash_gd_cache/
        Ignored:    analysis/average_cor_gtex_2_cache/
        Ignored:    analysis/average_cor_gtex_cache/
        Ignored:    analysis/brca_cache/
        Ignored:    analysis/cash_deconv_cache/
        Ignored:    analysis/cash_fdr_1_cache/
        Ignored:    analysis/cash_fdr_2_cache/
        Ignored:    analysis/cash_fdr_3_cache/
        Ignored:    analysis/cash_fdr_4_cache/
        Ignored:    analysis/cash_fdr_5_cache/
        Ignored:    analysis/cash_fdr_6_cache/
        Ignored:    analysis/cash_plots_cache/
        Ignored:    analysis/cash_sim_1_cache/
        Ignored:    analysis/cash_sim_2_cache/
        Ignored:    analysis/cash_sim_3_cache/
        Ignored:    analysis/cash_sim_4_cache/
        Ignored:    analysis/cash_sim_5_cache/
        Ignored:    analysis/cash_sim_6_cache/
        Ignored:    analysis/cash_sim_7_cache/
        Ignored:    analysis/correlated_z_2_cache/
        Ignored:    analysis/correlated_z_3_cache/
        Ignored:    analysis/correlated_z_cache/
        Ignored:    analysis/create_null_cache/
        Ignored:    analysis/cutoff_null_cache/
        Ignored:    analysis/design_matrix_2_cache/
        Ignored:    analysis/design_matrix_cache/
        Ignored:    analysis/diagnostic_ash_cache/
        Ignored:    analysis/diagnostic_correlated_z_2_cache/
        Ignored:    analysis/diagnostic_correlated_z_3_cache/
        Ignored:    analysis/diagnostic_correlated_z_cache/
        Ignored:    analysis/diagnostic_plot_2_cache/
        Ignored:    analysis/diagnostic_plot_cache/
        Ignored:    analysis/efron_leukemia_cache/
        Ignored:    analysis/fitting_normal_cache/
        Ignored:    analysis/gaussian_derivatives_2_cache/
        Ignored:    analysis/gaussian_derivatives_3_cache/
        Ignored:    analysis/gaussian_derivatives_4_cache/
        Ignored:    analysis/gaussian_derivatives_5_cache/
        Ignored:    analysis/gaussian_derivatives_cache/
        Ignored:    analysis/gd-ash_cache/
        Ignored:    analysis/gd_delta_cache/
        Ignored:    analysis/gd_lik_2_cache/
        Ignored:    analysis/gd_lik_cache/
        Ignored:    analysis/gd_w_cache/
        Ignored:    analysis/knockoff_10_cache/
        Ignored:    analysis/knockoff_2_cache/
        Ignored:    analysis/knockoff_3_cache/
        Ignored:    analysis/knockoff_4_cache/
        Ignored:    analysis/knockoff_5_cache/
        Ignored:    analysis/knockoff_6_cache/
        Ignored:    analysis/knockoff_7_cache/
        Ignored:    analysis/knockoff_8_cache/
        Ignored:    analysis/knockoff_9_cache/
        Ignored:    analysis/knockoff_cache/
        Ignored:    analysis/knockoff_var_cache/
        Ignored:    analysis/marginal_z_alternative_cache/
        Ignored:    analysis/marginal_z_cache/
        Ignored:    analysis/mosek_reg_2_cache/
        Ignored:    analysis/mosek_reg_4_cache/
        Ignored:    analysis/mosek_reg_5_cache/
        Ignored:    analysis/mosek_reg_6_cache/
        Ignored:    analysis/mosek_reg_cache/
        Ignored:    analysis/pihat0_null_cache/
        Ignored:    analysis/plot_diagnostic_cache/
        Ignored:    analysis/poster_obayes17_cache/
        Ignored:    analysis/real_data_simulation_2_cache/
        Ignored:    analysis/real_data_simulation_3_cache/
        Ignored:    analysis/real_data_simulation_4_cache/
        Ignored:    analysis/real_data_simulation_5_cache/
        Ignored:    analysis/real_data_simulation_cache/
        Ignored:    analysis/rmosek_primal_dual_2_cache/
        Ignored:    analysis/rmosek_primal_dual_cache/
        Ignored:    analysis/seqgendiff_cache/
        Ignored:    analysis/simulated_correlated_null_2_cache/
        Ignored:    analysis/simulated_correlated_null_3_cache/
        Ignored:    analysis/simulated_correlated_null_cache/
        Ignored:    analysis/simulation_real_se_2_cache/
        Ignored:    analysis/simulation_real_se_cache/
        Ignored:    analysis/smemo_2_cache/
        Ignored:    data/LSI/
        Ignored:    docs/.DS_Store
        Ignored:    docs/figure/.DS_Store
        Ignored:    output/fig/
    
    Unstaged changes:
        Deleted:    analysis/cash_plots_fdp.Rmd
    
    
    Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
    Past versions:
    File Version Author Date Message
    Rmd cc0ab83 Lei Sun 2018-05-11 update
    html 0f36d99 LSun 2017-12-21 Build site.
    html 853a484 LSun 2017-11-07 Build site.
    Rmd 1c0be20 LSun 2017-03-06 write-ups
    html 1c0be20 LSun 2017-03-06 write-ups
    Rmd 74d4201 LSun 2017-02-28 largest value
    html 74d4201 LSun 2017-02-28 largest value

Introduction

Up until now, truncash only uses a pre-specified threshold, that is, a threshold independent of the data. So a natural question is: what will happen if we choose a threshold that is data driven, such as the \(n^\text{th}\) most extreme observation or the top \(q\%\) quantile?
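
As a rough illustration of what such data-driven thresholds could look like, here is a minimal sketch with made-up z-scores (not code from this analysis; \(m\) is used below to avoid clashing with the total count \(n\)):

set.seed(12345)
z <- abs(rnorm(1e4))                        # stand-in |betahat/shat| z-scores
m <- 10
t_extreme <- sort(z, decreasing = TRUE)[m]  # the m-th most extreme observation
q <- 0.05
t_quantile <- quantile(z, probs = 1 - q)    # the top q% quantile
c(t_extreme = t_extreme, t_quantile = unname(t_quantile))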

For a start, Matthew raised the following idea: what if the only things we know are the most extreme observation \((\hat\beta_{(n)}, \hat s_{(n)})\) and the total number of observations \(n\)? What does this single data point tell us?

Model

Start with our usual ash model.

\[ \begin{array}{c} \hat\beta_j | \hat s_j, \beta_j \sim N(\beta_j, \hat s_j^2)\\ \beta_j \sim \sum_k\pi_k N(0, \sigma_k^2) \end{array} \] Now suppose we only observe \((\hat\beta_{(n)}, \hat s_{(n)})\), with the information that \(|\hat\beta_{(n)}/\hat s_{(n)}| \geq |\hat\beta_{j}/\hat s_{j}|\), \(j = 1, \ldots, n\). This essentially separates the \(n\) observations into two groups.

\[ \text{Group 1: }(\hat\beta_{(1)}, \hat s_{(1)}), \ldots, (\hat\beta_{(n - 1)}, \hat s_{(n - 1)}), \text{ with } |\hat\beta_j/\hat s_j| \leq t = |\hat\beta_{(n)}/\hat s_{(n)}| \] \[ \text{Group 2: }(\hat\beta_{(n)}, \hat s_{(n)}), \text{ with } |\hat\beta_{(n)}/\hat s_{(n)}| = t \] In other words, at least from the likelihood point of view, this setup should be related to truncash with the threshold \(t = |\hat\beta_{(n)}/\hat s_{(n)}|\).
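
To make the setup concrete, here is a minimal simulation sketch, with arbitrarily chosen mixture components and unit standard errors (purely illustrative, not the original analysis), that draws data from the ash model and then keeps only the most extreme observation and \(n\):

set.seed(12345)
n <- 1000
pi_k <- c(0.8, 0.2)                    # assumed mixture proportions
sigma_k <- c(0, 2)                     # point mass at zero plus N(0, 4)
k <- sample(seq_along(pi_k), n, replace = TRUE, prob = pi_k)
beta <- rnorm(n, mean = 0, sd = sigma_k[k])
shat <- rep(1, n)                      # known standard errors, all 1 for simplicity
betahat <- rnorm(n, mean = beta, sd = shat)

z <- abs(betahat / shat)
i_max <- which.max(z)
t <- z[i_max]                          # threshold set by the most extreme observation
group2 <- c(betahat = betahat[i_max], shat = shat[i_max])  # the only observed pair
n_group1 <- sum(z[-i_max] <= t)        # the remaining n - 1 observations, all below t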

Back-of-the-envelope calculation

Suppose \(X_1 \sim F_1, X_2 \sim F_2, \ldots, X_n \sim F_n\), where \(F_i\) is the cdf of the random variable \(X_i\) and \(f_i\) its pdf. In ash’s setting, we can think of \(X_i = |\hat\beta_i/ \hat s_i|\), so that \(f_i\) is the convolution of a common unimodal distribution \(g\) (to be estimated) and the idiosyncratic likelihood of \(|\hat\beta_i / \hat s_i|\) given \(\hat s_i\) (usually related to the normal or Student’s \(t\), but it could be generalized to others). Let \(X_{(n)} := \max\{X_1, X_2, \ldots, X_n\}\) be the extreme value of these \(n\) random variables.

\[ \begin{array}{rl} & P(X_{(n)} \leq t) = \prod_{i = 1}^n F_i(t) \\ \Rightarrow & p_{X_{(n)}}(t) = dP(X_{(n)} \leq t)/dt = \sum_{j = 1}^n f_j(t)\prod_{i \neq j} F_i(t) \neq \prod_{i = 1}^{n-1} F_i(t)f_n(t) \end{array} \] where \(\{1, \ldots, n-1\}\) indexes the less extreme observations and \(n\) the most extreme one. So the following two statements are not equivalent (a small numerical check follows the list).

  1. The largest value in \(\{X_1, X_2, \ldots, X_n\}\) is \(t\).
  2. We have \(n\) random variables and we observe only one of them; all the others are no greater than it.
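
A quick numerical check of the non-equivalence, under an assumed heterogeneous setting where \(X_i = |Z_i|\) with \(Z_i \sim N(0, s_i^2)\) and illustrative choices of \(s_i\) (not from the original analysis):

s <- c(1, 1.5, 2, 3)                          # hypothetical heterogeneous scales
n <- length(s)
F_i <- function(t, s) 2 * pnorm(t / s) - 1    # cdf of |N(0, s^2)|
f_i <- function(t, s) 2 * dnorm(t / s) / s    # pdf of |N(0, s^2)|

t <- 2.5
## correct density of X_(n): sum_j f_j(t) * prod_{i != j} F_i(t)
p_max <- sum(sapply(seq_len(n), function(j) {
  f_i(t, s[j]) * prod(F_i(t, s[-j]))
}))
## naive expression: prod_{i = 1}^{n-1} F_i(t) * f_n(t)
p_naive <- prod(F_i(t, s[-n])) * f_i(t, s[n])
c(p_max = p_max, p_naive = p_naive)           # the two values differ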

Special case

If we have \(F_1 = F_2 = \cdots = F_n = F\), the two statements are indeed related, because \[ \begin{array}{rl} & P(X_{(n)} \leq t) = (F(t))^n \\ \Rightarrow & p_{X_{(n)}}(t) = dP(X_{(n)} \leq t)/dt = n(F(t))^{n-1}f(t) \\ & \propto (F(t))^{n-1}f(t) \end{array} \] In other words, we can regard “knowing only the largest observation” as equivalent to “using the largest observation as the threshold in truncash.”
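
In the iid case this density can be checked directly against simulation; a minimal sketch assuming standard normal z-scores, so \(X_i = |N(0,1)|\) (illustrative only):

set.seed(12345)
n <- 100
X_max <- replicate(1e4, max(abs(rnorm(n))))        # Monte Carlo draws of X_(n)
hist(X_max, breaks = 50, freq = FALSE,
     main = "Largest |z| out of n = 100 iid N(0,1) draws", xlab = "t")
t_grid <- seq(min(X_max), max(X_max), length.out = 200)
F_abs <- function(t) 2 * pnorm(t) - 1              # cdf of |N(0,1)|
f_abs <- function(t) 2 * dnorm(t)                  # pdf of |N(0,1)|
lines(t_grid, n * F_abs(t_grid)^(n - 1) * f_abs(t_grid), col = "red", lwd = 2)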

\(F_1 = F_2 = \cdots = F_n\) in the current setting means that \(\hat\beta_j / \hat s_j\) has the same marginal distribution for every observation. This is actually not a wild assumption. For example, we always have

\[ \hat\beta_j / \hat s_j | \beta_j, s_j, \nu_j \sim t_{\nu_j}(\beta_j / s_j) \] If we further assume

\[ \beta_j / s_j \sim g \] then we arrive at the result that \(\hat\beta_j / \hat s_j\) has the same marginal distribution for every \(j\). This assumption is essentially the gold standard that everybody implicitly makes, referred to as the \(\alpha = 1\) assumption in ash.
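
A small simulation sketch of this point, simplified to known standard errors (a normal rather than \(t\) likelihood) and a hypothetical common prior \(g = N(0, 4)\), showing that two observations with very different \(s_j\) nonetheless have the same marginal distribution of \(\hat\beta_j / s_j\):

set.seed(12345)
m <- 1e5
g_draw <- function(m) rnorm(m, mean = 0, sd = 2)  # hypothetical common prior g on beta_j / s_j
s1 <- 0.5                                         # two very different standard errors
s2 <- 5
beta1 <- s1 * g_draw(m); betahat1 <- rnorm(m, mean = beta1, sd = s1)
beta2 <- s2 * g_draw(m); betahat2 <- rnorm(m, mean = beta2, sd = s2)
qqplot(betahat1 / s1, betahat2 / s2,
       xlab = "betahat_1 / s_1", ylab = "betahat_2 / s_2",
       main = "Identical marginals under the alpha = 1 assumption")
abline(0, 1, col = "red")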

Session information

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.0.1   Rcpp_0.12.16      digest_0.6.15    
 [4] rprojroot_1.3-2   R.methodsS3_1.7.1 backports_1.1.2  
 [7] git2r_0.21.0      magrittr_1.5      evaluate_0.10.1  
[10] stringi_1.1.6     whisker_0.3-2     R.oo_1.21.0      
[13] R.utils_2.6.0     rmarkdown_1.9     tools_3.4.3      
[16] stringr_1.3.0     yaml_2.1.18       compiler_3.4.3   
[19] htmltools_0.3.6   knitr_1.20       



This reproducible R Markdown analysis was created with workflowr 1.0.1