Last updated: 2018-03-22

Code version: 0e50352

The purpose of this analysis is to explore the strand specificity of my net seq data. Within genes we expect strand specific expression and at TSS we would expect some convergent transcription. I will filter the overlaping genes as done in the Meyer paper.

Mayer filtering: Pol-II-transcribed genes that do not overlap other genes within 2.5 kb of the TSS and polyadenylation site and are longer than 2 kb

Filter the genes

bedtools closest -N

Get the protein coding genes:

grep 'protein_coding' gencode.v19.annotation.bed  | awk '$8 == "gene"' |cut -f1-6  |sed 's/^chr//' >  gencode.v19.annotation.proteincodinggene.bed 

Only keep genes that are 2kb. I want genes where end (3) - start (2) is greater than 2000

awk '{if( ($3 - $2) > 2000)  print($1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6)}' gencode.v19.annotation.proteincodinggene.bed > gencode.v19.annotation.proteincodinggene.2kb.bed

Next step is to look for overlab distance with bedtools closest -N.

Session information

sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.4.2  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
 [5] tools_3.4.2     htmltools_0.3.6 yaml_2.1.16     Rcpp_0.12.15   
 [9] stringi_1.1.6   rmarkdown_1.8.5 knitr_1.18      git2r_0.21.0   
[13] stringr_1.2.0   digest_0.6.14   evaluate_0.10.1

This R Markdown site was created with workflowr