Last updated: 2018-08-21
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date 
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
 ✔ Environment: empty 
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
 ✔ Seed: 
set.seed(20180801) 
The command set.seed(20180801) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
 ✔ Session information: recorded 
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
 ✔ Repository version: 843d7d1 
wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
Untracked files:
    Untracked:  com_threeprime.Rproj
    Untracked:  data/dist_upexon/
    Untracked:  data/liftover/
    Untracked:  data/map.stats.csv
    Untracked:  data/map.stats.xlsx
    Untracked:  docs/figure/
Unstaged changes:
    Modified:   _workflowr.yml
    Deleted:    comparitive_threeprime.Rproj
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes. 
I will use this analysis to create a pipeline I can use to liftover the peaks once I get them from the human and chimp three prime seq data.
Tool to add to conda environment:
Chain file from UCSC:
/project2/gilad/briana/genome_anotation_data/comp_genomes/liftover/hg38ToPanTro5.over.chain.gz
/project2/gilad/briana/genome_anotation_data/comp_genomes/liftover/panTro5ToHg38.over.chain.gz
I want the bed files with the peaks to be in the folowing format:
chr# start end species_peakname
Resulting bed files will go in: /project2/gilad/briana/comparitive_threeprime/data/liftover
To go from the peak bed file created in the callPeaksbySpecies analysis I need to cut the file to the first four columns and add the species name to the peak.
awk '{print $1 "\t" $2 "\t" $3 "\t" "human_"$4}' /project2/gilad/briana/comparitive_threeprime/human/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom_named_human.bed > /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeaks.bed
awk '{print $1 "\t" $2 "\t" $3 "\t" "chimp_"$4}' /project2/gilad/briana/comparitive_threeprime/chimp/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom_named_chimp.bed > /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.bed
Run liftOver with ‘liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed’ I want to run it both direction. I will then lift back.
LiftForward.sh
#!/bin/bash
#SBATCH --job-name=LiftForward
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=LiftForward.out
#SBATCH --error=LiftForward.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END
module load Anaconda3
source activate comp_threeprime_env
#human to chimp
liftOver /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeaks.bed /project2/gilad/briana/genome_anotation_data/comp_genomes/liftover/hg38ToPanTro5.over.chain.gz /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeakslifted.bed /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeaksunlifted.bed
 
#chimp to human 
liftOver /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.bed /project2/gilad/briana/genome_anotation_data/comp_genomes/liftover/panTro5ToHg38.over.chain.gz /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.lifted.bed /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.unlifted.bed
LiftReverse.sh
Now the lifted human peaks are on chimp cordinates and vise-versa. I will lift back over.
#!/bin/bash
#SBATCH --job-name=LiftReverse
#SBATCH --time=24:00:00
#SBATCH --output=LiftReverse.out
#SBATCH --error=LiftReverse.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END
module load Anaconda3
source activate comp_threeprime_env
#human to chimp back to human
liftOver /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeakslifted.bed /project2/gilad/briana/genome_anotation_data/comp_genomes/liftover/panTro5ToHg38.over.chain.gz /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeakslifted_reverse.bed /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeaksunlifted.reverse.bed
#chimp to human back to chimp
liftOver /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.lifted.bed /project2/gilad/briana/genome_anotation_data/comp_genomes/liftover/hg38ToPanTro5.over.chain.gz /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.lifted.reverse.bed /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.unlifted.reverse.bed
LiftFinal.sh
I now have lifted back and I want to go forward one more time to get the final list i should use.
#!/bin/bash
#SBATCH --job-name=LiftFinal
#SBATCH --time=24:00:00
#SBATCH --output=LiftFinal.out
#SBATCH --error=LiftFinal.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END
module load Anaconda3
source activate comp_threeprime_env
#human to chimp back to human - final lift to chimp
liftOver /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeakslifted_reverse.bed /project2/gilad/briana/genome_anotation_data/comp_genomes/liftover/hg38ToPanTro5.over.chain.gz /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeakslifted_reverse.finalCcords.bed /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_humanPeaksunlifted.reverse.final.bed
#chimp to human back to chimp- final lift to human
liftOver /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.lifted.reverse.bed /project2/gilad/briana/genome_anotation_data/comp_genomes/liftover/panTro5ToHg38.over.chain.gz /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.lifted.reverse.finalHcords.bed /project2/gilad/briana/comparitive_threeprime/data/liftover/filtered_chimpPeaks.unlifted.reverse.final.bed
-change min percentage
The final lifted files have 350111 peaks that lifted from chimp to human and 442100 peaks that lifter human to chimp. I next will find the intersection of these files for the final list. In order to do this I need to create files that have the coordinates in human and in chimp. I can do this using the reverse file and final file.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
    filter, lag
The following objects are masked from 'package:base':
    intersect, setdiff, setequal, union
library(workflowr)
This is workflowr version 1.1.1
Run ?workflowr for help getting started
#human rev is in human coordinates 
human_rev= read.table("../data/liftover/filtered_humanPeakslifted_reverse.bed", col.names = c("human_chr", "human_start", "human_end", "name"), stringsAsFactors = F)
#final coords are in chimp coordinates  
human_lifted=read.table("../data/liftover/filtered_humanPeakslifted_reverse.finalCcords.bed", col.names = c("chimp_chr", "chimp_start", "chimp_end", "name"), stringsAsFactors = F)
I want to join these files by the name of the peaks keeping only the peaks that are in the final lifted.
human_final=human_lifted %>% left_join(human_rev, by="name")
#chimp rev in chimp cords  
chimp_rev=read.table("../data/liftover/filtered_chimpPeaks.lifted.reverse.bed", col.names = c("chimp_chr", "chimp_start", "chimp_end", "name"), stringsAsFactors = F)
#final chimp lift is in human coords  
chimp_lifted=read.table("../data/liftover/filtered_chimpPeaks.lifted.reverse.finalHcords.bed", col.names=c( "human_chr", "human_start", "human_end", "name"),stringsAsFactors = F )
Join the files
chimp_final=chimp_lifted %>% left_join(chimp_rev, by="name")
union_peaks=human_final %>% inner_join(chimp_final, by=c("human_chr", "human_start", "human_end", "chimp_chr", "chimp_start", "chimp_end" ))
This leaves 76207
I can then seperate these and write out the bedfile. With that I can look at metrics such as how many per gene or distance to last exon.
human_ortho= union_peaks %>% select(human_chr, human_start, human_end, name.x) 
chimp_ortho= union_peaks %>% select(chimp_chr, chimp_start, chimp_end, name.y)
Write these:
write.table(human_ortho, file="../data/liftover/humanOrthoPeaks.bed", quote = F, row.names = F, col.names = F,sep="\t")
write.table(chimp_ortho, file="../data/liftover/chimpOrthoPeaks.bed", quote= F, row.names = F, col.names = F, sep="\t")
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] workflowr_1.1.1 dplyr_0.7.6    
loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18      knitr_1.20        bindr_0.1.1      
 [4] whisker_0.3-2     magrittr_1.5      tidyselect_0.2.4 
 [7] R6_2.2.2          rlang_0.2.1       stringr_1.3.1    
[10] tools_3.5.1       R.oo_1.22.0       git2r_0.23.0     
[13] htmltools_0.3.6   yaml_2.1.19       rprojroot_1.3-2  
[16] digest_0.6.15     assertthat_0.2.0  tibble_1.4.2     
[19] crayon_1.3.4      bindrcpp_0.2.2    purrr_0.2.5      
[22] R.utils_2.6.0     glue_1.3.0        evaluate_0.11    
[25] rmarkdown_1.10    stringi_1.2.4     pillar_1.3.0     
[28] compiler_3.5.1    backports_1.1.2   R.methodsS3_1.7.1
[31] pkgconfig_2.0.1  
This reproducible R Markdown analysis was created with workflowr 1.1.1