Last updated: 2018-03-02
Code version: 646ee57
The pilot project has 16 human lines of net-seq data. I have 3 lanes of sequencing for each library of 8. The first library is the same 8 samples from Net3. I can use zcat to combine the samles from seperate lanes into one file. I will call these [sample_name]_combined_Netpilot.fastq. I will use the project directory /project2/gilad/briana/Net-seq-pilot
I will update my snakefile by copying the old one as Snakefile_old.
Changes:
change genome location to /project2/gilad/briana/genome_anotation_data
Add picard tools annotation to the snakefile
compute genome coverage for the dedup and sorted bam files
My old snakefile create the reference genome for every new analysis. It would be better to build the reference once and always keep it in one place. I will now be in /project2/gilad/briana/genome_anotation_data. To do this I will change the snakefile to point to this location and change the config file to point to the correct location.
#config file
dir_genome: /project2/gilad/briana/genome_anotation_data/
#snakefile
dir_gen=config["dir_genome"]
dir_genome= dir_gen + "genome/"
Picard tools will take in a flat file and a rRna file and will assess coverage at genomic regions. Before I make the rule I need to convert the gencode.v19.annotation.gtf to a flat file.
Update net-seq environment:
#convert gtf to genepred flat
gtfToGenePred -genePredExt -ignoreGroupsWithoutExons gencode.v19.annotation.gtf gencode.v19.annotation.refFlat
Adapt rule from Seb’s Snakefile
#config file
ref_flat: /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.refFlat
ribosome_intervals: /project2/gilad/briana/genome_anotation_data/hg19.rRNA.intervals
#snakefile
ref_flat = config["ref_flat"]
ribosome_int = config["ribosome_intervals"]
#rule
rule collect_rna_metrics:
input:
bam = dir_sort + "{samples}-sort.bam"
ref_flat = ref_flat,
ribosome_int = ribosome_int
output:
rna_metrics = output + "{samples}_RNAmetrics.picard.txt"
params:
strand = "NONE",
memory = "-Xmx12G"
shell:
"picard CollectRnaSeqMetrics {params.memory} I={input.bam} O={output.rna_metrics} REF_FLAT={input.ref_flat} STRAND={params.strand} RIBOSOMAL_INTERVALS={input.ribosome_int}"
The old snakefile has a genome_cov rule for genome cov of the deduplicated files. I want to do this for the sort and dedup files. I also will change the rule so it is -d -5.
rule genome_cov_dedup:
input:
bamdedup = dir_dedup + "{samples}-sort.dedup.bam",
genome = dir_genome + ensembl_genome + ".reads"
output:
dir_cov + "{samples}-sort.dedup.cov.bed"
shell: "bedtools genomecov -d -5 -ibam {input.bamdedup} -g {input.genome} > {output}"
rule genome_cov_sort:
input:
bamsort = dir_sort + "{samples}-sort.bam"
genome = dir_genome + ensembl_genome + ".reads"
output:
dir_cov + "{samples}-sort.cov.bed"
shell: "bedtools genomecov -d -5 -ibam {input.bamsort} -g {input.genome} > {output}"
#in rule all:
expand(dir_cov + "{samples}-sort.dedup.cov.bed", samples=samples),
expand(dir_cov + "{samples}-sort.cov.bed", samples=samples),
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.2 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2
[5] tools_3.4.2 htmltools_0.3.6 yaml_2.1.16 Rcpp_0.12.15
[9] stringi_1.1.6 rmarkdown_1.8.5 knitr_1.18 git2r_0.21.0
[13] stringr_1.2.0 digest_0.6.14 evaluate_0.10.1
This R Markdown site was created with workflowr