Last updated: 2017-10-26
Code version: 7748ac5
The Snakefile and config file are included in the net_seq_pipeline folder. This directory also contains the human reference genome and the conda environment file.
# Snakemake configuration file
# Specify paths to data files
# Paths must end with forward slash
#project directory 
dir_proj: /project2/gilad/briana/net_seq_pipeline/
#directory with additional scripts:
scripts: /project2/gilad/briana/net_seq_pipeline/scripts/
# Make sure to also update path for log files in cluster.json
dir_log: log/
# Specify Ensembl release for genome sequence and annotation
# http://feb2014.archive.ensembl.org/index.html
ensembl_archive: feb2014.archive.ensembl.org
ensembl_rel: 75
ensembl_genome: GRCh37.75
Make this file executable:
chmod 700 config.yaml
#Snakefile
#
#This file will run the net-seq pipeline from fastq files including assembling reference genome
#
#To configure the paths to data files and other settings, edit
#config.yaml
#
#To configure job submission settings for the cluster, edit
#cluster.json and submit-snakemake.sh
#To run on RCC midway2 use 'bash submit-snakemake.sh'
import glob
import os
from snakemake.utils import R
#Configuration -------------------------------------  
configfile:  "config.yaml"
#Specify Ensembl release for genome sequence and annotation
ensembl_archive = config["ensembl_archive"]
ensembl_rel = config["ensembl_rel"]
ensembl_ftp = "ftp://ftp.ensembl.org/pub/release-" + \
              str(ensembl_rel) + \
              "/fasta/homo_sapiens/dna/"
ensembl_exons = "exons-ensembl-release-" + str(ensembl_rel) + ".saf"
ensembl_genome = config["ensembl_genome"]
#Paths for data (end with forward slash)
dir_proj=config["dir_proj"]
dir_data= dir_proj + "data/"
fastq_dir= dir_data + "fastq/"
assert os.path.exists(dir_proj), "Project directory does not exist"
#Directory to send log files. Needs to be created manually since it
#is not a file created by a Snakemake rule.
dir_log = config["dir_log"] 
if not os.path.isdir(dir_log):
     os.mkdir(dir_log)
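The config comments say every path must end with a forward slash, because the Snakefile builds paths by plain string concatenation. A minimal sketch of a check for that convention is below. The helper names `check_dir_settings` and `build_paths` are hypothetical and not part of the pipeline, and the example dictionary just copies the values from config.yaml.

```python
def check_dir_settings(config, keys=("dir_proj", "scripts", "dir_log")):
    """Return a list of problems found in the directory settings of a config dict."""
    problems = []
    for key in keys:
        value = config.get(key)
        if value is None:
            problems.append(key + " is missing")
        elif not str(value).endswith("/"):
            problems.append(key + " must end with a forward slash: " + str(value))
    return problems


def build_paths(config):
    """Derive the data and fastq directories by concatenation, as the Snakefile does."""
    dir_proj = config["dir_proj"]
    dir_data = dir_proj + "data/"
    fastq_dir = dir_data + "fastq/"
    return {"dir_data": dir_data, "fastq_dir": fastq_dir}


# Values copied from config.yaml; in the Snakefile they come from `config`.
example_config = {
    "dir_proj": "/project2/gilad/briana/net_seq_pipeline/",
    "scripts": "/project2/gilad/briana/net_seq_pipeline/scripts/",
    "dir_log": "log/",
}

print(check_dir_settings(example_config))
print(build_paths(example_config))
```

A missing trailing slash would otherwise silently produce a path like `net_seq_pipelinedata/`, so a check like this could run right after `configfile` is loaded.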
This is from John’s cardioQTL Snakefile:
```
rule download_genome:
    output: dir_genome + "Homo_sapiens." + ensembl_genome + \
            ".dna_sm.chromosome.{chr}.fa.gz"
    params: chr = "{chr}", build = ensembl_genome
    shell: "wget -O {output} {ensembl_ftp}Homo_sapiens.{params.build}.dna_sm.chromosome.{params.chr}.fa.gz"

rule unzip_chromosome_fasta:
    input: dir_genome + "Homo_sapiens." + ensembl_genome + \
           ".dna_sm.chromosome.{chr}.fa.gz"
    output: temp(dir_genome + "Homo_sapiens." + ensembl_genome + \
                 ".dna_sm.chromosome.{chr}.fa")
    shell: "zcat {input} > {output}"

rule subread_index:
    input: expand(dir_genome + "Homo_sapiens." + ensembl_genome + \
                  ".dna_sm.chromosome.{chr}.fa",
                  chr = chr_genes)
    output: dir_genome + ensembl_genome + ".reads"
    params: prefix = dir_genome + ensembl_genome
    shell: "subread-buildindex -o {params.prefix} {input}"
```
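To see what the download rule will actually request, here is a small sketch that assembles the per-chromosome URL the same way the Snakefile does (`ensembl_ftp` plus the file name pattern from `download_genome`). The function names and the example chromosome list are hypothetical, since `chr_genes` is not defined in the notes above. I have not checked that these URLs resolve on the Ensembl server.

```python
def ensembl_ftp_dir(release):
    """Build the Ensembl FTP directory for human DNA FASTA files, as in the Snakefile."""
    return ("ftp://ftp.ensembl.org/pub/release-" +
            str(release) +
            "/fasta/homo_sapiens/dna/")


def chromosome_fasta_url(release, build, chrom):
    """Build the URL that the download_genome rule passes to wget for one chromosome."""
    filename = "Homo_sapiens.{build}.dna_sm.chromosome.{chrom}.fa.gz".format(
        build=build, chrom=chrom)
    return ensembl_ftp_dir(release) + filename


# Release and build come from config.yaml; the chromosome list is a placeholder.
example_chromosomes = ["1", "2", "X"]
for chrom in example_chromosomes:
    print(chromosome_fasta_url(75, "GRCh37.75", chrom))
```

Printing the URLs before submitting the jobs is a cheap way to catch a mismatch between `ensembl_rel` and `ensembl_genome` in config.yaml.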
I will need to create a cluster.json file that specifies the memory for each of my rules. Come back to this.
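As a starting point, a cluster.json could be generated from a table of per-rule memory requests. The sketch below assumes the usual Snakemake cluster-config layout, with a `__default__` entry plus one entry per rule. The field names `mem` and `log_dir` are my own choice (they only matter in that the submit command must refer to the same names), and all memory values are placeholders, not measured requirements.

```python
import json


def make_cluster_config(rule_memory_mb, default_memory_mb, dir_log):
    """Build a cluster-config dictionary with a default entry and per-rule memory."""
    config = {"__default__": {"mem": default_memory_mb, "log_dir": dir_log}}
    for rule, mem in rule_memory_mb.items():
        # Each rule only overrides the fields that differ from the default.
        config[rule] = {"mem": mem}
    return config


# Placeholder memory requests in MB; the rule names come from the rules above.
example_memory_mb = {
    "download_genome": 1000,
    "unzip_chromosome_fasta": 1000,
    "subread_index": 16000,
}

cluster_config = make_cluster_config(example_memory_mb, 4000, "log/")
cluster_json = json.dumps(cluster_config, indent=2, sort_keys=True)
print(cluster_json)
```

The printed JSON could be saved as cluster.json. The log directory is passed in so that it stays in sync with `dir_log` in config.yaml, as the config comment asks.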
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
loaded via a namespace (and not attached):
 [1] compiler_3.4.2  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
 [5] tools_3.4.2     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.13   
 [9] stringi_1.1.5   rmarkdown_1.6   knitr_1.17      git2r_0.19.0   
[13] stringr_1.2.0   digest_0.6.12   evaluate_0.10.1
This R Markdown site was created with workflowr