Set up my Snake and Config files

Last updated: 2017-10-26

Code version: 7748ac5

The Snakefile and config file are in the net_seq_pipeline folder. This directory also contains the human reference genome and the conda environment file.

Set up config file:

# Snakemake configuration file

# Specify paths to data files

# Paths must end with forward slash

#project directory 
dir_proj: /project2/gilad/briana/net_seq_pipeline/

#directory with additional scripts:
scripts: /project2/gilad/briana/net_seq_pipeline/scripts/

# Make sure to also update path for log files in cluster.json
dir_log: log/

# Specify Ensembl release for genome sequence and annotation
# http://feb2014.archive.ensembl.org/index.html
ensembl_archive: feb2014.archive.ensembl.org
ensembl_rel: 75
ensembl_genome: GRCh37.75

Set the permissions so that only the owner can read, write, and execute this file:

chmod 700 config.yaml
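The config comments require every path to end with a forward slash, because the Snakefile builds longer paths by plain string concatenation. A minimal sketch of how that convention could be checked is below. The helper name `missing_trailing_slash` is hypothetical, and the dictionary simply copies the values from the config file above rather than reading config.yaml.

```python
def missing_trailing_slash(config, path_keys):
    """Return the keys whose path value does not end with a forward slash."""
    return [key for key in path_keys if not str(config[key]).endswith("/")]


# Values copied by hand from the config file above (not parsed from disk).
config = {
    "dir_proj": "/project2/gilad/briana/net_seq_pipeline/",
    "scripts": "/project2/gilad/briana/net_seq_pipeline/scripts/",
    "dir_log": "log/",
}

bad_keys = missing_trailing_slash(config, ["dir_proj", "scripts", "dir_log"])
assert not bad_keys, "Paths must end with a forward slash: " + ", ".join(bad_keys)
```

A check like this could sit near the top of the Snakefile, after `configfile:` has populated `config`.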

Set up snakefile:

#Snakefile
#
#This file runs the net-seq pipeline from fastq files, including assembling the reference genome
#
#To configure the paths to data files and other settings, edit
#config.yaml
#
#To configure job submission settings for the cluster, edit
#cluster.json and submit-snakemake.sh

#To run on RCC midway2 use 'bash submit-snakemake.sh'


import glob
import os
from snakemake.utils import R

#Configuration -------------------------------------  


configfile:  "config.yaml"


#Specify Ensembl release for genome sequence and annotation
ensembl_archive = config["ensembl_archive"]
ensembl_rel = config["ensembl_rel"]
ensembl_ftp = "ftp://ftp.ensembl.org/pub/release-" + \
              str(ensembl_rel) + \
              "/fasta/homo_sapiens/dna/"
ensembl_exons = "exons-ensembl-release-" + str(ensembl_rel) + ".saf"
ensembl_genome = config["ensembl_genome"]

#Paths for data (end with forward slash)
dir_proj=config["dir_proj"]
dir_data= dir_proj + "data/"
fastq_dir= dir_data + "fastq/"

assert os.path.exists(dir_proj), "Project directory does not exist"

#Directory to send log files. Needs to be created manually since it
#is not a file created by a Snakemake rule.
dir_log = config["dir_log"] 
if not os.path.isdir(dir_log):
    os.mkdir(dir_log)
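To see what the configuration section produces, the sketch below repeats the same string concatenation in plain Python, with the config.yaml values hard-coded. The helper `chromosome_url` is hypothetical. It mirrors the URL pattern used by the `wget` command in the `download_genome` rule further down, and chromosome 1 is only an illustration.

```python
# Values copied by hand from config.yaml above.
ensembl_rel = 75
ensembl_genome = "GRCh37.75"
dir_proj = "/project2/gilad/briana/net_seq_pipeline/"

# Same concatenation as in the Snakefile configuration section.
ensembl_ftp = "ftp://ftp.ensembl.org/pub/release-" + \
              str(ensembl_rel) + \
              "/fasta/homo_sapiens/dna/"
dir_data = dir_proj + "data/"
fastq_dir = dir_data + "fastq/"


def chromosome_url(chrom):
    """Hypothetical helper: build the download URL for one chromosome."""
    return (ensembl_ftp + "Homo_sapiens." + ensembl_genome +
            ".dna_sm.chromosome." + chrom + ".fa.gz")


print(chromosome_url("1"))
print(fastq_dir)
```

Because every piece is joined with `+`, a missing trailing slash in any config value would silently produce a wrong path, which is why the config file insists on the trailing slash.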

Rules to add to my snake file

Rules to download and index genome

This is from John’s cardioQTL Snakefile:

```
rule download_genome:
    output: dir_genome + "Homo_sapiens." + ensembl_genome + \
            ".dna_sm.chromosome.{chr}.fa.gz"
    params: chr = "{chr}", build = ensembl_genome
    shell: "wget -O {output} {ensembl_ftp}Homo_sapiens.{params.build}.dna_sm.chromosome.{params.chr}.fa.gz"

rule unzip_chromosome_fasta:
    input: dir_genome + "Homo_sapiens." + ensembl_genome + \
           ".dna_sm.chromosome.{chr}.fa.gz"
    output: temp(dir_genome + "Homo_sapiens." + ensembl_genome +
                 ".dna_sm.chromosome.{chr}.fa")
    shell: "zcat {input} > {output}"

rule subread_index:
    input: expand(dir_genome + "Homo_sapiens." + ensembl_genome +
                  ".dna_sm.chromosome.{chr}.fa",
                  chr = chr_genes)
    output: dir_genome + ensembl_genome + ".reads"
    params: prefix = dir_genome + ensembl_genome
    shell: "subread-buildindex -o {params.prefix} {input}"
```
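These rules use `dir_genome` and `chr_genes`, which the Snakefile setup shown earlier does not define. The sketch below shows one way they might be defined and which input files `subread_index` would then request. The `genome/` subdirectory and the chromosome list (1 to 22, X, Y, MT) are assumptions, not values taken from the original pipeline. The list comprehension stands in for Snakemake's `expand()` so that the example runs as plain Python.

```python
# Values copied by hand from config.yaml above.
dir_proj = "/project2/gilad/briana/net_seq_pipeline/"
ensembl_genome = "GRCh37.75"

# Assumption: genome files live in a "genome/" subdirectory of the project.
dir_genome = dir_proj + "genome/"

# Assumption: autosomes plus X, Y, and the mitochondrial chromosome.
chr_genes = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]

# Plain-Python stand-in for expand(..., chr = chr_genes) in subread_index.
fasta_files = [
    dir_genome + "Homo_sapiens." + ensembl_genome +
    ".dna_sm.chromosome." + chrom + ".fa"
    for chrom in chr_genes
]

print(len(fasta_files))
print(fasta_files[0])
```

With definitions like these in place, requesting the `.reads` index file would make Snakemake download and unzip one FASTA file per entry in `chr_genes` before running `subread-buildindex`.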

Cluster

I will need to create a cluster.json file that specifies the memory for each of my rules. Come back to this.
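As a starting point, the sketch below builds the kind of structure a Snakemake cluster configuration file usually holds: a `__default__` entry plus per-rule overrides. The key name `mem` and the memory values are hypothetical placeholders, not measured requirements. Only the rule name `subread_index` comes from the rules above.

```python
import json

# Hypothetical memory requests (in MB). "__default__" applies to any rule
# without its own entry; "subread_index" overrides it for that rule.
cluster_config = {
    "__default__": {"mem": 4000},
    "subread_index": {"mem": 16000},
}

# Text that could be saved as cluster.json.
cluster_json = json.dumps(cluster_config, indent=2)
print(cluster_json)
```

In Snakemake versions that support `--cluster-config`, these values become available as `{cluster.mem}` in the string passed to `--cluster`, so the submit script can forward them to the scheduler.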

Session information

sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.4.2  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
 [5] tools_3.4.2     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.13   
 [9] stringi_1.1.5   rmarkdown_1.6   knitr_1.17      git2r_0.19.0   
[13] stringr_1.2.0   digest_0.6.12   evaluate_0.10.1

This R Markdown site was created with workflowr