Last updated: 2017-10-26
Code version: 7748ac5
The Snakefile and config file are in the net_seq_pipeline folder. This directory also holds the human reference genome and the conda environment file.
# Snakemake configuration file
# Specify paths to data files
# Paths must end with forward slash
#project directory
dir_proj: /project2/gilad/briana/net_seq_pipeline/
#directory with additional scripts:
scripts: /project2/gilad/briana/net_seq_pipeline/scripts/
# Make sure to also update path for log files in cluster.json
dir_log: log/
# Specify Ensembl release for genome sequence and annotation
# http://feb2014.archive.ensembl.org/index.html
ensembl_archive: feb2014.archive.ensembl.org
ensembl_rel: 75
ensembl_genome: GRCh37.75
Make this file executable:
chmod 700 config.yaml
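The config comments require every directory path to end with a forward slash, because the Snakefile builds paths by plain string concatenation. A minimal sketch of a check for that convention follows; the helper name `missing_trailing_slash` and the choice of keys are illustrative assumptions, not part of the pipeline.

```python
# Hypothetical helper: report config keys whose path lacks a trailing slash.
def missing_trailing_slash(config, keys=("dir_proj", "scripts", "dir_log")):
    """Return the keys in `keys` whose value does not end with '/'."""
    return [k for k in keys if not str(config.get(k, "")).endswith("/")]


# Values copied from the config.yaml listing above.
config = {
    "dir_proj": "/project2/gilad/briana/net_seq_pipeline/",
    "scripts": "/project2/gilad/briana/net_seq_pipeline/scripts/",
    "dir_log": "log/",
}

# An empty list means every checked path ends with a slash.
print(missing_trailing_slash(config))
```

A missing slash would otherwise surface later as a confusing path such as `.../net_seq_pipelinedata/`, so checking early may save some debugging.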
#Snakefile
#
#This file runs the net-seq pipeline from fastq files, including assembling the reference genome
#
#To configure the paths to data files and other settings, edit
#config.yaml
#
#to configure job submission settings for cluster, edit
#cluster.json and submit-snakemake.sh
#to run on RCC midway2 use 'bash submit-snakemake.sh'
import glob
import os
from snakemake.utils import R
#Configuration -------------------------------------
configfile: "config.yaml"
#Specify Ensembl release for genome sequence and annotation
ensembl_archive = config["ensembl_archive"]
ensembl_rel = config["ensembl_rel"]
ensembl_ftp = "ftp://ftp.ensembl.org/pub/release-" + \
str(ensembl_rel) + \
"/fasta/homo_sapiens/dna/"
ensembl_exons = "exons-ensembl-release-" + str(ensembl_rel) + ".saf"
ensembl_genome = config["ensembl_genome"]
#Paths for data (end with forward slash)
dir_proj=config["dir_proj"]
dir_data= dir_proj + "data/"
fastq_dir= dir_data + "fastq/"
assert os.path.exists(dir_proj), "Project directory does not exist: " + dir_proj
#Directory to send log files. Needs to be created manually since it
#is not a file created by a Snakemake rule.
dir_log = config["dir_log"]
if not os.path.isdir(dir_log):
    os.mkdir(dir_log)
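The check-then-create step for the log directory can also be written as a single call. This is a sketch of an alternative, not the pipeline's own code: `os.makedirs` with `exist_ok=True` creates the directory, including any missing parents, and does nothing if it already exists. The function name `ensure_dir` is hypothetical, and a temporary directory stands in for the project directory.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (and any missing parents); do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrate inside a throwaway directory instead of the real project.
with tempfile.TemporaryDirectory() as tmp:
    dir_log = os.path.join(tmp, "log/")
    ensure_dir(dir_log)
    ensure_dir(dir_log)  # second call is a no-op rather than an error
    print(os.path.isdir(dir_log))
```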
This is from John’s cardioQTL Snakefile:
```
rule download_genome:
    output: dir_genome + "Homo_sapiens." + ensembl_genome + \
            ".dna_sm.chromosome.{chr}.fa.gz"
    params: chr = "{chr}", build = ensembl_genome
    shell: "wget -O {output} {ensembl_ftp}Homo_sapiens.{params.build}.dna_sm.chromosome.{params.chr}.fa.gz"

rule unzip_chromosome_fasta:
    input: dir_genome + "Homo_sapiens." + ensembl_genome + \
           ".dna_sm.chromosome.{chr}.fa.gz"
    output: temp(dir_genome + "Homo_sapiens." + ensembl_genome +
                 ".dna_sm.chromosome.{chr}.fa")
    shell: "zcat {input} > {output}"

rule subread_index:
    input: expand(dir_genome + "Homo_sapiens." + ensembl_genome +
                  ".dna_sm.chromosome.{chr}.fa",
                  chr = chr_genes)
    output: dir_genome + ensembl_genome + ".reads"
    params: prefix = dir_genome + ensembl_genome
    shell: "subread-buildindex -o {params.prefix} {input}"
```
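To see which file the `download_genome` rule requests for a given chromosome, the same string concatenation can be reproduced in plain Python. This is a sketch only: the function name `chromosome_url` is hypothetical, the release and build values are copied from config.yaml, and the two chromosome names are example inputs because the listing above does not show how `chr_genes` is defined.

```python
# Values copied from config.yaml.
ENSEMBL_REL = 75
ENSEMBL_GENOME = "GRCh37.75"


def chromosome_url(chrom, rel=ENSEMBL_REL, build=ENSEMBL_GENOME):
    """Mirror the concatenation used by ensembl_ftp and download_genome."""
    ensembl_ftp = ("ftp://ftp.ensembl.org/pub/release-" + str(rel) +
                   "/fasta/homo_sapiens/dna/")
    return (ensembl_ftp + "Homo_sapiens." + build +
            ".dna_sm.chromosome." + str(chrom) + ".fa.gz")


# Example chromosome names (assumed; chr_genes is not shown above).
for chrom in ["1", "X"]:
    print(chromosome_url(chrom))
```

Printing the URL before launching the workflow is a cheap way to confirm that the release number and genome build in config.yaml agree with each other.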
I will need to create a cluster.json file that specifies the memory for each of my rules. Come back to this.
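As a starting point, a cluster.json could be generated from a small Python dictionary. Snakemake's cluster configuration file is keyed by rule name, with a `__default__` entry that applies to any rule not listed. Everything else here is an assumption: the memory values are placeholders rather than measured requirements, the field name `mem` has to match whatever `{cluster.mem}`-style placeholder the submit script uses, and `write_cluster_json` is a hypothetical helper.

```python
import json

# Placeholder memory values in MB (assumptions, not measured requirements).
cluster = {
    "__default__": {"mem": 4000},
    "subread_index": {"mem": 16000},
}


def write_cluster_json(settings, path="cluster.json"):
    """Write the per-rule settings as indented JSON."""
    with open(path, "w") as handle:
        json.dump(settings, handle, indent=2)
        handle.write("\n")


# Preview the file contents; call write_cluster_json(cluster) to save them.
print(json.dumps(cluster, indent=2))
```

Only rules that need more than the default have to be listed, so the file can stay short while the pipeline grows.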
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.2 backports_1.1.1 magrittr_1.5 rprojroot_1.2
[5] tools_3.4.2 htmltools_0.3.6 yaml_2.1.14 Rcpp_0.12.13
[9] stringi_1.1.5 rmarkdown_1.6 knitr_1.17 git2r_0.19.0
[13] stringr_1.2.0 digest_0.6.12 evaluate_0.10.1
This R Markdown site was created with workflowr