Last updated: 2017-11-27
Code version: c9cf54b
I will use this analysis to look at inital data QC and points of interests.
First I looked at the number of reads that mapp to the genome before and after deduplication UMI steps.
samtools view -c -F 4
For flag info: https://broadinstitute.github.io/picard/explain-flags.html
Mayer data: fastq: 137281933
sorted: 120124203 dedup: 2262387
dedup/sorted: 0.01883373
library= c( "18486-dep", "18508-dep", "18508-nondep", "19238-dep", "mayer")
fastq= c( 45803834, 70776230, 77223987, 113160855, 137281933)
sorted= c(17336796, 43247747, 50189574, 40420633, 17157730 )
dedup= c(1533069, 1776330,1919904,
1870359,2262387)
perc= dedup/sorted
reads_mapped_dedup= data.frame(rbind(library, fastq, sorted, dedup, perc))
reads_mapped_dedup
X1 X2 X3
library 18486-dep 18508-dep 18508-nondep
fastq 45803834 70776230 77223987
sorted 17336796 43247747 50189574
dedup 1533069 1776330 1919904
perc 0.0884286231435151 0.0410733534859053 0.0382530443474177
X4 X5
library 19238-dep mayer
fastq 113160855 137281933
sorted 40420633 17157730
dedup 1870359 2262387
perc 0.0462723827209732 0.131858177043234
total_reads= sum(fastq)
sorted/fastq
[1] 0.3785010 0.6110490 0.6499221 0.3571962 0.1249817
Make a stacked bar plot to show complexity and coverage.
library, fastq, mapped, dedup
counts= rbind(fastq, sorted, dedup)
colnames(counts)= library
count_plot=barplot(as.matrix(counts), main="Counts for coverage and complexity",
xlab="Library", col=c("lightskyblue2","dodgerblue1","navy"),
ylab="Read counts")
legend("topleft", legend = c("total", "mapped", "UMI"), col=c("lightskyblue2","dodgerblue1","navy"), pch=20, cex = .75)
percent_mapped= sorted/fastq
percent_UMI= dedup/fastq
percent_not_mapped= 1- percent_mapped - percent_UMI
prop=rbind(percent_not_mapped, percent_mapped, percent_UMI)
colnames(prop)= library
prop_plot=barplot(as.matrix(prop), main="Proportions for coverage and complexity",
xlab="Library", col=c("lightskyblue2","dodgerblue1","navy"),
ylab="Proportion of sequenced reads")
legend("bottomright", legend = c("Un-mapped", "mapped", "UMI"), col=c("lightskyblue2","dodgerblue1","navy"), pch=20, cex = 0.75)
Undetermined is nothing: it corresponds to random reads
From meeting:
Allign with star and bwa to compare
compare to http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr7%3A5568588%2D5568715&hgsid=642260271_FLEwDANY0lSWCFhW4QjbmbASDDnB
#index and prepare ref genome:
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir /scratch/midway2/brimittleman/star_genome/ --genomeFastaFiles /project2/gilad/briana/Net-seq/STAR_genome/hg19.fa --sjdbGTFfile /project2/gilad/briana/Net-seq/Homo_sapiens.GRCh37.75.chr.gtf --sjdbOverhang 43
# --sjdbOverhang read length -1
STAR --runThreadN 4 --genomeDir /scratch/midway2/brimittleman/star_genome/ --readFilesIn fastq_extr/SRR1575922_extracted.fastq --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate --outStd BAM_SortedByCoordinate > star/mayer_star_align.bam
samtools sort -o star.sort/star_mayer.sort.bam star/mayer_star_align.bam
Run this on my data as well.
#!/bin/bash
#SBATCH --job-name=star_align_mayer
#SBATCH --time=8:00:00
#SBATCH --partition=broadwl
#SBATCH --mem=50G
#SBATCH --tasks-per-node=4
module load Anaconda3
source activate net-seq
STAR --runThreadN 4 --genomeDir /scratch/midway2/brimittleman/star_genome/ --readFilesIn fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate --outStd BAM_SortedByCoordinate > star/star_18486-dep.bam
#to run: sbatch --partition=broadwl --mem=50G
Continue with the sort and index the bam.
samtools sort -o star.sort/star_18486_dep_sort.bam star/star_18486-dep.bam
samtools index star_18486_dep_sort.bam
Look at the percent mapped with star.
samples=c("18486_dep", "mayer")
fastq_star= c(45803834,137281933)
bam_star= c(1996777,1993674)
bam_star/fastq_star
[1] 0.04359410 0.01452248
Thats way too low. This didnt work.
In the log.Final file.
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 79.96%
% of reads unmapped: other | 0.00%
“–outFilterScoreMinOverLread 0 –outFilterMatchNminOverLread 0 –outFilterMatchNmin 0” Try to add these parameters on the mayer map.
This run gave too many multi-map reads. 64.82%.
Try:
“–outFilterScoreMinOverLread 0.3 –outFilterMatchNminOverLread 0.3”
% of reads mapped to too many loci | 63.75%
Test length of fastq reads:
Mayer: total 137281933 avg=70.000000 stddev=0.000000 18486_dep: total 45803834 avg=44.000000 stddev=0.000000
Try clipping last 10 bases with : “–clip3pNbases 10”
* This didnt work for the mayer data but that is long. I will try it on ours.
Our data:
Other ways to fix this:
BWA-backtrack - for Illumina seqs up to 100 bp
First step is to construt a FM-index for the reference genome.
“bwa index -a bwtsw -p /scratch/midway2/brimittleman/BWA_genome/BWA.index STAR_genome/hg19.fa”
Added bwa to envirnoment
Mapping:
bwa aln
creates the .sai index files
-n 0.01 1% missmatch allowed
-t 3 spead up by using 3 threads
bwa samse
#$1 ref fa
#$2 fastq
#$3 output sai
module load Anaconda3
source activate net-seq
module load bwa
bwa aln -t 3 -n 0.01 $1 $2 > $3
#submit
sbatch scripts/bwa_aln.sh hg19.copy.fa /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/BWA/bwa.18486.dep.sai
#SBATCH --job-name=BWA_samse
#SBATCH --time=8:00:00
#SBATCH --partition=broadwl
#SBATCH --mem=50G
#SBATCH --tasks-per-node=4
#SBATCH --mail-type=END
#$1 ref fasta
#$2 sai file
#$3 fastq file
#$4 output sam
module load Anaconda3
source activate net-seq
module load bwa
bwa samse $1 $2 $3 > $4
#run on mayer
sbatch scripts/bwa_samse.sh hg19.copy.fa /project2/gilad/briana/Net-seq/data/bwa/bwa.mayer.sai /project2/gilad/briana/Net-seq/data/fastq_extr/SRR1575922_extracted.fastq /project2/gilad/briana/Net-seq/data/bwa/bwa.mayer.sam
sbatch scripts/bwa_samse.sh hg19.copy.fa /project2/gilad/briana/Net-seq/data/bwa/bwa.mayer.cut3prime.sai /project2/gilad/briana/Net-seq/data/fastq_extr/SRR1575922_extracted.fastq /project2/gilad/briana/Net-seq/data/bwa/bwa.mayer.cut3prime.sam
#run on 18486 dep
sbatch scripts/bwa_samse.sh hg19.copy.fa /project2/gilad/briana/Net-seq/Net-seq1/data/BWA/bwa.18486.dep.sai /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/BWA/bwa.18486.dep.sam
Sam to bam:
samtools view -S -b sample.sam > sample.bam
Check how big the file is:
This is super low mapping as well. Not sure what is going on.
For poor quality on the ends- add -q 15 to the bwa aln command. I am trying this on the mayer data.
I deleted the reference genome and am reindexing and rebuilding it.
Code from Sebs snakemake will allow me to cut any read that has more than 6 As. It will then keep the read if it is longer than 25 bases long post cut. I will run this on the UMI extracted fastq files.
This script is called cut_polyA.sh and is in the /project2/gilad/briana/Net-seq/scripts directory.
#!/bin/bash
#SBATCH --job-name=cut_polyA
#SBATCH --time=8:00:00
#SBATCH --partition=broadwl
#SBATCH --mem=8G
#SBATCH --tasks-per-node=4
#SBATCH --mail-type=END
module load Anaconda3
source activate net-seq
#$1 fastq file
#$2 output cut file name
cutadapt --minimum-length 25 -a AAAAAA -o $2 $1
Run this script first on 18486 dep.
sbatch scripts/cut_polyA.sh /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/cut_polyA/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.cutPolyA.fastq
Pre-cut: 45803834
Cut: 40905492
sbatch scripts/star_allign.sh /project2/gilad/briana/Net-seq/Net-seq1/data/cut_polyA/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.cutPolyA.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/star/star_18486-dep_cutPolyA.bam
Cut: 40905492
mapped: 1350684
I am running subjunc on the polyA cut reads with the –allJunctions to map cononincal and non-connoical exon exon boundaries.
This script is called subjunc_all_junc.sh and is in the /project2/gilad/briana/Net-seq/scripts directory.
#!/bin/bash
#SBATCH --job-name=cut_polyA
#SBATCH --time=8:00:00
#SBATCH --partition=broadwl
#SBATCH --mem=20G
#SBATCH --tasks-per-node=4
#SBATCH --mail-type=END
module load Anaconda3
source activate net-seq
#$1 input extracted fast q
#$2 output bam
subjunc --allJunctions -i /project2/gilad/briana/Net-seq/Net-seq1/genome/ -r $1 -T 8 > $2
#slurm-40290339.out
sbatch scripts/subjunc_all_junc.sh /project2/gilad/briana/Net-seq/Net-seq1/data/cut_polyA/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.cutPolyA.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/subjunc_all_junc/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.cutPolyA.all.junc.bam
Before subjunc mapped 37.85. Now it mapped 38.4%.
I am also going to run this on the non polyA cut samples.
#slurm-40290637.out
sbatch scripts/subjunc_all_junc.sh /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/subjunc_all_junc/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.all.junc.bam
Before subjunc mapped 37.85. This run is 53.1%
Try for mayer:
sbatch scripts/subjunc_all_junc.sh /project2/gilad/briana/Net-seq/data/fastq_extr/SRR1575922_extracted.fastq /project2/gilad/briana/Net-seq/data/subjunc_all_junc/mayer.extracted.subjunc.all.junc.bam
#508 dep
sbatch scripts/subjunc_all_junc.sh /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18508-dep-2017-10-13_S2_R1_001_extracted.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/subjunc_all_junc/YG-SP-NET1-18508-dep-2017-10-13_S2_R1_001_extracted.all.junc.bam
508_dep
dep fastq: 70776230
mapped: 54088856
This mapped 76.4%
samtools sort -o {output} {input}
samtools index {input}
umi_tools dedup -I {input.bam} -S {output}
I downloaded HeLa S3 Rep1 and the other run for HEK293T Rep1. I ran the snake file in the mayer.data directory as Mayer_hek and Mayer_hela.
mayer_hek
reads: 358754064
mapped: 128152521
deduplication: 4392741
mayer_hela
reads: 175303176
mapped: 51362897
deduplication: 6314281
m_fastq= c(358754064,175303176)
m_sort= c(128152521 , 51362897)
m_dedup= c(4392741, 6314281 )
mayer= c("Hek", "Hela")
counts_m= rbind(m_fastq, m_sort, m_dedup)
colnames(counts_m)= mayer
count_plot_m=barplot(as.matrix(counts_m), main="Counts for coverage and complexity",
xlab="Library", col=c("lightskyblue2","dodgerblue1","navy"),
ylab="Read counts")
legend("topright", legend = c("total", "mapped", "UMI"), col=c("lightskyblue2","dodgerblue1","navy"), pch=20, cex = .75)
percent_mapped_m= m_sort/m_fastq
percent_UMI_m= m_dedup/m_fastq
percent_not_mapped_m= 1- percent_mapped_m - percent_UMI_m
prop_m=rbind(percent_not_mapped_m, percent_mapped_m, percent_UMI_m)
colnames(prop_m)= mayer
prop_plot_m=barplot(as.matrix(prop_m), main="Proportions for coverage and complexity",
xlab="Library", col=c("lightskyblue2","dodgerblue1","navy"),
ylab="Proportion of sequenced reads")
legend("bottomright", legend = c("Un-mapped", "mapped", "UMI"), col=c("lightskyblue2","dodgerblue1","navy"), pch=20, cex = 0.75)
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.2 backports_1.1.1 magrittr_1.5 rprojroot_1.2
[5] tools_3.4.2 htmltools_0.3.6 yaml_2.1.14 Rcpp_0.12.13
[9] stringi_1.1.5 rmarkdown_1.6 knitr_1.17 git2r_0.19.0
[13] stringr_1.2.0 digest_0.6.12 evaluate_0.10.1
This R Markdown site was created with workflowr