Initial Data Exploration Netseq1

Last updated: 2017-11-16

Code version: 199a255

I will use this analysis to look at inital data QC and points of interests.

First I looked at the number of reads that mapp to the genome before and after deduplication UMI steps.

samtools view -c -F 4

For flag info: https://broadinstitute.github.io/picard/explain-flags.html

Mayer data: fastq: 137281933
sorted: 120124203 dedup: 2262387
dedup/sorted: 0.01883373

library= c( "18486-dep", "18486-tot", "18508-dep", "18508-nondep", "19238-dep", "mayer")

fastq= c( 45803834, 4006, 70776230, 77223987, 113160855, 137281933)  

sorted= c(17336796, 1258, 43247747, 50189574, 40420633, 17157730 )

dedup= c(1533069, 1105, 1776330,1919904,
1870359,2262387)

perc= dedup/sorted

reads_mapped_dedup= data.frame(rbind(library, fastq, sorted, dedup, perc))

reads_mapped_dedup

                        X1                X2                 X3
library          18486-dep         18486-tot          18508-dep
fastq             45803834              4006           70776230
sorted            17336796              1258           43247747
dedup              1533069              1105            1776330
perc    0.0884286231435151 0.878378378378378 0.0410733534859053
                        X4                 X5                X6
library       18508-nondep          19238-dep             mayer
fastq             77223987          113160855         137281933
sorted            50189574           40420633          17157730
dedup              1919904            1870359           2262387
perc    0.0382530443474177 0.0462723827209732 0.131858177043234

total_reads= sum(fastq)

sorted/fastq

[1] 0.3785010 0.3140290 0.6110490 0.6499221 0.3571962 0.1249817

Undetermined is nothing: it corresponds to random reads

From meeting:

Allign with star and bwa to compare
compare to http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr7%3A5568588%2D5568715&hgsid=642260271_FLEwDANY0lSWCFhW4QjbmbASDDnB

Explore different mappers


#index and prepare ref genome:

STAR --runThreadN 2 --runMode genomeGenerate --genomeDir /scratch/midway2/brimittleman/star_genome/ --genomeFastaFiles  /project2/gilad/briana/Net-seq/STAR_genome/hg19.fa  --sjdbGTFfile /project2/gilad/briana/Net-seq/Homo_sapiens.GRCh37.75.chr.gtf --sjdbOverhang 43

# --sjdbOverhang read length -1 


STAR --runThreadN 4 --genomeDir /scratch/midway2/brimittleman/star_genome/ --readFilesIn fastq_extr/SRR1575922_extracted.fastq --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate --outStd BAM_SortedByCoordinate > star/mayer_star_align.bam

samtools sort -o star.sort/star_mayer.sort.bam star/mayer_star_align.bam

Run this on my data as well.

#!/bin/bash

#SBATCH --job-name=star_align_mayer
#SBATCH --time=8:00:00
#SBATCH --partition=broadwl
#SBATCH --mem=50G
#SBATCH --tasks-per-node=4 


module load Anaconda3
source activate net-seq

STAR --runThreadN 4 --genomeDir /scratch/midway2/brimittleman/star_genome/ --readFilesIn fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq  --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate --outStd BAM_SortedByCoordinate > star/star_18486-dep.bam

#to run: sbatch --partition=broadwl --mem=50G

Continue with the sort and index the bam.

samtools sort -o star.sort/star_18486_dep_sort.bam  star/star_18486-dep.bam
samtools index star_18486_dep_sort.bam

Look at the percent mapped with star.

samples=c("18486_dep", "mayer")
fastq_star= c(45803834,137281933)
bam_star= c(1996777,1993674)
bam_star/fastq_star

[1] 0.04359410 0.01452248

Thats way too low. This didnt work.

In the log.Final file.

% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 79.96%
% of reads unmapped: other | 0.00%

“–outFilterScoreMinOverLread 0 –outFilterMatchNminOverLread 0 –outFilterMatchNmin 0” Try to add these parameters on the mayer map.

This run gave too many multi-map reads. 64.82%.

Try:
“–outFilterScoreMinOverLread 0.3 –outFilterMatchNminOverLread 0.3”
% of reads mapped to too many loci | 63.75%

Test length of fastq reads:
Mayer: total 137281933 avg=70.000000 stddev=0.000000 18486_dep: total 45803834 avg=44.000000 stddev=0.000000

Try clipping last 10 bases with : “–clip3pNbases 10”
* This didnt work for the mayer data but that is long. I will try it on ours.

Our data:

% of reads mapped to too many loci | 31.70%
% of reads unmapped: too short | 55.32%

Other ways to fix this:

try blasting the unmapped reads

Look into BWA mapping

BWA-backtrack - for Illumina seqs up to 100 bp

First step is to construt a FM-index for the reference genome.

“bwa index -a bwtsw -p /scratch/midway2/brimittleman/BWA_genome/BWA.index STAR_genome/hg19.fa”

Added bwa to envirnoment

Mapping:

bwa aln

creates the .sai index files
-n 0.01 1% missmatch allowed
-t 3 spead up by using 3 threads

bwa samse

generates alignments in a sam format


#$1 ref fa  
#$2 fastq 
#$3 output sai 


module load Anaconda3
source activate net-seq
module load bwa

bwa aln -t 3 -n 0.01 $1 $2 > $3


#submit 
sbatch scripts/bwa_aln.sh hg19.copy.fa /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/BWA/bwa.18486.dep.sai



#SBATCH --job-name=BWA_samse
#SBATCH --time=8:00:00
#SBATCH --partition=broadwl
#SBATCH --mem=50G
#SBATCH --tasks-per-node=4
#SBATCH --mail-type=END

#$1 ref fasta
#$2 sai file  
#$3 fastq file  
#$4 output sam  


module load Anaconda3
source activate net-seq
module load bwa

bwa samse $1 $2 $3 > $4


#run on mayer  
sbatch scripts/bwa_samse.sh hg19.copy.fa /project2/gilad/briana/Net-seq/data/bwa/bwa.mayer.sai /project2/gilad/briana/Net-seq/data/fastq_extr/SRR1575922_extracted.fastq  /project2/gilad/briana/Net-seq/data/bwa/bwa.mayer.sam


sbatch scripts/bwa_samse.sh hg19.copy.fa /project2/gilad/briana/Net-seq/data/bwa/bwa.mayer.cut3prime.sai  /project2/gilad/briana/Net-seq/data/fastq_extr/SRR1575922_extracted.fastq  /project2/gilad/briana/Net-seq/data/bwa/bwa.mayer.cut3prime.sam




#run on 18486 dep  
sbatch scripts/bwa_samse.sh hg19.copy.fa /project2/gilad/briana/Net-seq/Net-seq1/data/BWA/bwa.18486.dep.sai /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq  /project2/gilad/briana/Net-seq/Net-seq1/data/BWA/bwa.18486.dep.sam

Sam to bam:
samtools view -S -b sample.sam > sample.bam

Check how big the file is:

18486_dep : 796546
Mayer: 117726

This is super low mapping as well. Not sure what is going on.

For poor quality on the ends- add -q 15 to the bwa aln command. I am trying this on the mayer data.

18486_dep : 805899
Mayer: 121892

Rerun star:

I deleted the reference genome and am reindexing and rebuilding it.

Cut polyA

Code from Sebs snakemake will allow me to cut any read that has more than 6 As. It will then keep the read if it is longer than 25 bases long post cut. I will run this on the UMI extracted fastq files.

This script is called cut_polyA.sh and is in the /project2/gilad/briana/Net-seq/scripts directory.

#!/bin/bash

#SBATCH --job-name=cut_polyA
#SBATCH --time=8:00:00
#SBATCH --partition=broadwl
#SBATCH --mem=8G
#SBATCH --tasks-per-node=4
#SBATCH --mail-type=END

module load Anaconda3
source activate net-seq

#$1 fastq file 
#$2 output cut file name 

cutadapt --minimum-length 25 -a AAAAAA -o $2 $1

Run this script first on 18486 dep.

sbatch scripts/cut_polyA.sh /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/cut_polyA/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.cutPolyA.fastq

Pre-cut: 45803834
Cut: 40905492

sbatch scripts/star_allign.sh /project2/gilad/briana/Net-seq/Net-seq1/data/cut_polyA/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.cutPolyA.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/star/star_18486-dep_cutPolyA.bam

Cut: 40905492
mapped: 1350684

Subjunc with –allJunctions

I am running subjunc on the polyA cut reads with the –allJunctions to map cononincal and non-connoical exon exon boundaries.

This script is called subjunc_all_junc.sh and is in the /project2/gilad/briana/Net-seq/scripts directory.

#!/bin/bash

#SBATCH --job-name=cut_polyA
#SBATCH --time=8:00:00
#SBATCH --partition=broadwl
#SBATCH --mem=20G
#SBATCH --tasks-per-node=4
#SBATCH --mail-type=END

module load Anaconda3
source activate net-seq

#$1 input extracted fast q  
#$2 output bam 


subjunc --allJunctions -i /project2/gilad/briana/Net-seq/Net-seq1/genome/ -r $1 -T 8 > $2

#slurm-40290339.out

sbatch scripts/subjunc_all_junc.sh /project2/gilad/briana/Net-seq/Net-seq1/data/cut_polyA/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.cutPolyA.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/subjunc_all_junc/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.cutPolyA.all.junc.bam

Before subjunc mapped 37.85. Now it mapped 38.4%.

I am also going to run this on the non polyA cut samples.

#slurm-40290637.out

sbatch scripts/subjunc_all_junc.sh /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/subjunc_all_junc/YG-SP-NET1-18486-dep-2017-10-13_S4_R1_001_extracted.all.junc.bam

Before subjunc mapped 37.85. This run is 53.1%

Try for mayer:

sbatch scripts/subjunc_all_junc.sh /project2/gilad/briana/Net-seq/data/fastq_extr/SRR1575922_extracted.fastq /project2/gilad/briana/Net-seq/data/subjunc_all_junc/mayer.extracted.subjunc.all.junc.bam

#508 dep
sbatch scripts/subjunc_all_junc.sh /project2/gilad/briana/Net-seq/Net-seq1/data/fastq_extr/YG-SP-NET1-18508-dep-2017-10-13_S2_R1_001_extracted.fastq /project2/gilad/briana/Net-seq/Net-seq1/data/subjunc_all_junc/YG-SP-NET1-18508-dep-2017-10-13_S2_R1_001_extracted.all.junc.bam

508_dep
dep fastq: 70776230
mapped: 54088856

This mapped 76.4%

samtools sort -o {output} {input}
samtools index {input}
umi_tools dedup -I {input.bam} -S {output}

Session information

sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.4.2  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
 [5] tools_3.4.2     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.13   
 [9] stringi_1.1.5   rmarkdown_1.6   knitr_1.17      git2r_0.19.0   
[13] stringr_1.2.0   digest_0.6.12   evaluate_0.10.1

This R Markdown site was created with workflowr