Bioinformatics Analysis_1


Practice guide - Linux

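This chapter covers using a few simple Linux command-line tools to inspect the basic information in a GTF/GFF genome annotation file, to extract data from the file, and to use the extracted data to compute properties of a specific feature (for example, the cumulative length of genes).

See the practice guide for details.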
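As an illustration of the "cumulative gene length" idea, here is a minimal sketch. It assumes a standard 9-column, tab-separated GTF and runs on a tiny made-up file (the records and the `total_gene_length` variable are hypothetical, not taken from 1.gtf).

```shell
# Toy GTF with made-up records, so the sketch runs without 1.gtf.
gtf=$(mktemp)
printf '#!genome-build toy\n' > "$gtf"
printf 'I\tensembl\tgene\t100\t199\t.\t+\t.\tgene_id "g1";\n' >> "$gtf"
printf 'I\tensembl\tCDS\t100\t199\t.\t+\t0\tgene_id "g1";\n' >> "$gtf"
printf 'II\tensembl\tgene\t50\t549\t.\t-\t.\tgene_id "g2";\n' >> "$gtf"

# Drop comment lines, keep only "gene" records (column 3), and sum
# end - start + 1, because GTF coordinates are 1-based and inclusive.
total_gene_length=$(grep -v '^#' "$gtf" |
  awk -F'\t' '$3 == "gene" {sum += $5 - $4 + 1} END {print sum + 0}')

echo "total gene length: $total_gene_length"
rm -f "$gtf"
```

Overlapping genes are counted twice by this simple sum; merging intervals first would be needed if that matters.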

Homework

  • List the last 10 CDS records on chromosome XI in 1.gtf (sorted by the genomic coordinate of each CDS end position)

    Implementation

      grep CDS 1.gtf |awk '$1 == "XI"' | sort -t $'\t' -k 5 | tail

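    ==Note: `$'\t'` is Bash ANSI-C quoting, which turns `\t` into a literal TAB for `sort -t`. `-k 5` sorts on the fifth column, the CDS end position. Because no `n` flag is given, the key is compared as text from column 5 to the end of the line, so `99638` sorts after `100000`; use `-k5,5n` for numeric order by end coordinate.==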

`The output is`

```shell
XI    ensembl    CDS    82947    82998    .    +    0    gene_id "YKL190W"; gene_version "1"; transcript_id "YKL190W"; transcript_version "1"; exon_number "1"; gene_name "CNB1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "CNB1"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL190W"; protein_version "1";
XI    ensembl    CDS    83075    83547    .    +    1    gene_id "YKL190W"; gene_version "1"; transcript_id "YKL190W"; transcript_version "1"; exon_number "2"; gene_name "CNB1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "CNB1"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL190W"; protein_version "1";
XI    ensembl    CDS    84704    85900    .    +    0    gene_id "YKL189W"; gene_version "1"; transcript_id "YKL189W"; transcript_version "1"; exon_number "1"; gene_name "HYM1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "HYM1"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL189W"; protein_version "1";
XI    ensembl    CDS    86228    88786    .    -    0    gene_id "YKL188C"; gene_version "1"; transcript_id "YKL188C"; transcript_version "1"; exon_number "1"; gene_name "PXA2"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "PXA2"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL188C"; protein_version "1";
XI    ensembl    CDS    89287    91536    .    -    0    gene_id "YKL187C"; gene_version "1"; transcript_id "YKL187C"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "YKL187C"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL187C"; protein_version "1";
XI    ensembl    CDS    92747    93298    .    -    0    gene_id "YKL186C"; gene_version "1"; transcript_id "YKL186C"; transcript_version "1"; exon_number "1"; gene_name "MTR2"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "MTR2"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL186C"; protein_version "1";
XI    ensembl    CDS    94499    96262    .    +    0    gene_id "YKL185W"; gene_version "1"; transcript_id "YKL185W"; transcript_version "1"; exon_number "1"; gene_name "ASH1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "ASH1"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL185W"; protein_version "1";
XI    ensembl    CDS    96757    98154    .    +    0    gene_id "YKL184W"; gene_version "1"; transcript_id "YKL184W"; transcript_version "1"; exon_number "1"; gene_name "SPE1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SPE1"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL184W"; protein_version "1";
XI    ensembl    CDS    98398    98607    .    -    0    gene_id "YKL183C-A"; gene_version "1"; transcript_id "YKL183C-A"; transcript_version "1"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "YKL183C-A"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL183C-A"; protein_version "1";
XI    ensembl    CDS    98721    99638    .    +    0    gene_id "YKL183W"; gene_version "1"; transcript_id "YKL183W"; transcript_version "1"; exon_number "1"; gene_name "LOT5"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "LOT5"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "YKL183W"; protein_version "1";
```
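To rank CDS records strictly by numeric end coordinate, the sort key can be limited to column 5 and compared numerically. The following is a minimal sketch on made-up records (the toy file and the `last_ends` and `text_last` variables are hypothetical); it also matches column 3 exactly instead of grepping for the string `CDS` anywhere in the line.

```shell
# Toy GTF with made-up records, so the sketch runs without 1.gtf.
gtf=$(mktemp)
printf 'XI\tensembl\tCDS\t99000\t99638\t.\t+\t0\tgene_id "a";\n' > "$gtf"
printf 'XI\tensembl\tCDS\t100000\t100500\t.\t+\t0\tgene_id "b";\n' >> "$gtf"
printf 'XI\tensembl\tCDS\t650000\t651000\t.\t-\t0\tgene_id "c";\n' >> "$gtf"
printf 'XI\tensembl\texon\t700000\t700100\t.\t-\t.\tgene_id "d";\n' >> "$gtf"
printf 'IV\tensembl\tCDS\t900000\t900100\t.\t+\t0\tgene_id "e";\n' >> "$gtf"

# A literal TAB for sort -t (portable equivalent of Bash's $'\t').
TAB=$(printf '\t')

# Numeric sort on column 5 only, then keep the last two end coordinates.
last_ends=$(awk -F'\t' '$1 == "XI" && $3 == "CDS"' "$gtf" |
  sort -t "$TAB" -k5,5n | tail -n 2 | cut -f 5 | paste -sd, -)

# For contrast: a text sort starting at column 5 puts 99638 last.
text_last=$(awk -F'\t' '$1 == "XI" && $3 == "CDS"' "$gtf" |
  LC_ALL=C sort -t "$TAB" -k 5 | tail -n 1 | cut -f 5)

echo "numeric: $last_ends"
echo "text:    $text_last"
rm -f "$gtf"
```

With `1.gtf`, the same pipeline would end in `tail -n 10` (without the `cut` and `paste` steps) to print the full records.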
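  • Count the records of each feature type on chromosome IV (column 3 of 1.gtf; for some annotation files column 2 should be considered as well), and sort the counts in ascending order

    Implementation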

      grep -v '^#' 1.gtf | awk '$1 == "IV" {print $3}' | sort | uniq -c | sort -k1,1n

    ==`grep -v '^#'` drops the irrelevant lines that start with `#` (the header comments).==
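For annotation files where column 2 (the source) also matters, the count can be keyed on both columns. This is a minimal sketch on made-up records (the toy file and the `counts` variable are hypothetical), assuming a tab-separated GTF:

```shell
# Toy GTF with made-up records, so the sketch runs without 1.gtf.
gtf=$(mktemp)
printf '#!genome-build toy\n' > "$gtf"
printf 'IV\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n' >> "$gtf"
printf 'IV\tensembl\tCDS\t1\t50\t.\t+\t0\tgene_id "g1";\n' >> "$gtf"
printf 'IV\tensembl\tCDS\t60\t100\t.\t+\t1\tgene_id "g1";\n' >> "$gtf"
printf 'IV\thavana\tCDS\t60\t100\t.\t+\t1\tgene_id "g1";\n' >> "$gtf"
printf 'XI\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "g2";\n' >> "$gtf"

# A literal TAB for sort -t (portable equivalent of Bash's $'\t').
TAB=$(printf '\t')

# Count each (source, feature) pair on chromosome IV, then sort by
# count ascending, breaking ties by source and feature name.
counts=$(grep -v '^#' "$gtf" |
  awk -F'\t' '$1 == "IV" {n[$2 "\t" $3]++}
              END {for (k in n) print n[k] "\t" k}' |
  sort -t "$TAB" -k1,1n -k2,3)

printf '%s\n' "$counts"
rm -f "$gtf"
```

Counting inside `awk` avoids the `sort | uniq -c` step; the final `sort` is still needed because `awk` does not guarantee the iteration order of `for (k in n)`.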

    The output is


Author: 张忠楠
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 张忠楠 as the source when reposting.