Seqkit Usage

9.4k 词

seqkit usage

Repo直达 https://github.com/shenwei356/seqkit

Introduction

seqkit是重庆医科大学病毒性肝炎研究所副研究员沈伟老师基于Go语言开发的跨平台的、快速的、全面的fasta/q处理工具

Installation

1
2
3
4
# General
conda install -c biconda seqkit
# Alternative for mac
brew install seqkit

Usage

Example file: AARS.lncRPI.bed
Description: AARS蛋白结合lncRNA位点信息
Source: POSTAR3 from LuLab@Tsinghua Univeristy

subseq

Get subsequences by region/gtf/bed, including flanking sequences.

Usage:
seqkit subseq [flags]

Flags:
–bed string by tab-delimited BED file
–chr strings select limited sequence with sequence IDs when using --gtf or --bed (multiple
value supported, case ignored)
-d, --down-stream int down stream length
–feature strings select limited feature types (multiple value supported, case ignored, only
works with GTF)
–gtf string by GTF (version 2.2) file
–gtf-tag string output this tag as sequence comment (default “gene_id”)
-h, --help help for subseq
-f, --only-flank only return up/down stream sequence
-r, --region string by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for
cutting first 12 bases. type “seqkit subseq -h” for more examples
-u, --up-stream int up stream length
-U, --update-faidx update the fasta index file if it exists. Use this if you are not sure whether
the fasta file changed

1
seqkit subseq --bed AARS.lncRPI.bed GRCh38.p13.genome.fa.gz > AARS.lncRPI.fasta

stats

Simple statistics of FASTA/Q files.

Flags:
-a, --all all statistics, including quartiles of seq length, sum_gap, N50
-b, --basename only output basename of files
-E, --fq-encoding string fastq quality encoding. available values: ‘sanger’, ‘solexa’, ‘illumina-1.3+’, ‘illumina-1.5+’, ‘illumina-1.8+’. (default “sanger”)
-G, --gap-letters string gap letters (default “- .”)
-h, --help help for stats
-e, --skip-err skip error, only show warning message
-i, --stdin-label string label for replacing default “-” for stdin (default “-”)
-T, --tabular output in machine-friendly tabular format

1
seqkit stat -a -T AARS.lncRPI.fasta | csvlook

watch

Monitoring and online histograms of sequence features

Usage:
seqkit watch [flags]

Flags:
-B, --bins int number of histogram bins (default -1)
-W, --delay int sleep this many seconds after online plotting (default 1)
-y, --dump print histogram data to stderr instead of plotting
-f, --fields string target fields, available values: ReadLen, MeanQual, GC, GCSkew
(default “ReadLen”)
-h, --help help for watch
-O, --img string save histogram to this PDF/image file
-H, --list-fields print out a list of available fields
-L, --log log10(x+1) transform numeric values
-x, --pass pass through mode (write input to stdout)
-p, --print-freq int print/report after this many records (-1 for print after EOF) (default -1)
-b, --qual-ascii-base int ASCII BASE, 33 for Phred+33 (default 33)
-Q, --quiet-mode supress all plotting to stderr
-R, --reset reset histogram after every report
-v, --validate-seq validate bases according to the alphabet
-V, --validate-seq-length int length of sequence to validate (0 for whole seq) (default 10000)

1
seqkit watch -f ReadLen AARS.lncRPI.fasta

seq

Transform sequences (extract ID, filter by length, remove gaps, reverse complement…)

Usage:
seqkit seq [flags]

Flags:
-k, --color colorize sequences - to be piped into “less -R”
-p, --complement complement sequence, flag ‘-v’ is recommended to switch on
–dna2rna DNA to RNA
-G, --gap-letters string gap letters to be removed with -g/–remove-gaps (default “- \t.”)
-h, --help help for seq
-l, --lower-case print sequences in lower case
-M, --max-len int only print sequences shorter than or equal to the maximum length (-1
for no limit) (default -1)
-R, --max-qual float only print sequences with average quality less than this limit (-1 for
no limit) (default -1)
-m, --min-len int only print sequences longer than or equal to the minimum length (-1
for no limit) (default -1)
-Q, --min-qual float only print sequences with average quality qreater or equal than this
limit (-1 for no limit) (default -1)
-n, --name only print names/sequence headers
-i, --only-id print IDs instead of full headers
-q, --qual only print qualities
-b, --qual-ascii-base int ASCII BASE, 33 for Phred+33 (default 33)
-g, --remove-gaps remove gaps letters set by -G/–gap-letters, e.g., spaces, tabs, and
dashes (gaps “-” in aligned sequences)
-r, --reverse reverse sequence
–rna2dna RNA to DNA
-s, --seq only print sequences
-u, --upper-case print sequences in upper case
-v, --validate-seq validate bases according to the alphabet
-V, --validate-seq-length int length of sequence to validate (0 for whole seq) (default 10000)

1
2
# 查看.fasta文件
seqkit seq -k AARS.lncRPI.fasta

1
2
# 提取ID
seqkit seq AARS.lncRPI.fasta -n

1
2
# 提取序列
seqkit seq AARS.lncRPI.fasta -s

1
2
# 过滤长度小于40bp的序列
seqkit seq -m 40 -g AARS.lncRPI.fasta | seqkit stats -a -T | csvlook

faidx

Create FASTA index file and extract subsequence

Usage:
seqkit faidx [flags] [regions…]

Flags:
-f, --full-head print full header line instead of just ID. >New fasta index file ending with
.seqkit.fai will be created
-h, --help help for faidx
-i, --ignore-case ignore case
-I, --immediate-output print output immediately, do not use write buffer
-l, --region-file string file containing a list of regions
-U, --update-faidx update the fasta index file if it exists. Use this if you are not sure
whether the fasta file changed
-r, --use-regexp IDs are regular expression. But subseq region is not supported here.

1
2
# 筛选位于chr15的序列
seqkit faidx AARS.lncRPI.fasta chr15 -r -f | seqkit stats -a -T | csvlook

locate

Locate subsequences/motifs, mismatch allowed

Usage:
seqkit locate [flags]

Flags:
–bed output in BED6 format
-c, --circular circular genome. type “seqkit locate -h” for details
-d, --degenerate pattern/motif contains degenerate base
–gtf output in GTF format
-h, --help help for locate
-M, --hide-matched do not show matched sequences
-i, --ignore-case ignore case
-I, --immediate-output print output immediately, do not use write buffer
-m, --max-mismatch int max mismatch when matching by seq. For large genomes like human
genome, using mapping/alignment tools would be faster
-G, --non-greedy non-greedy mode, faster but may miss motifs overlapping with others
-P, --only-positive-strand only search on positive strand
-p, --pattern strings pattern/motif (multiple values supported. Attention: use double
quotation marks for patterns containing comma, e.g., -p ‘“A{2,}”’)
-f, --pattern-file string pattern/motif file (FASTA format)
-F, --use-fmi use FM-index for much faster search of lots of sequence patterns
-r, --use-regexp patterns/motifs are regular expression
-V, --validate-seq-length int length of sequence to validate (0 for whole seq) (default 10000)

使用MEME得到的motif

1
2
3
4
5
6
7
8
9
10
# Locate motif found by MEME
seqkit seq AARS.lncRPI.fasta \
| seqkit locate -i -d -p CACCATATTAAAGGAGGGAGTGGCAGAAAACGTGAGGTGGG \
| head \
| csvtk pretty -t
# Save the located ID
seqkit seq AARS.lncRPI.fasta \
| seqkit locate -i -d -p CACCATATTAAAGGAGGGAGTGGCAGAAAACGTGAGGTGGG \
| cut -f 1 \
| gerp -v "seqID" > extract.id

Alt text

grep

Search sequences by ID/name/sequence/sequence motifs, mismatch allowed.

Usage:
seqkit grep [flags]

Flags:
-n, --by-name match by full name instead of just ID
-s, --by-seq search subseq on seq, both positive and negative strand are searched, and
mismatch allowed using flag -m/–max-mismatch
-c, --circular circular genome
-C, --count just print a count of matching records. with the -v/–invert-match flag,
count non-matching records
-d, --degenerate pattern/motif contains degenerate base
–delete-matched delete a pattern right after being matched, this keeps the firstly
matched data and speedups when using regular expressions
-h, --help help for grep
-i, --ignore-case ignore case
-I, --immediate-output print output immediately, do not use write buffer
-v, --invert-match invert the sense of matching, to select non-matching records
-m, --max-mismatch int max mismatch when matching by seq. For large genomes like human genome,
using mapping/alignment tools would be faster
-P, --only-positive-strand only search on positive strand
-p, --pattern strings search pattern (multiple values supported. Attention: use double
quotation marks for patterns containing comma, e.g., -p ‘“A{2,}”’)
-f, --pattern-file string pattern file (one record per line)
-R, --region string specify sequence region for searching. e.g 1:12 for first 12 bases,
-12:-1 for last 12 bases
-r, --use-regexp patterns are regular expression

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
seqkit grep -f extract.id AARS.lncRPI.fasta -o extract.fasta
```

### split
Split sequences into files by name ID, subsequence of given region,
part size or number of parts.
>Usage:
> seqkit split [flags]
>
>Flags:
> -i, --by-id split squences according to sequence ID
--by-id-prefix string file prefix for --by-id
-p, --by-part int split sequences into N parts
--by-part-prefix string file prefix for --by-part
-r, --by-region string split squences according to subsequence of given region. e.g 1:12 for
first 12 bases, -12:-1 for last 12 bases. type "seqkit split -h" for
more examples
--by-region-prefix string file prefix for --by-region
-s, --by-size int split sequences into multi parts with N sequences
--by-size-prefix string file prefix for --by-size
-d, --dry-run dry run, just print message and no files will be created.
-e, --extension string set output file extension, e.g., ".gz", ".xz", or ".zst"
-f, --force overwrite output directory
-h, --help help for split
-k, --keep-temp keep temporary FASTA and .fai file when using 2-pass mode
-O, --out-dir string output directory (default value is $infile.split)
-2, --two-pass two-pass mode read files twice to lower memory usage. (only for FASTA
format)
-U, --update-faidx update the fasta index file if it exists. Use this if you are not sure
whether the fasta file changed


```bash
# 分割fasta,每个fasta 1000条序列
seqkit split AARS.lncRPI.fasta -s 1000
[INFO] split into 1000 seqs per file
[INFO] write 1000 sequences to file: AARS.lncRPI.fasta.split/AARS.lncRPI.part_001.fasta
[INFO] write 317 sequences to file: AARS.lncRPI.fasta.split/AARS.lncRPI.part_002.fasta
seqkit stats -a -T AARS.lncRPI.fasta.split/* | csvlook

shuffle

Shuffle sequences

Usage:
seqkit shuffle [flags]

Flags:
-h, --help help for shuffle
-k, --keep-temp keep temporary FASTA and .fai file when using 2-pass mode
-s, --rand-seed int rand seed for shuffle (default 23)
-2, --two-pass two-pass mode read files twice to lower memory usage. (only for FASTA format)
-U, --update-faidx update the fasta index file if it exists. Use this if you are not sure whether
the fasta file changed

1
2
3
4
5
seqkit shuffle AARS.lncRPI.fasta > shuffle.fasta
[INFO] read sequences ...
[INFO] 1317 sequences loaded
[INFO] shuffle ...
[INFO] output ...

Reference

  1. SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation