genblastG_extension primarily addresses the following issues:
- Wraps the genblastG software so that it can run in multiple threads, improving computational efficiency.
- Corrects the phase of CDS features in the GFF3 output and rewrites the result in standard GFF3 format.
- Eliminates redundancy: based on the predicted chromosomal positions, it retains only the gene model with the best score and length.
- Provides the diff_gff3.py script, which compares high-quality gene models with the gene models predicted by genblastG_extension to support gene family identification.
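To make the phase correction concrete, here is a minimal sketch of how CDS phases can be derived from segment lengths under the GFF3 definition of phase (the number of bases to skip at the start of a segment to reach the next complete codon). The function name `assign_cds_phases` is hypothetical, and the sketch assumes the first CDS segment begins on a complete codon; it is not taken from genblastG_extension.py itself.

```python
def assign_cds_phases(cds_intervals, strand):
    """Return (start, end, phase) tuples for the CDS segments of one transcript.

    cds_intervals: list of (start, end) pairs, 1-based inclusive (GFF3 style).
    strand: '+' or '-'.
    Hypothetical helper; assumes the first segment starts with a full codon.
    """
    # Walk the segments in transcription order:
    # ascending coordinates on '+', descending on '-'.
    ordered = sorted(cds_intervals, reverse=(strand == "-"))
    phased = []
    phase = 0
    for start, end in ordered:
        phased.append((start, end, phase))
        length = end - start + 1
        # After skipping `phase` bases, (length - phase) % 3 bases are left over
        # at the end; the next segment must supply the rest of that codon.
        phase = (3 - (length - phase) % 3) % 3
    return phased


# CDS segments of the '+' strand gene on scaffold292 from the example output.
plus_cds = [(151014, 151347), (159761, 159975), (160161, 160280), (161186, 161317)]
for start, end, phase in assign_cds_phases(plus_cds, "+"):
    print(start, end, phase)
```

For the scaffold292 gene in the example data this yields phases 0, 2, 0, 0, which agrees with the CDS lines of genblastG_extension.filtered.gff3.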
Python >= 3.5.
The biopython and natsort packages must be installed.
The pipeline is based on the genblastG (v1.38) software and the genblastg_patch patch.
$ git clone git@github.com:thecgs/genblastG_extension.git
$ pip install biopython
$ pip install natsort
$ genblastG_extension.py -h
usage: genblastG_extension.py -g str -q str [-p str] [-c float] [-gap] [-e str] [-G int] [-t int] [-h] [-v]
This script is mainly the wrapper of genblastg software.
It can run genblastg in multiple threads, reconstruct the results, and output the standard gff3 file.
Secondly, it can filter the redundant gene model according to the prediction score and the length of
the predicted gene to generate the best non redundant gene model.
required arguments:
-g str, --genome str A file of genome fasta format.
-q str, --query str A file of query protein fasta format.
optional arguments:
-p str, --prefix str A prefix of output. default=genblastG_extension
-c float, --query_cover float
minimum query cover (0-1) to report an alignment. defualt=0.8
-gap, --gap parameter for blast: Perform gapped alignment. default=False
-e str, --evalue str A maximum evalue of report alignments. defualt=1e-5
-G int, --genetic_code int
Genetic code. default=1
-t int, --thread int Thread number of single sortware. defualt=16
-h, --help Show this help message and exit.
-v, --version Show program's version number and exit.
Date:2025/09/24 Author:Guisen Chen Email:thecgs001@foxmail.com
$ cd ./example
$ ../genblastG_extension.py -g genome.fa -q seq.fa
$ cat genblastG_extension.filtered.gff3
scaffold292 genBlastG gene 151014 161317 96.055 + . ID=GENE1_NP_001119935.1.1-R1-1-A1;Target=NP_001119935.1.1;
scaffold292 genBlastG mRNA 151014 161317 . + . ID=mrna.GENE1_NP_001119935.1.1-R1-1-A1;Parent=GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG exon 151014 151347 . + . ID=exon.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG CDS 151014 151347 . + 0 ID=cds.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG exon 159761 159975 . + . ID=exon.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG CDS 159761 159975 . + 2 ID=cds.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG exon 160161 160280 . + . ID=exon.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG CDS 160161 160280 . + 0 ID=cds.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG exon 161186 161317 . + . ID=exon.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG CDS 161186 161317 . + 0 ID=cds.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold490 genBlastG gene 530885 536012 79.581 - . ID=GENE2_NP_001092224.1.2-R1-1-A1;Target=NP_001092224.1.2;
scaffold490 genBlastG mRNA 530885 536012 . - . ID=mrna.GENE2_NP_001092224.1.2-R1-1-A1;Parent=GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG exon 535679 536012 . - . ID=exon.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG CDS 535679 536012 . - 0 ID=cds.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG exon 531679 531896 . - . ID=exon.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG CDS 531679 531896 . - 2 ID=cds.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG exon 531127 531246 . - . ID=exon.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG CDS 531127 531246 . - 0 ID=cds.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG exon 530885 531022 . - . ID=exon.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG CDS 530885 531022 . - 0 ID=cds.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
$ cat genblastG_extension.pep.fasta
>mrna.NP_001119935.1-R1-1-A1 Gene=NP_001119935.1-R1-1-A1 Length=266 Position=scaffold292:151014-161317(+)
MILSFCLLVTFILGLSGCLGSYVPCEPCDEKAMSMCPPVPVGCQLVKEPGCGCCLTCALSEGQACGVYTGTCTQGLRCLPRSGEEKPLHALLHGRGVCTNEKGYKPAHPPIDRESREHEDTMTTEITEELQPAKVPLLPKDIVNSKKVHALRKEQKRKLGKQRYMGSPMDYSPLPIDKHEPEFGPCRRKLDGIIQGMKDTSRVMALSLYLPNCDRKGFFKRKQCKPSRGRKRGICWCVDKYGIQLPGTDYSGGDIQCKDLESSNNE*
>mrna.NP_001092224.1-R2-2-A1 Gene=NP_001092224.1-R2-2-A1 Length=266 Position=scaffold490:530885-536012(-)
MLLSVSLLVLPLLSFPGCGSSYVPCEPCDQKAQSMCPPVPMGCQLVKEPGCGCCLTCALEEGQPCGVYTGPCTRGLRCLPKNGEEKPLHALLHGRGVCRNEKLYKLLHPSKDESHDDTLLPVPESMLPQTKVPLYGRDHISSRKVHAMKQAKDRKKQLARLGPASNLDFSPLSLDKMDPEFGPCRRRLDNLIQSMKDTSRVLALSLYIPNCDKKGFFKRKQCKPSRGRKRGICWCVDRFGVKIPGINYAGGDLQCKDLDSSSNSNE*
$ cat genblastG_extension.cds.fasta
>mrna.NP_001119935.1-R1-1-A1 Gene=NP_001119935.1-R1-1-A1 Length=801 Position=scaffold292:151014-161317(+)
ATGATTCTGAGTTTTTGCCTCTTGGTGACATTTATCTTGGGGCTGTCCGGCTGCTTGGGCTCATACGTGCCGTGCGAGCCTTGTGACGAGAAGGCGATGTCCATGTGCCCTCCGGTCCCGGTCGGATGCCAGCTGGTCAAGGAGCCGGGCTGCGGCTGCTGCCTAACGTGTGCCCTGTCTGAGGGGCAGGCGTGCGGCGTTTACACCGGGACGTGCACCCAGGGCCTGCGCTGCCTGCCGAGGAGCGGGGAGGAGAAACCCCTGCACGCCCTTCTCCACGGCAGGGGAGTGTGCACCAACGAGAAAGGATACAAACCTGCCCACCCGCCCATAGATCGTGAGTCTCGAGAACATGAGGACACCATGACCACAGAGATTACAGAGGAGTTGCAGCCAGCCAAAGTGCCGCTCCTTCCTAAAGACATTGTGAACAGTAAAAAAGTCCATGCGCTGCGCAAGGAGCAAAAGAGGAAGCTGGGCAAGCAGCGCTACATGGGCTCTCCTATGGACTATTCCCCTCTGCCCATCGACAAGCATGAGCCTGAATTTGGTCCATGCAGAAGAAAACTGGATGGGATCATTCAGGGGATGAAGGACACTTCTCGTGTAATGGCTCTGTCTTTGTACCTCCCCAACTGCGACAGAAAAGGATTCTTCAAGCGCAAGCAGTGTAAACCATCTCGCGGCCGCAAACGAGGCATCTGCTGGTGCGTGGACAAGTACGGCATCCAGCTCCCCGGCACAGACTACAGCGGAGGGGACATTCAGTGTAAAGACCTGGAGAGCAGCAACAACGAGTGA
>mrna.NP_001092224.1-R2-2-A1 Gene=NP_001092224.1-R2-2-A1 Length=801 Position=scaffold490:530885-536012(-)
ATGCTGCTGAGTGTTTCCCTCCTGGTGCTCCCCCTGCTTAGCTTCCCCGGCTGCGGCTCGTCGTACGTGCCGTGCGAGCCGTGCGATCAGAAGGCCCAGTCCATGTGCCCGCCGGTGCCGATGGGCTGTCAGCTGGTGAAGGAGCCCGGCTGCGGCTGCTGCCTGACGTGCGCGCTCGAAGAGGGCCAGCCGTGCGGCGTGTACACCGGGCCGTGCACCCGCGGGCTCCGGTGCCTCCCGAAGAACGGCGAGGAGAAGCCGCTGCATGCCCTGCTGCACGGCCGGGGGGTGTGCAGGAACGAGAAGTTGTACAAACTGCTGCATCCGTCAAAAGACGAATCTCACGATGACACCCTGCTGCCCGTCCCTGAGTCAATGCTGCCGCAAACCAAGGTGCCCTTATATGGAAGAGACCACATCAGCAGTCGGAAGGTCCACGCCATGAAGCAAGCCAAGGACCGCAAGAAGCAGCTGGCCAGGTTGGGACCTGCCAGCAACCTGGACTTCTCACCGCTAAGCCTGGATAAAATGGATCCCGAGTTCGGGCCCTGCAGGAGAAGATTGGACAATCTCATCCAGAGCATGAAAGACACCTCTCGGGTCTTGGCTCTCTCTCTGTACATCCCCAACTGTGACAAGAAGGGCTTCTTCAAGCGCAAACAGTGTAAGCCGTCTCGTGGACGAAAAAGGGGCATCTGCTGGTGCGTCGACCGGTTTGGCGTGAAAATCCCAGGCATCAACTACGCCGGCGGAGACCTGCAGTGCAAGGATCTCGACAGCAGCAGCAACAGCAATGAATGA
$ cat genblastG_extension.raw.gff3
scaffold292 genBlastG gene 151014 161317 96.055 + . ID=GENE1_NP_001119935.1.1-R1-1-A1;Target=NP_001119935.1.1;
scaffold292 genBlastG mRNA 151014 161317 . + . ID=mrna.GENE1_NP_001119935.1.1-R1-1-A1;Parent=GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG exon 151014 151347 . + . ID=exon.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG CDS 151014 151347 . + 0 ID=cds.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG exon 159761 159975 . + . ID=exon.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG CDS 159761 159975 . + 2 ID=cds.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG exon 160161 160280 . + . ID=exon.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG CDS 160161 160280 . + 0 ID=cds.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG exon 161186 161317 . + . ID=exon.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold292 genBlastG CDS 161186 161317 . + 0 ID=cds.GENE1_NP_001119935.1.1-R1-1-A1;Parent=mrna.GENE1_NP_001119935.1.1-R1-1-A1;
scaffold490 genBlastG gene 530885 536012 79.581 - . ID=GENE2_NP_001092224.1.2-R1-1-A1;Target=NP_001092224.1.2;
scaffold490 genBlastG mRNA 530885 536012 . - . ID=mrna.GENE2_NP_001092224.1.2-R1-1-A1;Parent=GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG exon 535679 536012 . - . ID=exon.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG CDS 535679 536012 . - 0 ID=cds.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG exon 531679 531896 . - . ID=exon.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG CDS 531679 531896 . - 2 ID=cds.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG exon 531127 531246 . - . ID=exon.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG CDS 531127 531246 . - 0 ID=cds.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG exon 530885 531022 . - . ID=exon.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold490 genBlastG CDS 530885 531022 . - 0 ID=cds.GENE2_NP_001092224.1.2-R1-1-A1;Parent=mrna.GENE2_NP_001092224.1.2-R1-1-A1;
scaffold292 genBlastG gene 151014 161317 77.4653 + . ID=GENE3_NP_001092224.1.2-R2-2-A1;Target=NP_001092224.1.2;
scaffold292 genBlastG mRNA 151014 161317 . + . ID=mrna.GENE3_NP_001092224.1.2-R2-2-A1;Parent=GENE3_NP_001092224.1.2-R2-2-A1;
scaffold292 genBlastG exon 151014 151347 . + . ID=exon.GENE3_NP_001092224.1.2-R2-2-A1;Parent=mrna.GENE3_NP_001092224.1.2-R2-2-A1;
scaffold292 genBlastG CDS 151014 151347 . + 0 ID=cds.GENE3_NP_001092224.1.2-R2-2-A1;Parent=mrna.GENE3_NP_001092224.1.2-R2-2-A1;
scaffold292 genBlastG exon 159761 159975 . + . ID=exon.GENE3_NP_001092224.1.2-R2-2-A1;Parent=mrna.GENE3_NP_001092224.1.2-R2-2-A1;
scaffold292 genBlastG CDS 159761 159975 . + 2 ID=cds.GENE3_NP_001092224.1.2-R2-2-A1;Parent=mrna.GENE3_NP_001092224.1.2-R2-2-A1;
scaffold292 genBlastG exon 160161 160280 . + . ID=exon.GENE3_NP_001092224.1.2-R2-2-A1;Parent=mrna.GENE3_NP_001092224.1.2-R2-2-A1;
scaffold292 genBlastG CDS 160161 160280 . + 0 ID=cds.GENE3_NP_001092224.1.2-R2-2-A1;Parent=mrna.GENE3_NP_001092224.1.2-R2-2-A1;
scaffold292 genBlastG exon 161186 161317 . + . ID=exon.GENE3_NP_001092224.1.2-R2-2-A1;Parent=mrna.GENE3_NP_001092224.1.2-R2-2-A1;
scaffold292 genBlastG CDS 161186 161317 . + 0 ID=cds.GENE3_NP_001092224.1.2-R2-2-A1;Parent=mrna.GENE3_NP_001092224.1.2-R2-2-A1;
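In the raw output, GENE3 occupies the same locus as GENE1 (scaffold292:151014-161317, + strand) but has a lower score (77.4653 versus 96.055), and it is absent from genblastG_extension.filtered.gff3. The sketch below shows one way such a filter could work. The `Gene` tuple, the `overlaps` and `filter_redundant` functions, the same-strand overlap test, and the greedy "score first, then length" ranking are all assumptions made for illustration; the actual criteria used by genblastG_extension.py may differ.

```python
from collections import namedtuple

# Minimal stand-in for a GFF3 gene line (hypothetical structure).
Gene = namedtuple("Gene", "seqid start end strand score gene_id")


def overlaps(a, b):
    """True if two genes share a sequence, a strand and at least one base."""
    return (a.seqid == b.seqid and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def filter_redundant(genes):
    """Greedily keep the best gene model at each locus.

    Assumed ranking: higher score first, longer span as the tie-breaker.
    """
    ranked = sorted(genes,
                    key=lambda g: (g.score, g.end - g.start + 1),
                    reverse=True)
    kept = []
    for gene in ranked:
        # Discard a model if a better-ranked model already covers its locus.
        if not any(overlaps(gene, k) for k in kept):
            kept.append(gene)
    return sorted(kept, key=lambda g: (g.seqid, g.start))


# Gene lines taken from genblastG_extension.raw.gff3 above.
raw_genes = [
    Gene("scaffold292", 151014, 161317, "+", 96.055, "GENE1_NP_001119935.1.1-R1-1-A1"),
    Gene("scaffold490", 530885, 536012, "-", 79.581, "GENE2_NP_001092224.1.2-R1-1-A1"),
    Gene("scaffold292", 151014, 161317, "+", 77.4653, "GENE3_NP_001092224.1.2-R2-2-A1"),
]
for gene in filter_redundant(raw_genes):
    print(gene.gene_id, gene.score)
```

With the three gene lines above, the sketch keeps GENE1 and GENE2 and drops GENE3, which is consistent with the filtered file shown earlier.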