Open
Description
Hello, thank you very much for your SubPhaser software, which is very useful for genotyping. I used HIFI and HIC to assemble the genome of Fragaria x ananassa but I do not know the homology of the 28 scaffolds. Should I put all the chromosomes in the sg.config file in one line?
I am looking forward to your reply.
Activity
zhangrengang commentedon Apr 28, 2022
No. It may be very easy to know the homology by aligning your chromosomes to the genome of Fragaria vesca using tools such as minimap2 etc. You may be able to get a figure like below:
So the sg.config configuration can be set like:
zhangrengang commentedon Apr 28, 2022
Another way for you, the HiC contact heatmap can also show diagonal signals between homeologous chromosomes like:

So the sg.config configuration can be set like:
Note that the HiC heatmap should not be filtered by mapping qualtiy of reads.
hushaoqiang commentedon Apr 28, 2022
谢谢老师您的解答,我将使用您所推荐的方式来做同源分析。另外还有个问题请教一下老师您,为了说详细点,我就是用中文了。
老师,使用Camarosa草莓的基因组序列想要试运行一下,基因组文件是发表的序列。配置文件内容:
Fvb1-1 Fvb1-2 Fvb1-3 Fvb1-4
Fvb2-1 Fvb2-2 Fvb2-3 Fvb2-4
Fvb3-1 Fvb3-2 Fvb3-3 Fvb3-4
Fvb4-1 Fvb4-2 Fvb4-3 Fvb4-4
Fvb5-1 Fvb5-2 Fvb5-3 Fvb5-4
Fvb6-1 Fvb6-2 Fvb6-3 Fvb6-4
Fvb7-1 Fvb7-2 Fvb7-3 Fvb7-4
运行命令是:subphaser -i F_ana_Camarosa_6-28-17_hardmasked.fasta -c sg.config
但是会出现:
22-04-28 18:21:39 [INFO] Consistent with subgenome assignment: 16 (6.93%); potential exchange: 149 (64.50%)
22-04-28 18:21:39 [INFO] Output: /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.bin.enrich
22-04-28 18:21:39 [INFO] ###Step: LTR
22-04-28 18:21:39 [INFO] Identifying LTR-RTs by ['ltr_harvest']
22-04-28 18:21:45 [INFO] Check point file:
/home/hsq/hhh/subphaser/kmls/tmp/LTR.scn.ok
exists; skip this step22-04-28 18:21:45 [INFO] 8186 LTRs identified
22-04-28 18:21:45 [INFO] Check point file:
/home/hsq/hhh/subphaser/kmls/tmp/LTR.tesort.ok
exists; skip this step22-04-28 18:21:45 [INFO] By TEsorter, 190 (2.3%) are classified as LTRs, of which 0 (0.0%) are intact with complete protein domains
22-04-28 18:21:45 [INFO] After filtering, 186 / 8186 (2.3%) LTRs retained
22-04-28 18:21:45 [INFO] Outputing
coordinate
-LTR
maps to/home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.ltr.bin.count
22-04-28 18:21:45 [INFO] Start Pool with 32 process(es)
22-04-28 18:21:45 [INFO] Processed 0 sequences
Traceback (most recent call last):
File "/home/hsq/miniconda3/envs/SubPhaser/bin/subphaser", line 33, in
sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 779, in main
pipeline.run()
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 516, in run
ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[])
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 563, in step_l tr
Seqs.map_kmer3([ltrfile], d_kmers, fout=fout, k=self.k, ncpu=self.ncpu,
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/Seqs.py", line 97, in map_kmer3
logger.info('{} ({:.2%}) sequences contain subgenome-specific kmers'.format(mapped_seqs, mapped_seqs/i))
ZeroDivisionError: division by zero
k15_q200_f2.kmer_pca.pdf是成功生成的。生成的pca图与老师您附表中的图不一样。
请问一下老师我是不是需要调整参数或者其他设置有什么问题。
期待老师您的回复
zhangrengang commentedon Apr 28, 2022
Do not use hardmasked genome sequences, because subphaser is based on repeated sequences (TEs in most cases) in fact.
hushaoqiang commentedon Apr 28, 2022
老师,请问您对开麦罗莎草莓做分析时(Fig. S27 Subgenome phasing with SubPhaser in the Fragaria x ananassa genome.),使用的文件是哪一个啊。
zhangrengang commentedon Apr 28, 2022
F_ana_Camarosa_6-28-17.fasta.gz. But you may not get the identical results but similar results.
hushaoqiang commentedon Apr 28, 2022
老师,我刚刚又去GDR网站看了一下,我只看到了 F_ana_Camarosa_6-28-17_hardmasked.fasta.gz跟[F_ana_Camarosa_6-28-17_unmasked.fasta.gz两个文件,请问一下您告诉我的F_ana_Camarosa_6-28-17.fasta.gz文件这个在哪个链接下载啊,麻烦老师您了。
zhangrengang commentedon Apr 28, 2022
F_ana_Camarosa_6-28-17_unmasked.fasta.gz
hushaoqiang commentedon Apr 28, 2022
老师,已成功运行,4套亚基因组成功分型。图非常好看,感谢老师您的耐心指导与为科研工作者所开发的SubPhaser软件。
hushaoqiang commentedon Apr 30, 2022
Teacher, I have successfully found homologous chromosomes using MUMmer software, and then performed genotyping using SubPhaser. The results are very good.
Thank you!
zhangrengang commentedon Apr 30, 2022
Great. Thanks for your feedback.
kashiff007 commentedon Jun 5, 2022
Hi @zhangrengang, Thanks for this excellent software. I have been using this for my newly assembled allopolyploidy genome (2n = 4x = 40). The genome has two subgenomes, but I don't know the homologous pairs. So as you suggested in previous comments, I used mummer and minimapper2 to find the homologous pairs and assigned them in the configurations file.
But after running the SubPhaser with default parameters I got 19 chr in SG1 and only one chr in SG2. Idially, it should be 10 chr in each subgenome.


The differential k-mer heatmap also shows the similar result:
Can you suggest why it happend? Am I approaching rightly?
zhangrengang commentedon Jun 5, 2022
@kashiff007 It seems that your genome is contig-level, but not chromosome-level?
kashiff007 commentedon Jun 5, 2022
Yes, It is a newly assembled genome and from cytometry, we speculate that the largest 20 contigs are full genome length.
zhangrengang commentedon Jun 5, 2022
@kashiff007 Can you show the homoelogous relationship such as dot plot, and your config file? How much percent do the largest 20 contigs account for the whole genome?
9 remaining items
hungweichen0327 commentedon Dec 28, 2024
Hello,
Thank you for the useful SubPhaser software.
This is the HiC contact heatmap for the plant tetraploid genome (40 pseduo-chromosomes). Since it's hard to identify the homology of this genome using the Hi-C heatmap and there is no complete genome of a closed related species, any other suggestions to identify the homology of this genome?
Thank you for the help.
zhangrengang commentedon Dec 28, 2024
@hungweichen0327 You can try self alignment using aligner such as minimap2:
The
paf
file can be visualized by various tools such as minidot.hungweichen0327 commentedon Dec 28, 2024
Dear @zhangrengang ,
Thank you for the clear and quick reply. I will try it and upload the minimap2 result soon.
hungweichen0327 commentedon Dec 28, 2024
This is the minimap2 result using the suggested minimap2 script.
Do you have any suggestions about adjusting the minimap2 parameters to improve the self-alignment result?
Do you need any other information? I would appreciate any comments about getting a haplotype-phased genome using SubPhaser.
zhangrengang commentedon Dec 29, 2024
The relationships appear to be clear, but there are noisies derived from an older WGD event. You can try the following parameters to reduce the noisies:
Alternatively, you can try our SOI pipeline (https://github.com/zhangrengang/soi), so that synteny blocks derived from different WGD events can be distinguished by different Ks peaks of OI peaks.
hungweichen0327 commentedon Dec 29, 2024
Thank you for the suggestions.
This is the minimap2 result using the -D -x asm10 parameters.

For the SOI pipeline, I quickly viewed the contents and it seems that gff file of the genome assembly is required. Currently, I have not conducted gene prediction for this genome. So I can not run the SOI so far.
zhangrengang commentedon Dec 29, 2024
The homoeologous relationships (reciprocal best alignment between two chromosomes) seem to be clear, I think you can make the config file accordingly. If there are some ambiguous relationships, the corresponding chromosomes can be left as singletons. Also you can try
-D -x asm5
for minimap2 to seek more clear relationships. Or you can search some tools to visualize the alignment similarities for minimap2 output.For the SOI pipeline, indeed gene annotation is needed, but at this stage, high quality and completeness are not required for the annotation. So you may make a fast annotation using tools such as PASA, augustus, metaeuk, etc.
hungweichen0327 commentedon Dec 29, 2024
I appreciate your quick and kind help. I will make the config accordingly.
For the SOI pipeline, I will make a fast annotation and run SOI soon.
hungweichen0327 commentedon Dec 30, 2024
Hi @zhangrengang,
I have additional questions related to the config file.
Based on the dotplot of the self-alignment, there are still six chromosomes that have ambiguous relationships. Thus, the config file is shown below:
Q1: is the config file format correct?
Q2: I expected that this is the allotetraploid species. The genome would be aa'bb'. Is it okay if I use the above config to run SubPhaser with default options? (subphaser -i genome.fasta.gz -c sg.config) Ideally, I should identify the homology of the 4 chromosomes
e.g.,
Thank you.
zhangrengang commentedon Dec 30, 2024
hungweichen0327 commentedon Dec 30, 2024
I assembled the four chromosomes set aa'bb' since its chromosome number is 2x=4n=40 and I assembled the 40 chromosomes as shown in the dotplot and Hi-C contact heatmap.
However, I know both dotplot and Hi-C heatmap did not represent the patterns similar to your example output above. Thus, I am not sure how to resolve this issue so far.
PS: Based on the Genomescope prediction using k-mer frequency distribution, the size of this genome assembly is reliable (1.2 Gb) because the predicted size of haploid genome is about 0.3 Gb.
zhangrengang commentedon Dec 30, 2024
You need to seek the relationships between a and b (using SOI for example), and use the config file like:
a a' b b'
(subphaser parameter:-nsg 2 -baseline 2
),or
a b
(default parameter).However, the subgenome size is quite small, and there are likely too few subgenome-specific repeats for subphaser (try
-q 50
).hungweichen0327 commentedon Dec 31, 2024
Thank you for the reply.
I will try SOI to seek homology relationships of this tetraploid genome since the self-alignment result (dotplot) can not identify the 4 chromosome sets.