sg.config configuration (The homology of the de novo assembled genome is not known) #4

Open

sg.config configuration (The homology of the de novo assembled genome is not known)#4

Hello, thank you very much for your SubPhaser software, which is very useful for genotyping. I used HIFI and HIC to assemble the genome of Fragaria x ananassa but I do not know the homology of the 28 scaffolds. Should I put all the chromosomes in the sg.config file in one line?
I am looking forward to your reply.

zhangrengang

Owner

No. It may be very easy to know the homology by aligning your chromosomes to the genome of Fragaria vesca using tools such as minimap2 etc. You may be able to get a figure like below:

So the sg.config configuration can be set like:

1-1 1-2 1-3 1-4
2-1 2-2 2-3 2-4
...

zhangrengang

Owner

Another way for you, the HiC contact heatmap can also show diagonal signals between homeologous chromosomes like:

So the sg.config configuration can be set like:

chr01a chr01b chr01c chr01d
chr02a chr02b chr02c chr02d
...

Note that the HiC heatmap should not be filtered by mapping qualtiy of reads.

hushaoqiang

Author

谢谢老师您的解答，我将使用您所推荐的方式来做同源分析。另外还有个问题请教一下老师您，为了说详细点，我就是用中文了。
老师，使用Camarosa草莓的基因组序列想要试运行一下，基因组文件是发表的序列。配置文件内容：
Fvb1-1 Fvb1-2 Fvb1-3 Fvb1-4
Fvb2-1 Fvb2-2 Fvb2-3 Fvb2-4
Fvb3-1 Fvb3-2 Fvb3-3 Fvb3-4
Fvb4-1 Fvb4-2 Fvb4-3 Fvb4-4
Fvb5-1 Fvb5-2 Fvb5-3 Fvb5-4
Fvb6-1 Fvb6-2 Fvb6-3 Fvb6-4
Fvb7-1 Fvb7-2 Fvb7-3 Fvb7-4

运行命令是：subphaser -i F_ana_Camarosa_6-28-17_hardmasked.fasta -c sg.config
但是会出现：
22-04-28 18:21:39 [INFO] Consistent with subgenome assignment: 16 (6.93%); potential exchange: 149 (64.50%)
22-04-28 18:21:39 [INFO] Output: /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.bin.enrich
22-04-28 18:21:39 [INFO] ###Step: LTR
22-04-28 18:21:39 [INFO] Identifying LTR-RTs by ['ltr_harvest']
22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.scn.ok exists; skip this step
22-04-28 18:21:45 [INFO] 8186 LTRs identified
22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.tesort.ok exists; skip this step
22-04-28 18:21:45 [INFO] By TEsorter, 190 (2.3%) are classified as LTRs, of which 0 (0.0%) are intact with complete protein domains
22-04-28 18:21:45 [INFO] After filtering, 186 / 8186 (2.3%) LTRs retained
22-04-28 18:21:45 [INFO] Outputing coordinate - LTR maps to /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.ltr.bin.count
22-04-28 18:21:45 [INFO] Start Pool with 32 process(es)
22-04-28 18:21:45 [INFO] Processed 0 sequences
Traceback (most recent call last):
File "/home/hsq/miniconda3/envs/SubPhaser/bin/subphaser", line 33, in
sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 779, in main
pipeline.run()
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 516, in run
ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[])
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 563, in step_l tr
Seqs.map_kmer3([ltrfile], d_kmers, fout=fout, k=self.k, ncpu=self.ncpu,
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/Seqs.py", line 97, in map_kmer3
logger.info('{} ({:.2%}) sequences contain subgenome-specific kmers'.format(mapped_seqs, mapped_seqs/i))
ZeroDivisionError: division by zero

k15_q200_f2.kmer_pca.pdf是成功生成的。生成的pca图与老师您附表中的图不一样。
请问一下老师我是不是需要调整参数或者其他设置有什么问题。
期待老师您的回复

zhangrengang

Owner

Do not use hardmasked genome sequences, because subphaser is based on repeated sequences (TEs in most cases) in fact.

hushaoqiang

Author

老师，请问您对开麦罗莎草莓做分析时（Fig. S27 Subgenome phasing with SubPhaser in the Fragaria x ananassa genome.），使用的文件是哪一个啊。

zhangrengang

Owner

F_ana_Camarosa_6-28-17.fasta.gz. But you may not get the identical results but similar results.

hushaoqiang

Author

老师，我刚刚又去GDR网站看了一下，我只看到了 F_ana_Camarosa_6-28-17_hardmasked.fasta.gz跟[F_ana_Camarosa_6-28-17_unmasked.fasta.gz两个文件，请问一下您告诉我的F_ana_Camarosa_6-28-17.fasta.gz文件这个在哪个链接下载啊，麻烦老师您了。

zhangrengang

Owner

F_ana_Camarosa_6-28-17_unmasked.fasta.gz

hushaoqiang

Author

老师，已成功运行，4套亚基因组成功分型。图非常好看，感谢老师您的耐心指导与为科研工作者所开发的SubPhaser软件。

hushaoqiang

Author

Teacher, I have successfully found homologous chromosomes using MUMmer software, and then performed genotyping using SubPhaser. The results are very good.
Thank you!

zhangrengang

Owner

Great. Thanks for your feedback.

kashiff007

Hi @zhangrengang, Thanks for this excellent software. I have been using this for my newly assembled allopolyploidy genome (2n = 4x = 40). The genome has two subgenomes, but I don't know the homologous pairs. So as you suggested in previous comments, I used mummer and minimapper2 to find the homologous pairs and assigned them in the configurations file.

But after running the SubPhaser with default parameters I got 19 chr in SG1 and only one chr in SG2. Idially, it should be 10 chr in each subgenome.

The differential k-mer heatmap also shows the similar result:

Can you suggest why it happend? Am I approaching rightly?

zhangrengang

Owner

@kashiff007 It seems that your genome is contig-level, but not chromosome-level?

kashiff007

Yes, It is a newly assembled genome and from cytometry, we speculate that the largest 20 contigs are full genome length.

zhangrengang

Owner

@kashiff007 Can you show the homoelogous relationship such as dot plot, and your config file? How much percent do the largest 20 contigs account for the whole genome?

9 remaining items

zhangrengang

mentioned this

on Oct 2, 2024

同源染色体组 #39

hungweichen0327

Hello,

Thank you for the useful SubPhaser software.

This is the HiC contact heatmap for the plant tetraploid genome (40 pseduo-chromosomes). Since it's hard to identify the homology of this genome using the Hi-C heatmap and there is no complete genome of a closed related species, any other suggestions to identify the homology of this genome?

Thank you for the help.

zhangrengang

Owner

@hungweichen0327 You can try self alignment using aligner such as minimap2:

minimap2 genome.fa genome.fa -D -n 12 -f200 -p 0.2 -x asm20 > self.paf

The paf file can be visualized by various tools such as minidot.

hungweichen0327

Dear @zhangrengang ,

Thank you for the clear and quick reply. I will try it and upload the minimap2 result soon.

hungweichen0327

This is the minimap2 result using the suggested minimap2 script.

Do you have any suggestions about adjusting the minimap2 parameters to improve the self-alignment result?
Do you need any other information? I would appreciate any comments about getting a haplotype-phased genome using SubPhaser.

zhangrengang

Owner

The relationships appear to be clear, but there are noisies derived from an older WGD event. You can try the following parameters to reduce the noisies:

minimap2 genome.fa genome.fa -D -x asm10 > self.paf

Alternatively, you can try our SOI pipeline (https://github.com/zhangrengang/soi), so that synteny blocks derived from different WGD events can be distinguished by different Ks peaks of OI peaks.

hungweichen0327

Thank you for the suggestions.

This is the minimap2 result using the -D -x asm10 parameters.

For the SOI pipeline, I quickly viewed the contents and it seems that gff file of the genome assembly is required. Currently, I have not conducted gene prediction for this genome. So I can not run the SOI so far.

zhangrengang

Owner

The homoeologous relationships (reciprocal best alignment between two chromosomes) seem to be clear, I think you can make the config file accordingly. If there are some ambiguous relationships, the corresponding chromosomes can be left as singletons. Also you can try -D -x asm5 for minimap2 to seek more clear relationships. Or you can search some tools to visualize the alignment similarities for minimap2 output.

For the SOI pipeline, indeed gene annotation is needed, but at this stage, high quality and completeness are not required for the annotation. So you may make a fast annotation using tools such as PASA, augustus, metaeuk, etc.

hungweichen0327

I appreciate your quick and kind help. I will make the config accordingly.

For the SOI pipeline, I will make a fast annotation and run SOI soon.

hungweichen0327

Hi @zhangrengang,

I have additional questions related to the config file.
Based on the dotplot of the self-alignment, there are still six chromosomes that have ambiguous relationships. Thus, the config file is shown below:

scaffold_1 scaffold_29
scaffold_2 scaffold_24
scaffold_3 scaffold_4
scaffold_5 scaffold_9
scaffold_6 scaffold_8
scaffold_7 scaffold_13
scaffold_10 scaffold_18
scaffold_11 scaffold_15
scaffold_12 scaffold_17
scaffold_14 scaffold_16
scaffold_19 scaffold_20
scaffold_21 scaffold_25
scaffold_22 
scaffold_23 scaffold_36
scaffold_26 
scaffold_27 
scaffold_28 scaffold_33
scaffold_30 
scaffold_31 scaffold_34
scaffold_32 scaffold_37
scaffold_35 
scaffold_38 scaffold_39
scaffold_40

Q1: is the config file format correct?

Q2: I expected that this is the allotetraploid species. The genome would be aa'bb'. Is it okay if I use the above config to run SubPhaser with default options? (subphaser -i genome.fasta.gz -c sg.config) Ideally, I should identify the homology of the 4 chromosomes
e.g.,

1-1 1-2 1-3 1-4
2-1 2-2 2-3 2-4

Thank you.

zhangrengang

Owner

The format is correct.
Have you assembled all the four chromosomes aa'bb', or only the two chromosomes ab? In the later case, the config file is ok.

hungweichen0327

I assembled the four chromosomes set aa'bb' since its chromosome number is 2x=4n=40 and I assembled the 40 chromosomes as shown in the dotplot and Hi-C contact heatmap.
However, I know both dotplot and Hi-C heatmap did not represent the patterns similar to your example output above. Thus, I am not sure how to resolve this issue so far.

PS: Based on the Genomescope prediction using k-mer frequency distribution, the size of this genome assembly is reliable (1.2 Gb) because the predicted size of haploid genome is about 0.3 Gb.

zhangrengang

Owner

You need to seek the relationships between a and b (using SOI for example), and use the config file like:
a a' b b' (subphaser parameter: -nsg 2 -baseline 2),
or
a b (default parameter).

However, the subgenome size is quite small, and there are likely too few subgenome-specific repeats for subphaser (try -q 50).

hungweichen0327

Thank you for the reply.
I will try SOI to seek homology relationships of this tetraploid genome since the self-alignment result (dotplot) can not identify the 4 chromosome sets.

hungweichen0327

mentioned this

on Feb 10, 2025

suggestions of the script to identify the homology in tetraploid species zhangrengang/SOI#2

to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

Labels

No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

sg.config configuration (The homology of the de novo assembled genome is not known) #4

9 remaining items

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Participants

sg.config configuration (The homology of the de novo assembled genome is not known) #4

Description

Activity

zhangrengang commented on Apr 28, 2022

zhangrengang commented on Apr 28, 2022

hushaoqiang commented on Apr 28, 2022

zhangrengang commented on Apr 28, 2022

hushaoqiang commented on Apr 28, 2022

zhangrengang commented on Apr 28, 2022

hushaoqiang commented on Apr 28, 2022

zhangrengang commented on Apr 28, 2022

hushaoqiang commented on Apr 28, 2022

hushaoqiang commented on Apr 30, 2022

zhangrengang commented on Apr 30, 2022

kashiff007 commented on Jun 5, 2022

zhangrengang commented on Jun 5, 2022

kashiff007 commented on Jun 5, 2022

zhangrengang commented on Jun 5, 2022

9 remaining items

hungweichen0327 commented on Dec 28, 2024

zhangrengang commented on Dec 28, 2024

hungweichen0327 commented on Dec 28, 2024

hungweichen0327 commented on Dec 28, 2024

zhangrengang commented on Dec 29, 2024

hungweichen0327 commented on Dec 29, 2024

zhangrengang commented on Dec 29, 2024

hungweichen0327 commented on Dec 29, 2024

hungweichen0327 commented on Dec 30, 2024

zhangrengang commented on Dec 30, 2024

hungweichen0327 commented on Dec 30, 2024

zhangrengang commented on Dec 30, 2024

hungweichen0327 commented on Dec 31, 2024

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Participants

Issue actions