Skip to content

sg.config configuration (The homology of the de novo assembled genome is not known) #4

Open
@hushaoqiang

Description

@hushaoqiang

Hello, thank you very much for your SubPhaser software, which is very useful for genotyping. I used HIFI and HIC to assemble the genome of Fragaria x ananassa but I do not know the homology of the 28 scaffolds. Should I put all the chromosomes in the sg.config file in one line?
I am looking forward to your reply.

Activity

zhangrengang

zhangrengang commented on Apr 28, 2022

@zhangrengang
Owner

No. It may be very easy to know the homology by aligning your chromosomes to the genome of Fragaria vesca using tools such as minimap2 etc. You may be able to get a figure like below:

fan-fve

So the sg.config configuration can be set like:

1-1 1-2 1-3 1-4
2-1 2-2 2-3 2-4
...
zhangrengang

zhangrengang commented on Apr 28, 2022

@zhangrengang
Owner

Another way for you, the HiC contact heatmap can also show diagonal signals between homeologous chromosomes like:
hic-heatmap
So the sg.config configuration can be set like:

chr01a chr01b chr01c chr01d
chr02a chr02b chr02c chr02d
...

Note that the HiC heatmap should not be filtered by mapping qualtiy of reads.

hushaoqiang

hushaoqiang commented on Apr 28, 2022

@hushaoqiang
Author

谢谢老师您的解答,我将使用您所推荐的方式来做同源分析。另外还有个问题请教一下老师您,为了说详细点,我就是用中文了。
老师,使用Camarosa草莓的基因组序列想要试运行一下,基因组文件是发表的序列。配置文件内容:
Fvb1-1 Fvb1-2 Fvb1-3 Fvb1-4
Fvb2-1 Fvb2-2 Fvb2-3 Fvb2-4
Fvb3-1 Fvb3-2 Fvb3-3 Fvb3-4
Fvb4-1 Fvb4-2 Fvb4-3 Fvb4-4
Fvb5-1 Fvb5-2 Fvb5-3 Fvb5-4
Fvb6-1 Fvb6-2 Fvb6-3 Fvb6-4
Fvb7-1 Fvb7-2 Fvb7-3 Fvb7-4

运行命令是:subphaser -i F_ana_Camarosa_6-28-17_hardmasked.fasta -c sg.config
但是会出现:
22-04-28 18:21:39 [INFO] Consistent with subgenome assignment: 16 (6.93%); potential exchange: 149 (64.50%)
22-04-28 18:21:39 [INFO] Output: /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.bin.enrich
22-04-28 18:21:39 [INFO] ###Step: LTR
22-04-28 18:21:39 [INFO] Identifying LTR-RTs by ['ltr_harvest']
22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.scn.ok exists; skip this step
22-04-28 18:21:45 [INFO] 8186 LTRs identified
22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.tesort.ok exists; skip this step
22-04-28 18:21:45 [INFO] By TEsorter, 190 (2.3%) are classified as LTRs, of which 0 (0.0%) are intact with complete protein domains
22-04-28 18:21:45 [INFO] After filtering, 186 / 8186 (2.3%) LTRs retained
22-04-28 18:21:45 [INFO] Outputing coordinate - LTR maps to /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.ltr.bin.count
22-04-28 18:21:45 [INFO] Start Pool with 32 process(es)
22-04-28 18:21:45 [INFO] Processed 0 sequences
Traceback (most recent call last):
File "/home/hsq/miniconda3/envs/SubPhaser/bin/subphaser", line 33, in
sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 779, in main
pipeline.run()
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 516, in run
ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[])
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 563, in step_l tr
Seqs.map_kmer3([ltrfile], d_kmers, fout=fout, k=self.k, ncpu=self.ncpu,
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/Seqs.py", line 97, in map_kmer3
logger.info('{} ({:.2%}) sequences contain subgenome-specific kmers'.format(mapped_seqs, mapped_seqs/i))
ZeroDivisionError: division by zero

k15_q200_f2.kmer_pca.pdf是成功生成的。生成的pca图与老师您附表中的图不一样。
请问一下老师我是不是需要调整参数或者其他设置有什么问题。
期待老师您的回复

zhangrengang

zhangrengang commented on Apr 28, 2022

@zhangrengang
Owner

Do not use hardmasked genome sequences, because subphaser is based on repeated sequences (TEs in most cases) in fact.

hushaoqiang

hushaoqiang commented on Apr 28, 2022

@hushaoqiang
Author

老师,请问您对开麦罗莎草莓做分析时(Fig. S27 Subgenome phasing with SubPhaser in the Fragaria x ananassa genome.),使用的文件是哪一个啊。

zhangrengang

zhangrengang commented on Apr 28, 2022

@zhangrengang
Owner

F_ana_Camarosa_6-28-17.fasta.gz. But you may not get the identical results but similar results.

hushaoqiang

hushaoqiang commented on Apr 28, 2022

@hushaoqiang
Author

老师,我刚刚又去GDR网站看了一下,我只看到了 F_ana_Camarosa_6-28-17_hardmasked.fasta.gz跟[F_ana_Camarosa_6-28-17_unmasked.fasta.gz两个文件,请问一下您告诉我的F_ana_Camarosa_6-28-17.fasta.gz文件这个在哪个链接下载啊,麻烦老师您了。

zhangrengang

zhangrengang commented on Apr 28, 2022

@zhangrengang
Owner

F_ana_Camarosa_6-28-17_unmasked.fasta.gz

hushaoqiang

hushaoqiang commented on Apr 28, 2022

@hushaoqiang
Author

老师,已成功运行,4套亚基因组成功分型。图非常好看,感谢老师您的耐心指导与为科研工作者所开发的SubPhaser软件。

hushaoqiang

hushaoqiang commented on Apr 30, 2022

@hushaoqiang
Author

Teacher, I have successfully found homologous chromosomes using MUMmer software, and then performed genotyping using SubPhaser. The results are very good.
Thank you!

zhangrengang

zhangrengang commented on Apr 30, 2022

@zhangrengang
Owner

Great. Thanks for your feedback.

kashiff007

kashiff007 commented on Jun 5, 2022

@kashiff007

Hi @zhangrengang, Thanks for this excellent software. I have been using this for my newly assembled allopolyploidy genome (2n = 4x = 40). The genome has two subgenomes, but I don't know the homologous pairs. So as you suggested in previous comments, I used mummer and minimapper2 to find the homologous pairs and assigned them in the configurations file.

But after running the SubPhaser with default parameters I got 19 chr in SG1 and only one chr in SG2. Idially, it should be 10 chr in each subgenome.
image
The differential k-mer heatmap also shows the similar result:
image

Can you suggest why it happend? Am I approaching rightly?

zhangrengang

zhangrengang commented on Jun 5, 2022

@zhangrengang
Owner

@kashiff007 It seems that your genome is contig-level, but not chromosome-level?

kashiff007

kashiff007 commented on Jun 5, 2022

@kashiff007

Yes, It is a newly assembled genome and from cytometry, we speculate that the largest 20 contigs are full genome length.

zhangrengang

zhangrengang commented on Jun 5, 2022

@zhangrengang
Owner

@kashiff007 Can you show the homoelogous relationship such as dot plot, and your config file? How much percent do the largest 20 contigs account for the whole genome?

9 remaining items

hungweichen0327

hungweichen0327 commented on Dec 28, 2024

@hungweichen0327

Hello,

Thank you for the useful SubPhaser software.

image

This is the HiC contact heatmap for the plant tetraploid genome (40 pseduo-chromosomes). Since it's hard to identify the homology of this genome using the Hi-C heatmap and there is no complete genome of a closed related species, any other suggestions to identify the homology of this genome?

Thank you for the help.

zhangrengang

zhangrengang commented on Dec 28, 2024

@zhangrengang
Owner

@hungweichen0327 You can try self alignment using aligner such as minimap2:

minimap2 genome.fa genome.fa -D -n 12 -f200 -p 0.2 -x asm20 > self.paf

The paf file can be visualized by various tools such as minidot.

hungweichen0327

hungweichen0327 commented on Dec 28, 2024

@hungweichen0327

Dear @zhangrengang ,

Thank you for the clear and quick reply. I will try it and upload the minimap2 result soon.

hungweichen0327

hungweichen0327 commented on Dec 28, 2024

@hungweichen0327

image
This is the minimap2 result using the suggested minimap2 script.

Do you have any suggestions about adjusting the minimap2 parameters to improve the self-alignment result?
Do you need any other information? I would appreciate any comments about getting a haplotype-phased genome using SubPhaser.

zhangrengang

zhangrengang commented on Dec 29, 2024

@zhangrengang
Owner

The relationships appear to be clear, but there are noisies derived from an older WGD event. You can try the following parameters to reduce the noisies:

minimap2 genome.fa genome.fa -D -x asm10 > self.paf

Alternatively, you can try our SOI pipeline (https://github.com/zhangrengang/soi), so that synteny blocks derived from different WGD events can be distinguished by different Ks peaks of OI peaks.

hungweichen0327

hungweichen0327 commented on Dec 29, 2024

@hungweichen0327

Thank you for the suggestions.

This is the minimap2 result using the -D -x asm10 parameters.
yahs_final_revised out_JBAT FINAL self_alignment_v2

For the SOI pipeline, I quickly viewed the contents and it seems that gff file of the genome assembly is required. Currently, I have not conducted gene prediction for this genome. So I can not run the SOI so far.

zhangrengang

zhangrengang commented on Dec 29, 2024

@zhangrengang
Owner

The homoeologous relationships (reciprocal best alignment between two chromosomes) seem to be clear, I think you can make the config file accordingly. If there are some ambiguous relationships, the corresponding chromosomes can be left as singletons. Also you can try -D -x asm5 for minimap2 to seek more clear relationships. Or you can search some tools to visualize the alignment similarities for minimap2 output.

For the SOI pipeline, indeed gene annotation is needed, but at this stage, high quality and completeness are not required for the annotation. So you may make a fast annotation using tools such as PASA, augustus, metaeuk, etc.

hungweichen0327

hungweichen0327 commented on Dec 29, 2024

@hungweichen0327

I appreciate your quick and kind help. I will make the config accordingly.

For the SOI pipeline, I will make a fast annotation and run SOI soon.

hungweichen0327

hungweichen0327 commented on Dec 30, 2024

@hungweichen0327

Hi @zhangrengang,

I have additional questions related to the config file.
Based on the dotplot of the self-alignment, there are still six chromosomes that have ambiguous relationships. Thus, the config file is shown below:

scaffold_1 scaffold_29
scaffold_2 scaffold_24
scaffold_3 scaffold_4
scaffold_5 scaffold_9
scaffold_6 scaffold_8
scaffold_7 scaffold_13
scaffold_10 scaffold_18
scaffold_11 scaffold_15
scaffold_12 scaffold_17
scaffold_14 scaffold_16
scaffold_19 scaffold_20
scaffold_21 scaffold_25
scaffold_22 
scaffold_23 scaffold_36
scaffold_26 
scaffold_27 
scaffold_28 scaffold_33
scaffold_30 
scaffold_31 scaffold_34
scaffold_32 scaffold_37
scaffold_35 
scaffold_38 scaffold_39
scaffold_40 

Q1: is the config file format correct?

Q2: I expected that this is the allotetraploid species. The genome would be aa'bb'. Is it okay if I use the above config to run SubPhaser with default options? (subphaser -i genome.fasta.gz -c sg.config) Ideally, I should identify the homology of the 4 chromosomes
e.g.,

1-1 1-2 1-3 1-4
2-1 2-2 2-3 2-4

Thank you.

zhangrengang

zhangrengang commented on Dec 30, 2024

@zhangrengang
Owner
  1. The format is correct.
  2. Have you assembled all the four chromosomes aa'bb', or only the two chromosomes ab? In the later case, the config file is ok.
hungweichen0327

hungweichen0327 commented on Dec 30, 2024

@hungweichen0327

I assembled the four chromosomes set aa'bb' since its chromosome number is 2x=4n=40 and I assembled the 40 chromosomes as shown in the dotplot and Hi-C contact heatmap.
However, I know both dotplot and Hi-C heatmap did not represent the patterns similar to your example output above. Thus, I am not sure how to resolve this issue so far.

PS: Based on the Genomescope prediction using k-mer frequency distribution, the size of this genome assembly is reliable (1.2 Gb) because the predicted size of haploid genome is about 0.3 Gb.

zhangrengang

zhangrengang commented on Dec 30, 2024

@zhangrengang
Owner

You need to seek the relationships between a and b (using SOI for example), and use the config file like:
a a' b b' (subphaser parameter: -nsg 2 -baseline 2),
or
a b (default parameter).

However, the subgenome size is quite small, and there are likely too few subgenome-specific repeats for subphaser (try -q 50).

hungweichen0327

hungweichen0327 commented on Dec 31, 2024

@hungweichen0327

Thank you for the reply.
I will try SOI to seek homology relationships of this tetraploid genome since the self-alignment result (dotplot) can not identify the 4 chromosome sets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @kashiff007@zhangrengang@kedduck@hungweichen0327@hushaoqiang

        Issue actions

          sg.config configuration (The homology of the de novo assembled genome is not known) · Issue #4 · zhangrengang/SubPhaser