This Python script detects whole-genome duplication (WGD) using the dS-based method.
- Calculate dS values based on gene family data (paralog dS values)
python GenoDup.py -s Nuclear_sequence_file -p Protein_sequence_file -g Gene_family_file -n Maximum_number_of_gene_family
- Calculate dS values based on anchor gene pair data (anchor dS values)
python GenoDup.py -s Nuclear_sequence_file -p Protein_sequence_file -c Gene_pair_file
- Nuclear_sequence_file: contains all the nucleotide sequences used in your analysis (FASTA format).
e.g.:
>gene1
ATCG
>gene2
ATCC
...
- Protein_sequence_file: contains all the protein sequences used in your analysis (FASTA format).
e.g.:
>gene1
PAPA
>gene2
PAPA
...
- Gene_family_file: contains the gene family clusters (usually produced by OrthoMCL).
e.g.:
led1: gene1,gene2,gene3
led2: gene3,gene4
...
- Gene_pair_file: contains one pair of ohnologs per line, in two tab-separated columns (can be produced by MCScanX, OrthoMCL, or i-ADHoRe).
e.g.:
gene1 gene2
gene3 gene4
...
- Maximum_number_of_gene_family: maximum number of gene family to analyze; use only with -g (suggested value: 5-15).
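The input formats described above can be sanity-checked before a run. The following is a minimal sketch, not part of GenoDup itself: the function names (`read_fasta_ids`, `read_gene_families`, `read_gene_pairs`, `missing_ids`) are hypothetical, and the parsers assume the layouts shown in the examples (a `name: gene1,gene2` line per family, and two whitespace- or tab-separated columns per gene pair). It reports gene IDs that are referenced in a family or pair file but missing from either sequence file.

```python
def read_fasta_ids(lines):
    """Return the set of sequence IDs (first word after '>') found in FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def read_gene_families(lines):
    """Parse 'family: gene1,gene2,...' lines into a {family: [genes]} dict.

    Assumes the comma-separated layout shown in the example above.
    """
    families = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        name, members = line.split(":", 1)
        families[name.strip()] = [g.strip() for g in members.split(",") if g.strip()]
    return families


def read_gene_pairs(lines):
    """Parse two-column gene pair lines into a list of (gene_a, gene_b) tuples."""
    pairs = []
    for line in lines:
        fields = line.split()  # splits on tabs as well as spaces
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def missing_ids(gene_ids, nucleotide_ids, protein_ids):
    """Return the sorted gene IDs that are absent from either sequence file."""
    return sorted(
        g for g in set(gene_ids)
        if g not in nucleotide_ids or g not in protein_ids
    )


if __name__ == "__main__":
    # Toy data mirroring the examples in this README.
    nucleotide_ids = read_fasta_ids([">gene1", "ATCG", ">gene2", "ATCC"])
    protein_ids = read_fasta_ids([">gene1", "PAPA", ">gene2", "PAPA"])
    families = read_gene_families(["led1: gene1,gene2,gene3", "led2: gene3,gene4"])
    family_genes = [g for members in families.values() for g in members]
    print("Missing from sequence files:", missing_ids(family_genes, nucleotide_ids, protein_ids))
```

Each reader accepts any iterable of lines, so an open file handle can be passed directly, for example `read_fasta_ids(open("Nuclear_sequence_file"))`.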
- pairwise directory: contains all gene pair sequences (FASTA format).
- PAML_result: contains all codeml output files.
- dS_value.txt: contains all dS values generated by codeml.
Rscript plot_Genodup.r
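As a complement to the R plot, the dS distribution can also be inspected in Python. The sketch below is an illustration only and is not taken from GenoDup: the function name `ds_histogram`, the bin width of 0.1, and the upper cutoff of 3.0 are assumptions chosen for the example. Because the layout of dS_value.txt is not described here, the values are assumed to have been read into a list of floats already. A WGD event typically shows up as a peak in this distribution.

```python
def ds_histogram(ds_values, bin_width=0.1, max_ds=3.0):
    """Count dS values per bin of width bin_width.

    Values outside (0, max_ds] are dropped; both parameters are
    illustrative assumptions, not GenoDup defaults.
    """
    n_bins = int(round(max_ds / bin_width))
    counts = [0] * n_bins
    for ds in ds_values:
        if 0 < ds <= max_ds:
            # Clamp so that ds == max_ds falls into the last bin.
            index = min(int(ds / bin_width), n_bins - 1)
            counts[index] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical dS values for demonstration.
    example = [0.05, 0.12, 0.18, 0.15, 0.95, 2.95, 5.0]
    counts = ds_histogram(example)
    peak = max(range(len(counts)), key=counts.__getitem__)
    print(f"Most populated bin starts at dS = {peak * 0.1:.1f} with {counts[peak]} pairs")
```

Values that sit exactly on a bin edge may land in the neighboring bin because of floating-point rounding, which is acceptable for a quick look but worth keeping in mind for finer bins.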
1. Mao, Yafei. "GenoDup Pipeline: a tool to detect genome duplication using the dS-based method." PeerJ 7 (2019): e6303.
2. Abascal, Federico, Rafael Zardoya, and Maximilian J. Telford. "TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations." Nucleic acids research 38.suppl_2 (2010): W7-W13.
3. Katoh, Kazutaka, and Daron M. Standley. "MAFFT multiple sequence alignment software version 7: improvements in performance and usability." Molecular biology and evolution 30.4 (2013): 772-780.
4. Yang, Ziheng. "PAML: a program package for phylogenetic analysis by maximum likelihood." Bioinformatics 13.5 (1997): 555-556.
Please read the literature below for background on dS value calculation for WGD inference:
1. Vanneste, Kevin, et al. "Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous–Paleogene boundary." Genome research 24.8 (2014): 1334-1347.
2. Berthelot, Camille, et al. "The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates." Nature communications 5 (2014): 3657.
3. Vanneste, Kevin, Yves Van de Peer, and Steven Maere. "Inference of genome duplications from age distributions revisited." Molecular biology and evolution 30.1 (2012): 177-190.