Skip to content

Files

Latest commit

b45408b · Jun 30, 2023

History

History

00_data

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
Nov 18, 2022
Nov 18, 2022
Nov 18, 2022
Nov 18, 2022
Nov 18, 2022
Nov 18, 2022
Jul 15, 2022
Nov 18, 2022
Jul 15, 2022
Nov 18, 2022
Nov 18, 2022
Nov 18, 2022
Jun 30, 2023

Obtaining and parsing molecular data

1. Downloading sequence data

We are using the "dataset 1" described in Hou et al. 2007. This study only lists the accession numbers of the myxovirus genes used in the "Materials and Methods" section, but does not provide the reader with the alignments they used. Consequently, we decided to download the sequences using the given accession numbers and parsed the data following the procedure described in Hou et al. 2007, whenever possible.

Below, you can find the link to the site where each myxovirus sequence was downloaded for the taxa present in the phylogeny studied by Hou et al. 2007. Some sequences have additional notes used to highlight issues we encountered when trying to download them:

In Hou et al. 2007, the sequences of rhesus macaque and chimpanzee were extracted from the ENSEMBL predicted transcripts (version 39, 2006), whereas all other mRNA sequences or cDNA sequences were retrieved from GenBank.

NOTE: For ENSEMBL, we could not access ENSEMBL v39, so we downloaded the two sequences from the current ENSEMBL version (ENSEMBL v104, May 2021 -- links provided above). Rhesus macaque: It said that the gene to export was ENSMMUG00000015329.4 (MX1). Therefore this is the variant downloaded. Chimpanzee: It said that the gene to export was ENSPTRG00000013927.5 (MX1). Therefore this is the variant downloaded.

2. Generating gene alignments

According to Hou et al. 2007, the coding sequences were aligned based on translated protein sequences for multispecies data. The alignment of multiple DNA sequences based on predicted protein sequences was generated using the ClustalW program.

Given that the alignments or the code to generate the alignments were not provided as supplementary material, we proceeded to generate the gene alignments based on the details they wrote in their paper with the aim to get an inferred alignment close to the one they might have been using, which are summarised in the paragraph above.

Processing raw input data

First, we had to concatenate all the sequences we had previously downloaded (see links provided above) in a unique FASTA file, which we had saved in the raw_data/data1 directory. In order to do this, we ran the following code from directory raw_data:

# Run from `raw_data` directory 
for i in data1/*fasta
do
dir=$( echo $i | sed 's/\/..*//' )
name=$( echo $i | sed 's/..*\///' | sed 's/\.fasta//' )
sed 's/^>..*/\>'${name}'/' $i >> $dir/$dir"_unaln_raw.fasta"
printf "\n" >> $dir/$dir"_unaln_raw.fasta"
done 

Then, we generated a FASTA file in which each sequence was written in one line, one after the other:

# First, give permissions to the `scripts` 
# directory so that you can execute the
# various scripts saved in this directory
# throughout this tutorial
# Once you are in the `scripts` directory,
# run the following two commands
chmod 775 *
chmod 775 pal2nal_v14/*
# Now, move to the `raw_data` directory
cd ../raw_data
# From this directory, run the following commands
for i in data1/*raw.fasta
do
name=$( echo $i | sed 's/\.fasta//' )
name2=$( echo $name | sed 's/\_raw//' )
../scripts/one_line_fasta.pl $i 
mv $name"_one_line.fa" $name2".fasta"
done

Aligning sequence data

First procedure followed to align dataset 1

Once we had our sequences in a unique FASTA file, we used the pipeline TranslatorX to get the protein alignment. The settings we chose are the following:

  • Build the alignment with ClustalW, same software the authors of the study used.
  • Set the same genetic code for all taxa and select the option to guess the most likely reading frame.
  • We did not choose to clean the alignment in blocks.

Files were saved as data1_nuc_aln.fasta and data1_prot_aln.fasta. We also saved the log generated by this tool (translatorx_3423.clustalw.log.txt) when ClustalW was used to generate the alignment as well as the log file with the compositional bias (GCcontent_log_translatorx.txt). All these output files are saved in alignments/data1.

Then, we also ran PAL2NAL to convert the alignments in codon data (just to double check the nucleotide alignment is the same). PAL2NAL is a PERL script (see directory scripts/pal2nal_v14) that uses a protein alignment and a file with the DNA sequence to match, and then outputs a codon-based DNA alignment. The commands we used are the following:

# Run from the directory where this
# README.md file is
cd alignments/data1
mkdir pal2nal_checks
../../scripts/pal2nal_v14/pal2nal.pl data1_prot_aln.fasta data1_nuc_aln.fasta -output fasta > pal2nal_checks/data1pal2nal_out.fasta
# Convert output file to one line fasta file
../../scripts/one_line_fasta.pl pal2nal_checks/data1pal2nal_out.fasta

The last step is to convert the alignment from FASTA to PHYLIP format. For that purpose, we use the PERL script FASTAtoPHYL.pl (saved in the scripts directory) as it follows:

# Run from this directory where
# this README.md file is
cd alignments/data1
num=$( grep '>' data1_nuc_aln.fasta | wc -l )
len=$( sed -n '2,2p' data1_nuc_aln.fasta | sed 's/\r//' | sed 's/\n//' | wc -L )
perl ../../scripts/FASTAtoPHYL.pl data1_nuc_aln.fasta $num $len 
mv data1_nuc_aln.phy data1.phy

Second procedure followed to align dataset 1

We detected that the alignment generated when using ClustalW resulted in a section of ~189 gaps for most of the mammal taxa except for the outgroup taxa and the mouse sequence. Therefore, we decided to use MAFFT as an alternative approach to generate the alignment for dataset 1 (note that other software such as PRANK (Löytynoja and Goldman 2005, 2008) or PAGAN (Löytynoja et al. 2012), among many other available aligners, could have been used instead). First, we downloaded the translatorX.pl PERL script and saved it in the directory translatorX_perl. We copied the unaligned FASTA file we had already generated for dataset 1 (i.e., data1_unaln.fasta) and ran the following:

# Run from `00_data/translatorX_perl` directory
./translatorX.pl -i data1_unaln.fasta -p F -o mafft_translatorx
mkdir ../alignments_mafft
cp mafft_translatorx.nt_ali.fasta ../alignments_mafft/data1_nuc_mafft_aln.fasta
cp mafft_translatorx.aa_ali.fasta ../alignments_mafft/data1_prot_mafft_aln.fasta

NOTE: We saved the screen output in file log.out_translatorx.txt.

Then, we ran PAL2NAL to convert the alignments in codon data (just to double check the nucleotide alignment is the same). This is a PERL script (see directory scripts/pal2nal_v14) that uses a protein alignment and a file with the DNA sequence to match, and then outputs a codon-based DNA alignment. The commands we used are the following:

# Run from this directory where this
# README.md file is
cd alignments_mafft
mkdir pal2nal_checks
../scripts/pal2nal_v14/pal2nal.pl data1_prot_mafft_aln.fasta data1_nuc_mafft_aln.fasta -output fasta > pal2nal_checks/data1_pal2nal_mafft_out.fasta
# Convert output file to one line fasta file
../scripts/one_line_fasta.pl pal2nal_checks/data1_pal2nal_mafft_out.fasta

The last step is to convert the FASTA file in PHYLIP format. For that purpose, we use the PERL script FASTAtoPHYL.pl (saved in the scripts directory) as it follows:

# Run from this directory where
# this README.md file is
cd alignments_mafft
num=$( grep '>' data1_nuc_mafft_aln.fasta | wc -l )
len=$( sed -n '2,2p' data1_nuc_mafft_aln.fasta | sed 's/\r//' | sed 's/\n//' | wc --m )
perl ../scripts/FASTAtoPHYL.pl data1_nuc_mafft_aln.fasta $num $len 
mv data1_nuc_mafft_aln.phy data1_mafft.phy

Third procedure followed to align dataset 1

We detected that both alignments generated with ClustalW and MAFFT resulted in a section with gaps (more gaps with ClustalW than with MAFFT). Therefore, we took the alignment generated with MAFFT, data1_mafft.phy, and manually got rid of the first 213 nucleotides (71 codons). The resulting alignment (1992 bp) can be found in directory alignments_mafft_nogaps as data1_mafft_nogaps.phy.

3. Generating gene tree

Procedure described in Hou et al. 2007

The phylogenetic tree was constructed by the neighbor-joining (NJ) method Saitou and Nei, 1987 based on predicted protein sequences. The p-distances (proportion of differences) with a high resolution of branching pattern were calculated Nei and Kumar, 2000. One thousand bootstrap replicates were carried out to test the support for each node in the tree.

Procedure we followed

During this study, we followed a maximum-likelihood approach to estimate the best-scoring maximum likelihood tree with RAxML v8.2.10 (Stamatakis 2014). We used the following code:

# NOTE: This command was included in a script 
# (`scripts/run_raxml.sh`) as part of a job 
# array ran in an HPC.
# Directory `1` corresponds to `RAxML` directory,
# `2` to `RAxML_mafft`, and `3` to
# `RAxML_mafft_nogaps`.
# The first directory contains the alignment
# generated with `ClustalW`, the second the one 
# generated with `MAFFT`, and the third has the
# same alignment as in directory `MAFFT` but 
# without the gap section.
# The input file, variable `$seq`, is `data1.phy`, 
# `data1_mafft.phy`, or `data1_mafft_nogaps.phy`; respectively.
# Variable `$name` is `data1`, `data1_mafft`, or 
# `data1_mafft_nogaps`; respectively. 
raxmlHPC -f a -m GTRGAMMA -p 12345 -# 100 -x 12345 -# 500 -s $seq -o Duck,Chicken -n $name

For each analysis with each gene alignment, the best-scoring ML trees are RAxML/1/RAxML_bestTree.data1, RAxML_mafft/1/RAxML_bestTree.data1_mafft, and RAxML_mafft_nogaps/1/RAxML_bestTree.data1_mafft_nogaps; respectively. We viewed these trees in FigTree and made sure that the branches for duck and chicken were used to root the tree. We also ordered the nodes in increasing order so it matched the way the tree had been displayed in the original paper.

The rooted trees were then saved as data1.tree (Newick format) in each corresponding directory. All of the inferred gene trees agree with the tree topology.

NOTE: The best-scoring ML tree for dataset 1 in Hou et al. 2007 (Fig. S1) has C. familiaris as an outgroup to the clade with pig, cow, and sheep. We did not obtain this tree topology. We have kept the tree topology estimated by RAxML as the same tree topology was inferred regardless of the alignment used. NOTE: The successor of RAxML v8.2.10 is RAxML-NG (Kozlov et al. 2019) (see the GitHub with all the detailed tutorials to run RAxML-NG). We had already run our code with RAxML v8.2.10 and, given that the estimated tree topology matched the one obtained by the original study, we did not redo the tree inference with RAxML-NG. Nevertheless, had we needed to use the branch lengths (CODEML only requires the tree topology, not the branch lenghts) or had we encountered a conflictive or unresolved phylogeny, we would have re-run our ML inference with RAxML-NG as recommended by the software developers.

Then, we just used the next command to get rid of the branch lengths and obtain the tree topologies to be used in CODEML:

# Run this code from the 
# directory where this
# `README.md` file is 
for i in RAxML* 
do
echo Parsing tree in directory $i
printf "12  1\n" > $i/data1_nobl.tree
sed 's/\:[0-9]*\.[0-9]*//g' $i/data1.tree >> $i/data1_nobl.tree
done

4. Summary

For the analyses with CODEML, we decided to use the alignment generated with MAFFT and in which the gap section at the beginning of the alignment was removed. Before getting started, however, we decided to reorder the position of the sequences in the alignment so they matched the order in which the taxa are found in the phylogeny. In that way, we expect the output data by CODEML to be read in a more intuitive manner. For that purpose, we did the following:

# Create an alignment file with the PHYLIP 
# header and the next line, then grep the 
# sequences according to phylogeny
# Run this code from the directory where this 
# `README.md` file is found
sed -n '1,2p' alignments_mafft_nogaps/data1_mafft_nogaps.phy > Mx_aln.phy
grep 'Rhesus' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Orangutan' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Chimpanzee' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Human' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Dog' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Pig' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Cow' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Sheep' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Rat' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Mouse' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Chicken' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy
grep 'Duck' alignments_mafft_nogaps/data1_mafft_nogaps.phy >> Mx_aln.phy

# Add an identifier in the tag 
# to show they are myxovirus 
# sequences
sed -i 's/      /\_Mx      /g' Mx_aln.phy

Last, as we modified the tag names in the alignment, we need to do the same for the tree file. Therefore:

# Run this code from the directory where this 
# `README.md` file is found
cp RAxML_mafft_nogaps/data1_nobl.tree Mx_root.tree
sed -i 's/Chimpanzee/Chimpanzee\_Mx/' Mx_root.tree
sed -i 's/Human/Human\_Mx/' Mx_root.tree
sed -i 's/Orangutan/Orangutan\_Mx/' Mx_root.tree
sed -i 's/Rhesus\_macaque/Rhesus\_macaque\_Mx/' Mx_root.tree
sed -i 's/Sheep/Sheep\_Mx/' Mx_root.tree
sed -i 's/Cow/Cow\_Mx/' Mx_root.tree
sed -i 's/Pig/Pig\_Mx/' Mx_root.tree
sed -i 's/Dog/Dog\_Mx/' Mx_root.tree
sed -i 's/Mouse/Mouse\_Mx/' Mx_root.tree
sed -i 's/Rat/Rat\_Mx/' Mx_root.tree
sed -i 's/Duck/Duck\_Mx/' Mx_root.tree
sed -i 's/Chicken/Chicken\_Mx/' Mx_root.tree

After that, we manually unroot the tree before starting the analyses in CODEML (i.e., we remove the parentheses that include the duck and chicken tags) and save it as Mx_unroot.tree. You can also generate the tree if you run the next command:

# Run this code from the directory where this 
# `README.md` file is found
cp Mx_root.tree Mx_unroot.tree
sed -i 's/(Duck/Duck/' Mx_unroot.tree
sed -i 's/Chicken\_Mx)/Chicken\_Mx/' Mx_unroot.tree
printf "\n" >> Mx_unroot.tree
# Resulting unrooted tree you should see in the
# output file:
#   ((((((Chimpanzee_Mx,Human_Mx),Orangutan_Mx),Rhesus_macaque_Mx),(((Sheep_Mx,Cow_Mx),Pig_Mx),Dog_Mx)),(Mouse_Mx,Rat_Mx)),Duck_Mx,Chicken_Mx);

Once the input files for CODEML have been generated (i.e., Mx_aln.phy, Mx_root.tree, and Mx_unroot.tree), we can copy them into the 00_protocol_CODEML directory in this GitHub repository. Please note that the Mx_aln.phy alignment file you will find in this directory still has STOP codons, and hence it has 1992 nucleotides. Nevertheless, the Mx_aln.phy you will find in the 00_protocol_CODEML directory does not have STOP codons (i.e., the alignment length is reduced 3 nucleotides, and so the PHYLIP header is changed). We manually removed the last three nucleotides and changed the header so the alignment is now ready to be used with CODEML.