PASA_alignment_assembly
As input to the command-line driven PASA pipeline, we need only two (potentially three) input files:
- The genome sequence in a multiFasta file (e.g., genome.fasta)
- The transcript sequences in a multiFasta file (e.g., transcripts.fasta)
- Optional: a file containing the list of accessions corresponding to full-length cDNAs (e.g., FL_accs.txt)
Have each of these files in the same 'working' directory. Then, run the seqclean utility on your transcripts like so:
% $PASAHOME/bin/seqclean transcripts.fasta
If you have a database of vector sequences (e.g., UniVec), you can screen for vector as part of the cleaning process by running the following instead:
% $PASAHOME/bin/seqclean transcripts.fasta -v /path/to/your/vectors.fasta
This will generate several output files, including transcripts.fasta.cln and transcripts.fasta.clean. Both of these can be used as inputs to PASA.
Note that if you are running seqclean in Nextflow, you will need to set the USER environment variable (see comment).
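Where USER is unset in the task environment, as can happen under a workflow manager, a minimal sketch of a workaround is to define it before calling seqclean. This assumes seqclean only needs USER to hold some non-empty value; the fallback to `id -un` and the placeholder name `pasa_user` are illustrative choices, not something PASA prescribes.

```shell
# Assumption: seqclean only needs USER to be non-empty.
# Keep an existing value; otherwise fall back to the current login name,
# or to a hypothetical placeholder if the login name cannot be resolved.
if [ -z "${USER:-}" ]; then
    USER="$(id -un 2>/dev/null || echo pasa_user)"
fi
export USER

# seqclean can then be run as shown above, for example:
#   $PASAHOME/bin/seqclean transcripts.fasta
```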
Sample inputs are provided in the $PASAHOME/sample_data directory. We'll use these inputs to demonstrate the breadth of the software application, including using sample DATA ADAPTERs to import existing gene annotations into the database and to export tentative structural updates.
The PASA pipeline requires separate configuration files for the alignment assembly step and the later annotation comparison step. Each file is configured for an individual run of the pipeline and sets the parameters used by the tools and processes that PASA executes. Configuration file templates are provided as '$PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt' and '$PASAHOME/pasa_conf/pasa.annotationCompare.Template.txt'; these are described further where they are used below.
The next steps explain the current contents of the sample_data directory. You do NOT need to redo these operations:
- I've copied the $PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt to alignAssembly.config and edited the pasa database name to '/tmp/sample_mydb_pasa'.
Note: if you set the database name to a fully qualified path (e.g., /path/to/my/database.sqlite), PASA will use SQLite as the relational database type. If you simply specify a database name (e.g., my_pasa_db), it will default to using MySQL.
- My required input files exist as: genome_sample.fasta, all_transcripts.fasta, and since I have some full-length cDNAs, I'm including 'FL_accs.txt' to identify these as such.
- I already ran seqclean to generate files: all_transcripts.fasta.clean and all_transcripts.fasta.cln
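The configuration step described above can be sketched in shell as follows. Both helper names (`pasa_db_backend`, `make_align_config`) are hypothetical and not part of PASA. The sketch assumes that "fully qualified path" means a name beginning with "/", and that the template contains a single line beginning with `DATABASE=`; check your copy of the template before relying on the latter.

```shell
# Illustrative helpers; neither function is part of PASA.

# Report which backend a database name selects under the rule above.
# Assumption: "fully qualified path" means the name starts with "/".
pasa_db_backend() {
    case "$1" in
        /*) echo "sqlite" ;;   # path -> SQLite
        *)  echo "mysql" ;;    # bare name -> MySQL
    esac
}

# Copy a config template and set its database name.
# Assumption: the template holds a line beginning with "DATABASE=".
# Usage: make_align_config TEMPLATE OUTPUT DBNAME
# (DBNAME must not contain "|", "&" or backslashes, which sed would interpret.)
make_align_config() {
    sed "s|^DATABASE=.*|DATABASE=$3|" "$1" > "$2"
}
```

For the sample data, this would be called as `make_align_config "$PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt" alignAssembly.config /tmp/sample_mydb_pasa`, and `pasa_db_backend /tmp/sample_mydb_pasa` would report `sqlite`.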
To demonstrate the software, you must execute the following steps. (The impatient can execute the entire pipeline below by running './run_sample_pipeline.pl'. If this is your first time through, it helps to walk through the steps below instead.)
- Run the PASA alignment assembly pipeline like so:
% $PASAHOME/Launch_PASA_pipeline.pl \
-c alignAssembly.config -C -R -g genome_sample.fasta \
-t all_transcripts.fasta.clean -T -u all_transcripts.fasta \
-f FL_accs.txt --ALIGNERS blat,gmap,minimap2 --CPU 2
The '--ALIGNERS' option accepts 'gmap', 'blat', 'minimap2', or a comma-separated combination (e.g., 'gmap,blat'), in which case the listed aligners are executed in parallel. The '--CPU' setting determines the number of threads to be used by each process. It is passed on to GMAP as the thread count. For BLAT, the newer pblat utility is used for parallel processing.
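Before launching a long run, it can help to check the aligner list for typos. The helper below is a minimal sketch and is not part of PASA: `valid_aligners` is a hypothetical name, and it accepts only the three aligner names mentioned above, alone or comma-separated.

```shell
# Illustrative helper (not part of PASA): succeed only if the argument is a
# comma-separated list drawn from the aligner names listed above.
valid_aligners() {
    case "$1" in
        # Reject empty input, unexpected characters, and stray commas.
        ''|*[!a-z0-9,]*|,*|*,|*,,*) return 1 ;;
    esac
    for name in $(printf '%s' "$1" | tr ',' ' '); do
        case "$name" in
            gmap|blat|minimap2) ;;   # known aligner
            *) return 1 ;;           # anything else is rejected
        esac
    done
    return 0
}
```

A launch script could then guard the pipeline call with something like `valid_aligners "blat,gmap,minimap2" || exit 1`.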
This executes the following operations, generating the corresponding output files:
- Aligns the all_transcripts.fasta file to genome_sample.fasta using the specified alignment tools. Files generated include:
  - 'sample_mydb_pasa.validated_transcripts.gff3,.gtf,.bed': the valid alignments
  - 'sample_mydb_pasa.failed_gmap_alignments.gff3,.gtf,.bed': the alignments that fail the validation test
  - 'alignment.validations.output': tab-delimited format describing the alignment validation results
- The valid alignments are clustered into piles based on genome alignment position and piles are assembled using the PASA alignment assembler. Files generated include:
  - 'sample_mydb_pasa.assemblies.fasta': the PASA assemblies in FASTA format.
  - 'sample_mydb_pasa.pasa_assemblies.gff3,.gtf,.bed': the PASA assembly structures.
  - 'sample_mydb_pasa.pasa_alignment_assembly_building.ascii_illustrations.out': descriptions of alignment assemblies and how they were constructed from the underlying transcript alignments.
  - 'sample_mydb_pasa.pasa_assemblies_described.txt': tab-delimited format describing the contents of the PASA assemblies, including the identity of those transcripts that were assembled into the corresponding structure.
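After a run, a quick way to confirm that the outputs listed above were written is to expand the file names for a given database prefix and test for each one. This is a sketch only: both function names are hypothetical, the names are taken verbatim from the list above, and the actual set of files may differ depending on which aligners were used.

```shell
# Illustrative helpers (not part of PASA).

# Print the output file names listed above for a database prefix
# such as sample_mydb_pasa.
pasa_expected_outputs() {
    prefix=$1
    for stem in validated_transcripts failed_gmap_alignments pasa_assemblies; do
        for ext in gff3 gtf bed; do
            echo "$prefix.$stem.$ext"
        done
    done
    echo "$prefix.assemblies.fasta"
    echo "$prefix.pasa_alignment_assembly_building.ascii_illustrations.out"
    echo "$prefix.pasa_assemblies_described.txt"
    echo "alignment.validations.output"
}

# Report any listed file that is absent from the current directory.
report_missing_outputs() {
    pasa_expected_outputs "$1" | while IFS= read -r f; do
        [ -e "$f" ] || echo "missing: $f"
    done
}
```

Running `report_missing_outputs sample_mydb_pasa` in the working directory prints one "missing:" line per absent file and nothing if all are present.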