PASA_alignment_assembly

Running the Alignment Assembly Pipeline

The command-line driven PASA pipeline needs only two input files, plus an optional third:
- The genome sequence in a multiFasta file (e.g., genome.fasta)
- The transcript sequences in a multiFasta file (e.g., transcripts.fasta)
- Optional: a file containing the list of accessions corresponding to full-length cDNAs (e.g., FL_accs.txt)
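
Before launching anything, it can help to confirm that the inputs are in place. The following is a minimal sketch and not part of PASA: the function name `check_fasta` is hypothetical, and it only verifies that a file exists, is non-empty, and begins with a FASTA header line.

```shell
# Hypothetical helper (not part of PASA): succeed only if the file exists,
# is non-empty, and its first line starts with '>' (a FASTA header).
check_fasta() {
    local file="$1"
    local first=""
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi
    # Read just the first line; tolerate a missing trailing newline.
    IFS= read -r first < "$file" || true
    case "$first" in
        '>'*) return 0 ;;
        *)    echo "first line is not a FASTA header: $file" >&2
              return 1 ;;
    esac
}

# Example usage with the file names used on this page:
#   check_fasta genome.fasta && check_fasta transcripts.fasta
```

This does not validate the sequence content itself; it is only meant to catch a wrong path or a wrong file type early.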

Step A: cleaning the transcript sequences [Optional]

Place each of these files in the same 'working' directory. Then, run the seqclean utility on your transcripts like so:

  % $PASAHOME/bin/seqclean  transcripts.fasta

If you have a database of vector sequences (e.g., UniVec), you can screen for vector as part of the cleaning process by running the following instead:

  % $PASAHOME/bin/seqclean  transcripts.fasta -v /path/to/your/vectors.fasta

This will generate several output files, including transcripts.fasta.cln and transcripts.fasta.clean. Both of these can be used as inputs to PASA.

Note that if you are running seqclean in Nextflow, you will need to set the USER environment variable (see comment).
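
One way to satisfy this is to export USER in the shell before seqclean is invoked. This is a minimal sketch, assuming a POSIX shell; the fallback value `nobody` is a placeholder chosen for illustration, and how the line is placed into the Nextflow process script is left to the workflow author.

```shell
# Set USER only if it is unset or empty. Fall back to the name reported by
# `id -un`, or to the placeholder 'nobody' (an assumption) if that fails.
export USER="${USER:-$(id -un 2>/dev/null || echo nobody)}"
echo "USER is set to: $USER"
```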

Step B: Walking Thru A Complete Example Using the Provided Sample Data

Sample inputs are provided in the $PASAHOME/sample_data directory. We'll use these inputs to demonstrate the breadth of the software, including the use of sample DATA ADAPTERs to import existing gene annotations into the database and to export tentative structural updates.

The PASA pipeline requires separate configuration files for the alignment assembly step and the later annotation comparison step. Each file is configured per run and sets the parameters used by the various tools and processes executed within the pipeline. Configuration file templates are provided as '$PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt' and '$PASAHOME/pasa_conf/pasa.annotationCompare.Template.txt'; these are described further where they are used below.

The next steps explain the current contents of the sample_data directory. You do NOT need to redo these operations:

I've copied the $PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt to alignAssembly.config and edited the pasa database name to '/tmp/sample_mydb_pasa'.

Note: if you set the database name to a fully qualified path (e.g., /path/to/my/database.sqlite), PASA will use SQLite as the relational database type. If you simply specify a database name (e.g., my_pasa_db), it will default to using MySQL.
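
The rule above can be illustrated with a small shell function. This is a sketch for illustration only, not PASA's own logic: the function name `db_backend` is hypothetical, and it assumes that "fully qualified path" means a name beginning with '/'.

```shell
# Hypothetical helper that mirrors the rule described above; not PASA code.
# Assumption: a "fully qualified path" is any name that starts with '/'.
db_backend() {
    case "$1" in
        /*) echo "SQLite" ;;
        *)  echo "MySQL" ;;
    esac
}

db_backend /tmp/sample_mydb_pasa   # prints SQLite
db_backend my_pasa_db              # prints MySQL
```

Under that assumption, the sample configuration's database name, '/tmp/sample_mydb_pasa', selects SQLite.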

My required input files exist as genome_sample.fasta and all_transcripts.fasta. Since I have some full-length cDNAs, I'm including 'FL_accs.txt' to identify these as such.
I already ran seqclean to generate the files all_transcripts.fasta.clean and all_transcripts.fasta.cln.

You must execute the following steps yourself to demonstrate the software. (The impatient can execute the entire pipeline below by running './run_sample_pipeline.pl'. If this is your first time through, it helps to walk through the steps below instead.)

Transcript alignments followed by alignment assembly

Run the PASA alignment assembly pipeline like so:

     %  $PASAHOME/Launch_PASA_pipeline.pl \
           -c alignAssembly.config -C -R -g genome_sample.fasta \
           -t all_transcripts.fasta.clean -T -u all_transcripts.fasta \
           -f FL_accs.txt --ALIGNERS blat,gmap,minimap2 --CPU 2

The '--ALIGNERS' option accepts 'gmap', 'blat', 'minimap2', or a comma-separated combination (e.g., 'gmap,blat'), in which case the listed aligners are executed in parallel. The '--CPU' setting sets the thread count used by each aligner process: it is passed on to GMAP as its thread count, and in the case of BLAT, the new pblat utility is used for parallel processing.
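
A wrapper script might check the aligner list before launching the pipeline. The following is a minimal sketch and not part of PASA: the function name `valid_aligners` is hypothetical, and it accepts only a comma-separated list drawn from the three aligner names given above.

```shell
# Hypothetical pre-flight check (not part of PASA): succeed only if the
# argument is a comma-separated list of gmap, blat, and/or minimap2.
valid_aligners() {
    local rest="$1"
    local name=""
    [ -n "$rest" ] || { echo "empty aligner list" >&2; return 1; }
    while :; do
        name="${rest%%,*}"              # text up to the first comma
        case "$name" in
            gmap|blat|minimap2) ;;
            *) echo "unknown aligner: '$name'" >&2; return 1 ;;
        esac
        case "$rest" in
            *,*) rest="${rest#*,}" ;;   # more names follow
            *)   return 0 ;;            # that was the last name
        esac
    done
}

# Example usage:
#   valid_aligners "blat,gmap,minimap2" && echo "aligner list looks fine"
```

Empty entries, such as those produced by a trailing or doubled comma, are rejected along with unrecognized names.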

This executes the following operations, generating the corresponding output files:

First, the all_transcripts.fasta file is aligned to genome_sample.fasta using the specified alignment tools. Files generated include:
- 'sample_mydb_pasa.validated_transcripts.gff3,.gtf,.bed' : the valid alignments
- 'sample_mydb_pasa.failed_gmap_alignments.gff3,.gtf,.bed' : the alignments that fail the validation test
- 'alignment.validations.output' : tab-delimited format describing the alignment validation results

Next, the valid alignments are clustered into piles based on genome alignment position, and the piles are assembled using the PASA alignment assembler. Files generated include:
- 'sample_mydb_pasa.assemblies.fasta' : the PASA assemblies in FASTA format
- 'sample_mydb_pasa.pasa_assemblies.gff3,.gtf,.bed' : the PASA assembly structures
- 'sample_mydb_pasa.pasa_alignment_assembly_building.ascii_illustrations.out' : descriptions of alignment assemblies and how they were constructed from the underlying transcript alignments
- 'sample_mydb_pasa.pasa_assemblies_described.txt' : tab-delimited format describing the contents of the PASA assemblies, including the identity of those transcripts that were assembled into the corresponding structure
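
After a run, a quick check that the main outputs were written can save time. The following is a minimal sketch and not part of PASA: the function name `missing_pasa_outputs` is hypothetical, and it checks only a subset of the files listed above, assuming they sit in a single directory and share the database name as a prefix.

```shell
# Hypothetical helper (not part of PASA): print each expected output file
# that is missing or empty, and return non-zero if any were reported.
# Arguments: <directory> <database name prefix, e.g. sample_mydb_pasa>
missing_pasa_outputs() {
    local dir="$1"
    local prefix="$2"
    local f=""
    local missing=0
    for f in \
        "alignment.validations.output" \
        "$prefix.validated_transcripts.gff3" \
        "$prefix.assemblies.fasta" \
        "$prefix.pasa_assemblies.gff3" \
        "$prefix.pasa_assemblies_described.txt"
    do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example usage from the sample_data directory:
#   missing_pasa_outputs . sample_mydb_pasa && echo "all checked outputs present"
```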
