Brian Haas edited this page Nov 5, 2024 · 30 revisions

Gene Structure Annotation and Analysis Using PASA

PASA, an acronym for Program to Assemble Spliced Alignments (pronounced 'pass-uh'), is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures and to keep gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.

Now available: A hybrid approach to transcript reconstruction using genome-guided and de novo RNA-Seq assemblies to generate a comprehensive transcript database.

Using short (e.g., Illumina) and/or long (e.g., PacBio or Nanopore) reads? First assemble them with a genome-guided (e.g., StringTie, Trinity) or genome-free de novo (e.g., Trinity, Rattle) assembler, and then use these transcriptome assemblies as input to PASA.

Introduction

PASA was originally developed at The Institute for Genomic Research in the early 2000s as an effort to automatically improve gene structures in Arabidopsis thaliana. Since then, it has been applied to numerous eukaryotic genome annotation projects, including rice, Aspergillus species, Plasmodium falciparum, Schistosoma mansoni, Aedes aegypti, mouse, and human, among others.

Functions of PASA include:

  • model complete and partial gene structures based on assembled spliced alignments.

  • automatically incorporate gene structures based on transcript alignments into existing gene structure annotations, thereby maintaining annotations consistent with experimental evidence. Annotation updates include

    • annotating untranslated regions (UTRs)
    • exon additions, deletions, boundary adjustments
    • addition of models for alternative splicing variants
    • merging genes
    • splitting genes
    • modeling novel genes
  • map polyadenylation sites to the genome

  • identify antisense transcripts

  • identify and classify all found splicing variations

  • report a likely set of partial and/or full-length protein-coding genes based on transcript alignments for training ab initio gene prediction tools.

PASA is composed of a pipeline of utilities that perform the following ordered set of tasks:

  • cleaning the transcripts
    • The seqclean utility, developed by the TIGR Gene Index group, is used to identify evidence of polyadenylation and strip the poly-A, trim vector, and discard low quality sequences.
  • mapping and aligning transcripts to the genome
    • GMAP and/or BLAT is used to map and align the transcripts to the genome.
  • validating near-perfect alignments
    • PASA uses only near-perfect alignments. Each alignment must meet a specified percent identity (typically 95%) across a specified percentage of the transcript length (typically 90%). Each alignment must also have consensus splice sites at all inferred intron boundaries: a GT or GC donor paired with an AG acceptor, or the AT-AC U12-type dinucleotide pair.
  • Maximal assembly of spliced alignments
    • The valid transcript alignments are clustered based on genome mapping location and assembled into gene structures that include the maximal number of compatible transcript alignments. Compatible alignments are those that have identical gene structures in their region of overlap. The products are termed PASA maximal alignment assemblies. Those assemblies that contain at least one full-length cDNA are termed FL-assemblies; the rest are non-FL-assemblies.
  • Grouping alternatively spliced isoforms
    • Alignment assemblies that map to the same genomic locus, significantly overlap, and are transcribed on the same strand, are grouped into clusters of assemblies.
  • Automatic Genome Annotation
    • Given a set of existing gene structure annotations, which may include the latest annotation for a given genome or the results of a single ab initio gene finder, a comparison to the PASA alignment assemblies is performed. Each alignment assembly is assigned a status identifier based on the results of the annotation comparison. The status identifier indicates whether or not the update is sanctioned as likely to improve the annotation, and the type of update that the assembly provides. There are over 40 different status identifiers, which amount to about 20 kinds of update, since half of the identifiers correspond to FL-assemblies and the other half to non-FL-assemblies.
    • In the absence of any preexisting gene annotations, novel genes and alternative splicing isoforms of novel genes can be modeled.
    • At any time, regardless of any existing annotations, users can obtain candidate gene structures based on the longest open reading frame (ORF) found within each PASA alignment assembly. The output includes a FASTA file for the proteins and a GFF3 file describing the gene structures. This is useful for a previously uncharacterized genome sequence, allowing one to rapidly obtain a set of candidate gene structures for training various ab initio gene prediction programs. In the case of RNA-Seq, PASA can generate a full transcriptome-based genome annotation, identifying likely coding and non-coding transcripts.
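
The alignment validation criteria listed above can be illustrated with a small Python sketch. This is not PASA's own code (PASA is written in Perl and C++); the `Alignment` record, its field names, and the `is_valid_alignment` function are hypothetical, and the thresholds are treated as inclusive minimums, which is an assumption. Only the typical 95% identity and 90% aligned-length values and the accepted splice-site pairs come from the description above.

```python
from dataclasses import dataclass

# Intron boundary dinucleotide pairs (donor, acceptor) named in the text:
# GT-AG, GC-AG, and the U12-type AT-AC.
CONSENSUS_SPLICE_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


@dataclass
class Alignment:
    """Hypothetical record for one spliced transcript alignment."""
    percent_identity: float   # percent identity of the alignment, 0-100
    percent_aligned: float    # percent of the transcript length aligned, 0-100
    intron_boundaries: list   # (donor, acceptor) dinucleotides, one pair per intron


def is_valid_alignment(aln, min_identity=95.0, min_aligned=90.0):
    """Return True if the alignment meets all three validation criteria.

    Assumption: thresholds are inclusive minimums. An alignment with no
    introns passes the splice-site check trivially in this sketch.
    """
    if aln.percent_identity < min_identity:
        return False
    if aln.percent_aligned < min_aligned:
        return False
    return all(
        (donor.upper(), acceptor.upper()) in CONSENSUS_SPLICE_PAIRS
        for donor, acceptor in aln.intron_boundaries
    )


# Example: the first alignment passes; the second has a non-consensus intron.
candidates = [
    Alignment(98.5, 96.0, [("GT", "AG"), ("GC", "AG")]),
    Alignment(99.0, 97.0, [("GT", "CC")]),
]
valid = [aln for aln in candidates if is_valid_alignment(aln)]
```

Keeping the thresholds as parameters mirrors the text's wording that both the percent identity and the percent of transcript length are "specified" values rather than fixed constants.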

Managing Expectations: PASA was incredibly useful to us in the early days of genomics (circa the early 2000s), represents one of the early open source bioinformatics projects, and continues to be useful to many today. However, almost all of the code was written over 20 years ago using methods that were reasonably well suited to the computer architectures available at that time. There is currently no structured active development on PASA, but that could change over time.

PASA in the Context of a Complete Eukaryotic Annotation Pipeline

PASA is only one component of a larger eukaryotic annotation pipeline. Comprehensive genome annotation relies on more than transcript sequence evidence. Not all genes are expressed under the conditions assessed, and some genes are expressed at low levels, which complicates their discovery and proper annotation. Other forms of evidence are required for comprehensive genome annotation, including ab initio gene predictors and homology to proteins previously discovered in other sequenced genomes. A complete annotation pipeline, as implemented at the Broad Institute, involves the following steps:

  • (A) ab initio gene finding using a selection of the following software tools: GeneMarkHMM, FGENESH, Augustus, SNAP, and GlimmerHMM.
  • (B) protein homology detection and intron resolution using the GeneWise software and the uniref90 non-redundant protein database.
  • (C) alignment of known ESTs, full-length cDNAs, and most recently, Trinity RNA-Seq assemblies to the genome.
  • (D) PASA alignment assemblies based on overlapping transcript alignments from step (C)
  • (E) use of EVidenceModeler (EVM) to compute weighted consensus gene structure annotations based on the above (A, B, C, D)
  • (F) use of PASA to update the EVM consensus predictions, adding UTR annotations and models for alternatively spliced isoforms (leveraging D and E).
  • (G) limited manual refinement of genome annotations (F) using Apollo

The following review of eukaryotic genome annotation methods describes in detail the use of PASA in the context of a more complete eukaryotic genome annotation system - see Haas et al., Mycology. 2011 Oct 3;2(3):118-141.

Both applications of PASA are described below: first assembling transcript alignments into PASA alignment assemblies, and then later using those PASA assemblies to update EVM consensus (or other) annotations.

System Overview

PASA runs on UNIX/Linux-based systems (including macOS). PASA includes components written in Perl and C++. Utilities used by PASA, including GMAP, are wrapped by Perl code. Results are provided in summary text files, using standard formats such as GTF, GFF3, BED, FASTA, and others. Results are further available for analysis using the companion suite of Web-based tools and command-line utilities. Running PASA to generate alignment assemblies requires only two inputs: the target genome in FASTA format and the input transcripts (ESTs, de novo RNA-Seq assemblies, etc.) in FASTA format.
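
Since both inputs are plain FASTA files, a quick sanity check before a run can be useful. The Python sketch below is not part of PASA; the function names and the example records are hypothetical. It simply lists record identifiers (the first word after `>`) and reports any that occur more than once, on the assumption that unique identifiers are desirable in both the genome and transcript files.

```python
def fasta_ids(lines):
    """Return record identifiers (first word after '>') from FASTA lines."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


def duplicate_ids(ids):
    """Return a sorted list of identifiers that appear more than once."""
    seen, dups = set(), set()
    for identifier in ids:
        if identifier in seen:
            dups.add(identifier)
        else:
            seen.add(identifier)
    return sorted(dups)


# Example with in-memory records; for a real file, pass an open file handle
# instead, e.g. fasta_ids(open("transcripts.fasta")) (hypothetical file name).
example = [">t1 sample transcript", "ACGTACGT", ">t2", "GGCCTTAA", ">t1", "AACC"]
ids = fasta_ids(example)
repeated = duplicate_ids(ids)
```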

In order to compare the assemblies to existing gene structure annotations, and optionally enhance known structures by adding UTRs, alt-splice variants, and exon adjustments, preexisting gene structure annotations can be provided in GFF3 format, or imported by a user-customized data adapter (described below).

Sample data and a preconfigured complete PASA pipeline are available for demonstration purposes, all included in the software distribution.

Obtaining PASA

Download the latest version of the PASA software from GitHub.

See the wiki tabs to the right for installation instructions and tutorials.

For more info, visit the Tour of the PASA web portal

References

The PASA software and its original application are described in:

  • Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res, 31, 5654-5666.

The use of PASA to analyze polyadenylation signals is described in:

Enhancements to PASA that automate the identification and classification of alternative splicing variations are described here:

Using PASA along with EVidenceModeler in a complete eukaryotic genome annotation pipeline

Earlier work involving the incorporation of RNA-Seq data into gene structure annotation improvements using PASA and the Inchworm component of Trinity: (Note, the new PASA/Trinity process described above is considerably different in execution, but similar in principle. Manuscript in prep.)

Contact Us

Questions, suggestions, or comments?

Join and add discussions at the PASA pipeline users google group: https://groups.google.com/forum/#!forum/pasapipeline-users