Skip to content

MikeAxtell/GSTAr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

2c96ca4 · Mar 22, 2014

History

1 Commit
Mar 22, 2014
Mar 22, 2014
Mar 22, 2014

Repository files navigation

LICENSE
    GSTAr.pl

    Copyright (c) 2013 Michael J. Axtell

    This program is free software: you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation, either version 3 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program. If not, see <http://www.gnu.org/licenses/>.

SYNOPSIS
    GSTAr : Generic Small RNA-Transcriptome Aligner.

    Flexible RNAplex-based alignment of miRNAs and siRNAs (15-26 nts) to a
    transcriptome.

AUTHOR
    Michael J. Axtell, Penn State University, mja18@psu.edu

VERSION
    1.0 : September 17, 2013

INSTALL
  Dependencies

            perl
            RNAplex (from Vienna RNA package)

    RNAplex must be executable from your PATH. GSTAr.pl was developed using
    RNAplex version RNAplex 2.1.3. It has not been tested on other versoins
    of RNAplex.

  Installation

    No "real" installation. If the script is in your working directory, you
    can call it with

            ./GSTAr.pl

    For convenience, you can add it to your PATH. e.g.

            sudo mv GSTAr.pl /usr/bin/

    GSTAr.pl expects to find perl in /usr/bin/perl .. if not, edit line 1
    (the hashbang) accordingly.

USAGE
            GSTAr.pl [options] queries.fasta transcriptome.fasta

    Output alignments go to STDOUT, and can be redirected to a file with >
    or piped to another process with |

    Log and progress information goes to STDERR, and can be suppressed with
    option -q (quiet mode).

  Options

    Options:

    -h Print help message and quit

    -v Print version and quit

    -q Quiet mode .. no log/progress information to STDERR

    -t Tabular output format ... More suitable for parsing

    -a Sort by Allen et al. score instead of the default MFEratio

    -r [float >0..1] Minimum Free Energy Ratio cutoff. Default: 0.65

METHODS
    GSTAr.pl is essentially a wrapper and parser for RNAplex desgined for
    aligning short queries (15-26nts) against the rev-comp. strand of a
    eukaryotic transcriptome. For each query sequence, the minimum free
    energy of a perfectly complelementary sequence is calculated using
    RNAplex under all default parameters. Following this, the same query is
    then analyzed against the entire transcriptome. Hits where the MFEratio
    (i.e. MFE / MFE-perfect) is >= the cutoff established by option r are
    retained and parsed.

    The detailed RNAplex parameters (see RNAplex man page for details) for
    each query analysis are

            -f 2 : Fast mode .. structure based on approximated plex model.
            -e [minMFEratio * perfectMFE] : Minimum acceptable MFE value to keep a hit
            -z : [10 + query_length] : Acceptable alignments can span no more than length of query + 10nts.
        
    Slice Site is the transcript nt opposite nt 10 of the query. This is
    where one should look to find evidence of AGO-catalyzed slicing in the
    event that a) the transcript was really a target of the query at that
    site, and b) it really was sliced. GSTAr makes no judgements on the
    likelihood of either of those events, and the recording of the Slice
    Site position should NOT be taken as evidence that slicing exists or is
    even possible at that site.

    For a given query, all returned sites are non-redundant. Redunancy is
    based upon the putative slice site only. By default, the output is
    sorted in descending order (best to worst) according to the MFE ratio.
    In the alternative option -a mode, the results are instead sorted in
    ascending order (best to worst) according to the Allen et al. score.

    Allen et al. score: This is a score for plant miRNA/siRNA-target
    interactions based on the position-specific penalties described by Allen
    et al. (2005) Cell, 121:207-221 [PMID: 15851028]. Specifically,
    mismatched query bases or target-bulged bases, are penalized 1. G-U
    wobbles are penalized 0.5. These penalties are double within positions
    2-13 of the query.

WARNINGS
  NOT a target predictor

    GSTAr is very explicitly NOT a target predictor for miRNAs or siRNAs. It
    is only an aligner based on RNA-RNA hybridization thermodynamic
    predictions. Users should make no claims as to whether the identified
    alignments are actually targets of the query without independent data of
    some sort.

  Slice Sites are NOT predictions of slicing

    Although GSTAr reports a "Slicing Site" position for each alignment,
    this is merely for conveneince when using GSTAr alignments to guide
    subsequent experiments searching for AGO-catalyzed slicing evidence. No
    claim is made that any alignment is actually AGO-cleaved or even
    theoretically AGO-cleavable.

  Not for whole genomes

    GSTAr holds the entire contents of the transcripts.fasta file in memory
    to speed the isolation of sub-sequences. This will be impractical in
    terms of memory usage if a user attempts to load a whole genome.
    Similarly, GSTAr will only search for pairing between the top strand of
    the transcripts.fasta file, making it also impractical for a genome
    analysis, where sites might be on either strand.

  Temp files

    GSTAr writes temp files to the working directory. Their contents change
    dynamically during a run, and they will be deleted at the end of a run.
    So, don't mess with them during a run. In addition, it is a very bad
    idea to have two GSTAr runs operating concurrently from the same working
    directory because there will be clashes and overwrites for these temp
    files.

  Not too fast

    GSTAr uses RNAplex (Tafer and Hofacker, 2008. Bioinformatics 24:2657-63,
    PMID: 18434344, doi:10.1093/bioinformatics/btn193), which is
    exceptionally fast for an inter-molecular RNA-RNA hybridization
    calculator. However, when applied to entire eukaryotic transcriptomes
    the CPU time per query is still significant. Run time is only slightly
    affected (much less than 2-fold) by the setting of -r. Setting tabular
    mode (option -t) also increases speed just a tiny bit for runs with a
    low option -r. In tests with the Arabidopsis transcriptome (33,602
    mRNAs, total nts=51,074,197), a single 21nt miRNA query typically takes
    about 90-110 seconds to complete.

  No ambiguity codes

    Query sequences with characters other than A, T, U, C, or G
    (case-insensitive) will not be analyzed, and a warning will be sent to
    the user. Transcript sub-sequences for potential alignments will be
    *silently* ignored if they contain any characters other than A, T, U,
    C,or G (case-insensitive).

  Small queries

    Query sequences must be small (between 15 and 26nts). Queries that don't
    meet these size requirements will not be analyzed and a warning sent to
    the user.

  Redundant output

    GSTAr guarantees that, FOR A GIVEN QUERY, the returned alignments will
    be unique in terms of their PUTATIVE SLICING SITE POSITION. However, the
    same query could generate multiple overlapping aligments that each have
    different putative slice sites. Furthermore, if different queries in a
    multi-query analysis have similar (or identical !) sequences, the same
    alignment position (based on putative slicing site position) could be
    returned multiple times, once for each of the similar/identical queries.
    Therefore, for multi-query result files, there is no guarantee of
    non-redundancy among the returned sites.

OUTPUT
    Both output formats begin with a series of commented lines that provide
    details about the run.

  Pretty output

    The default output "pretty" mode is easily parsed by humans and should
    be self-explanatory

  Tabular output

    Tabular output (which occurs when option -t is specified) is a
    tab-delimited text file. The first non-commented line is the column
    headings, with meanings as follows:

    1: Query: Name of query

    2: Transcript: Name of transcript

    3: TStart: One-based start position of the alignment within the
    transcript

    4: TStop: One-based stop position of the alignment within the transcript

    5: TSLice: One-based position of the alignment opposite position 10 of
    the query

    6: MFEperfect: Minimum free energy of a perfectly matched site
    (approximate)

    7: MFEsite: Minimum free energy of the alignment in question

    8: MFEratio: MFEsite / MFEperfect

    9: AllenScore: Penalty score calculated per Allen et al. (2005) Cell,
    121:207-221 [PMID: 15851028].

    10: Paired: String representing paired positions in the query and
    transcript. The format is Query5'-Query3',Transcript3'-Transcript5'.
    Positions are one-based. Discrete blocks of pairing are separated by ;

    11: Unpaired: String representing unpaired positions in the query and
    transcript. The format is
    Query5'-Query3',Transcript3'-Transcript5'[code]. Possible codes are
    "UP5" (Unpaired region at 5' end of query), "UP3" (Unpaired region at 3'
    end of query), "SIL" (symmetric internal loop), "AILt" (asymmetric
    internal loop with more unpaired nts on the transcript side), "AILq"
    (asymmetric internal loop with more unpaired nts on the query side),
    "BULt" (Bulged on the transcript side), or "BULq" (bulged on the query
    side). Positions are one-based. Discrete blocks of pairing are separated
    by ;

    12: Structure: Aligned secondary structure. The region before the "&"
    represents the transcript, 5'-3', while the region after the "&"
    represents the query, 5'-3'. "(" represents a transcript base that is
    paired, ")" represents a query based that is paired, "." represents an
    unpaired base, and "-" represents a gap inserted to facilitate
    alignment.

    13: Sequence: Aligned sequence. The region before the "&" represents the
    transcript, 5'-3', while the region after the "&" represents the query,
    5'-3'.

About

GSTAr: Generic Small RNA-Transcriptome Aligner

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages