GitHub - MikeAxtell/GSTAr: GSTAr: Generic Small RNA-Transcriptome Aligner

1 Branch 1 Tag

Name	Name	Last commit message	Last commit date
Latest commit MikeAxtell GSTAr version 1.0 Mar 22, 2014 2c96ca4 · Mar 22, 2014 History 1 Commit
GSTAr.pl	GSTAr.pl	GSTAr version 1.0	Mar 22, 2014
LICENSE	LICENSE	GSTAr version 1.0	Mar 22, 2014
README	README	GSTAr version 1.0	Mar 22, 2014

Repository files navigation

LICENSE
GSTAr.pl

This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program. If not, see <http://www.gnu.org/licenses/>.

SYNOPSIS
GSTAr : Generic Small RNA-Transcriptome Aligner.

Flexible RNAplex-based alignment of miRNAs and siRNAs (15-26 nts) to a
transcriptome.

AUTHOR
Michael J. Axtell, Penn State University, mja18@psu.edu

VERSION
1.0 : September 17, 2013

INSTALL
Dependencies

perl
RNAplex (from Vienna RNA package)

RNAplex must be executable from your PATH. GSTAr.pl was developed using
RNAplex version RNAplex 2.1.3. It has not been tested on other versoins
of RNAplex.

Installation

No "real" installation. If the script is in your working directory, you
can call it with

./GSTAr.pl

For convenience, you can add it to your PATH. e.g.

sudo mv GSTAr.pl /usr/bin/

GSTAr.pl expects to find perl in /usr/bin/perl .. if not, edit line 1
(the hashbang) accordingly.

USAGE
GSTAr.pl [options] queries.fasta transcriptome.fasta

Output alignments go to STDOUT, and can be redirected to a file with >
or piped to another process with |

Log and progress information goes to STDERR, and can be suppressed with
option -q (quiet mode).

Options

Options:

-h Print help message and quit

-v Print version and quit

-q Quiet mode .. no log/progress information to STDERR

-t Tabular output format ... More suitable for parsing

-a Sort by Allen et al. score instead of the default MFEratio

-r [float >0..1] Minimum Free Energy Ratio cutoff. Default: 0.65

METHODS
GSTAr.pl is essentially a wrapper and parser for RNAplex desgined for
aligning short queries (15-26nts) against the rev-comp. strand of a
eukaryotic transcriptome. For each query sequence, the minimum free
energy of a perfectly complelementary sequence is calculated using
RNAplex under all default parameters. Following this, the same query is
then analyzed against the entire transcriptome. Hits where the MFEratio
(i.e. MFE / MFE-perfect) is >= the cutoff established by option r are
retained and parsed.

The detailed RNAplex parameters (see RNAplex man page for details) for
each query analysis are

-f 2 : Fast mode .. structure based on approximated plex model.
-e [minMFEratio * perfectMFE] : Minimum acceptable MFE value to keep a hit
-z : [10 + query_length] : Acceptable alignments can span no more than length of query + 10nts.

Slice Site is the transcript nt opposite nt 10 of the query. This is
where one should look to find evidence of AGO-catalyzed slicing in the
event that a) the transcript was really a target of the query at that
site, and b) it really was sliced. GSTAr makes no judgements on the
likelihood of either of those events, and the recording of the Slice
Site position should NOT be taken as evidence that slicing exists or is
even possible at that site.

For a given query, all returned sites are non-redundant. Redunancy is
based upon the putative slice site only. By default, the output is
sorted in descending order (best to worst) according to the MFE ratio.
In the alternative option -a mode, the results are instead sorted in
ascending order (best to worst) according to the Allen et al. score.

Allen et al. score: This is a score for plant miRNA/siRNA-target
interactions based on the position-specific penalties described by Allen
et al. (2005) Cell, 121:207-221 [PMID: 15851028]. Specifically,
mismatched query bases or target-bulged bases, are penalized 1. G-U
wobbles are penalized 0.5. These penalties are double within positions
2-13 of the query.

WARNINGS
NOT a target predictor

GSTAr is very explicitly NOT a target predictor for miRNAs or siRNAs. It
is only an aligner based on RNA-RNA hybridization thermodynamic
predictions. Users should make no claims as to whether the identified
alignments are actually targets of the query without independent data of
some sort.

Slice Sites are NOT predictions of slicing

Although GSTAr reports a "Slicing Site" position for each alignment,
this is merely for conveneince when using GSTAr alignments to guide
subsequent experiments searching for AGO-catalyzed slicing evidence. No
claim is made that any alignment is actually AGO-cleaved or even
theoretically AGO-cleavable.

Not for whole genomes

GSTAr holds the entire contents of the transcripts.fasta file in memory
to speed the isolation of sub-sequences. This will be impractical in
terms of memory usage if a user attempts to load a whole genome.
Similarly, GSTAr will only search for pairing between the top strand of
the transcripts.fasta file, making it also impractical for a genome
analysis, where sites might be on either strand.

Temp files

GSTAr writes temp files to the working directory. Their contents change
dynamically during a run, and they will be deleted at the end of a run.
So, don't mess with them during a run. In addition, it is a very bad
idea to have two GSTAr runs operating concurrently from the same working
directory because there will be clashes and overwrites for these temp
files.

Not too fast

GSTAr uses RNAplex (Tafer and Hofacker, 2008. Bioinformatics 24:2657-63,
PMID: 18434344, doi:10.1093/bioinformatics/btn193), which is
exceptionally fast for an inter-molecular RNA-RNA hybridization
calculator. However, when applied to entire eukaryotic transcriptomes
the CPU time per query is still significant. Run time is only slightly
affected (much less than 2-fold) by the setting of -r. Setting tabular
mode (option -t) also increases speed just a tiny bit for runs with a
low option -r. In tests with the Arabidopsis transcriptome (33,602
mRNAs, total nts=51,074,197), a single 21nt miRNA query typically takes
about 90-110 seconds to complete.

No ambiguity codes

Query sequences with characters other than A, T, U, C, or G
(case-insensitive) will not be analyzed, and a warning will be sent to
the user. Transcript sub-sequences for potential alignments will be
*silently* ignored if they contain any characters other than A, T, U,
C,or G (case-insensitive).

Small queries

Query sequences must be small (between 15 and 26nts). Queries that don't
meet these size requirements will not be analyzed and a warning sent to
the user.

Redundant output

GSTAr guarantees that, FOR A GIVEN QUERY, the returned alignments will
be unique in terms of their PUTATIVE SLICING SITE POSITION. However, the
same query could generate multiple overlapping aligments that each have
different putative slice sites. Furthermore, if different queries in a
multi-query analysis have similar (or identical !) sequences, the same
alignment position (based on putative slicing site position) could be
returned multiple times, once for each of the similar/identical queries.
Therefore, for multi-query result files, there is no guarantee of
non-redundancy among the returned sites.

OUTPUT
Both output formats begin with a series of commented lines that provide
details about the run.

Pretty output

The default output "pretty" mode is easily parsed by humans and should
be self-explanatory

Tabular output

Tabular output (which occurs when option -t is specified) is a
tab-delimited text file. The first non-commented line is the column
headings, with meanings as follows:

1: Query: Name of query

2: Transcript: Name of transcript

3: TStart: One-based start position of the alignment within the
transcript

4: TStop: One-based stop position of the alignment within the transcript

5: TSLice: One-based position of the alignment opposite position 10 of
the query

6: MFEperfect: Minimum free energy of a perfectly matched site
(approximate)

7: MFEsite: Minimum free energy of the alignment in question

8: MFEratio: MFEsite / MFEperfect

9: AllenScore: Penalty score calculated per Allen et al. (2005) Cell,
121:207-221 [PMID: 15851028].

10: Paired: String representing paired positions in the query and
transcript. The format is Query5'-Query3',Transcript3'-Transcript5'.
Positions are one-based. Discrete blocks of pairing are separated by ;

11: Unpaired: String representing unpaired positions in the query and
transcript. The format is
Query5'-Query3',Transcript3'-Transcript5'[code]. Possible codes are
"UP5" (Unpaired region at 5' end of query), "UP3" (Unpaired region at 3'
end of query), "SIL" (symmetric internal loop), "AILt" (asymmetric
internal loop with more unpaired nts on the transcript side), "AILq"
(asymmetric internal loop with more unpaired nts on the query side),
"BULt" (Bulged on the transcript side), or "BULq" (bulged on the query
side). Positions are one-based. Discrete blocks of pairing are separated
by ;

12: Structure: Aligned secondary structure. The region before the "&"
represents the transcript, 5'-3', while the region after the "&"
represents the query, 5'-3'. "(" represents a transcript base that is
paired, ")" represents a query based that is paired, "." represents an
unpaired base, and "-" represents a gap inserted to facilitate
alignment.

13: Sequence: Aligned sequence. The region before the "&" represents the
transcript, 5'-3', while the region after the "&" represents the query,
5'-3'.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

About

Uh oh!

Releases 1

Packages

Uh oh!

Languages

License

MikeAxtell/GSTAr

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Languages

Packages