Description
I am mapping several collections with STAR and I am extremely satisfied with the speed and the results, but for some libraries the number of reads unmapped and flagged as "too short" is huge. As an example, this is one of the log files:
Started job on | Jul 07 14:50:01
Started mapping on | Jul 07 14:52:00
Finished on | Jul 07 15:00:20
Mapping speed, Million of reads per hour | 774.76
Number of input reads | 107605440
Average input read length | 98
UNIQUE READS:
Uniquely mapped reads number | 10642081
Uniquely mapped reads % | 9.89%
Average mapped length | 97.98
Number of splices: Total | 1652808
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 175313
Number of splices: GC/AG | 24635
Number of splices: AT/AC | 14999
Number of splices: Non-canonical | 1437861
Mismatch rate per base, % | 2.07%
Deletion rate per base | 0.00%
Deletion average length | 1.62
Insertion rate per base | 0.02%
Insertion average length | 1.90
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 7076476
% of reads mapped to multiple loci | 6.58%
Number of reads mapped to too many loci | 1614447
% of reads mapped to too many loci | 1.50%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 12.33%
% of reads unmapped: too short | 62.64%
% of reads unmapped: other | 7.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
More than half of the collection is considered too short and filtered out. I checked several forums, and in some cases they suggested setting the --sjdbOverhang parameter according to the read length, but I have no annotation for this genome and STAR does not accept this parameter without a GTF file.
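For reference, --sjdbOverhang is applied at the genome-indexing step together with annotation, which is why it is rejected without a GTF; a sketch of the indexing command it belongs to (paths here are placeholders):
STAR --runMode genomeGenerate \
     --genomeDir /path/to/star_index \
     --genomeFastaFiles genome.fa \
     --sjdbGTFfile annotation.gtf \
     --sjdbOverhang 97   # usually read length - 1, e.g. 97 for 2x98 bp reads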
My reads are paired-end; the library with the shortest reads is 2x50 bp with a 4 kb insert size, and the collection above is 2x98 bp with a 20 kb insert size.
I also tried mapping with a lower quality-score threshold and a higher allowed number of mismatches, but it seems the reads are filtered out in the initial steps, before mapping, because of their length.
Is there any solution to this problem? I have seen many different users with the same problem and no concrete way around it.
I should also say that the library is clean; I checked for contamination and low-quality sequences.
Thanks a lot
Activity
alexdobin commented on Jul 12, 2016
Hi @bostanict
The above results seem to indicate short reads: the average summed length of the two mates is only 98 b.
Are you trimming the reads?
You can increase the number of mapped reads by relaxing the requirements on the mapped length, e.g.: --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3
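For reference, a minimal sketch of a full mapping command with these settings; the index path, FASTQ names, and thread count are placeholders:
# Paired-end STAR run with relaxed mapped-length filters:
STAR --runThreadN 8 \
     --genomeDir /path/to/star_index \
     --readFilesIn R1.fq.gz R2.fq.gz \
     --readFilesCommand zcat \
     --outFilterScoreMinOverLread 0.3 \
     --outFilterMatchNminOverLread 0.3 \
     --outFileNamePrefix relaxed_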
Cheers
Alex
bostanict commented on Jul 28, 2016
Dear Alex,
Thanks for all the support, here and also via email.
Since I have seen this issue in many different places, here are the parameters I used, as Alex advised, which let me overcome it:
--outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 0
I controlled the number of mismatches with --outFilterMismatchNmax 2 and got my preferred results.
I do all my filtering afterwards on the SAM file :)
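In case it helps others, a sketch of how the permissive mapping and a post-hoc filter fit together; the file names and the MAPQ threshold are placeholders, not my exact pipeline:
# Map with the length filters effectively disabled, capping mismatches at 2:
STAR --runThreadN 8 \
     --genomeDir /path/to/star_index \
     --readFilesIn R1.fq.gz R2.fq.gz \
     --readFilesCommand zcat \
     --outFilterScoreMinOverLread 0 \
     --outFilterMatchNminOverLread 0 \
     --outFilterMatchNmin 0 \
     --outFilterMismatchNmax 2 \
     --outFileNamePrefix permissive_
# Then filter the SAM afterwards, e.g. keep properly paired reads with MAPQ >= 30:
samtools view -b -f 2 -q 30 permissive_Aligned.out.sam > filtered.bam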
Thanks a lot!
Gig77 commented on Dec 15, 2016
I encountered this issue when the mates in the two paired-end input FASTQ files were out of order, i.e., the two mates of a pair were not found at the same line in the two files. This leads to a lot of improperly mapped read pairs, which STAR throws into the "too short" bucket.
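A quick way to check for this, assuming standard 4-line FASTQ records (the sed strips optional /1 and /2 mate suffixes); any output from diff means the files are out of sync:
diff <(zcat R1.fq.gz | awk 'NR%4==1 {print $1}' | sed 's|/[12]$||') \
     <(zcat R2.fq.gz | awk 'NR%4==1 {print $1}' | sed 's|/[12]$||') | head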
What does "% of reads unmapped: too short" mean (short reads or short part of the read unaligned)? #164
koenvandenberge commented on Mar 23, 2017
@Gig77 Thank you for your comment.
May I ask how you repaired the ordering of the mates in the FASTQ files?
Thanks,
Koen
Gig77 commented on Mar 23, 2017
@koenvandenberge I sorted both FASTQ files by read name, which solved the issue for me. I also reported the issue that caused the unsorted FASTQ files in the first place, but I don't think it is fixed yet (ticket #222).
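If it helps, the name sort can be done with standard tools, assuming 4-line FASTQ records: flatten each record onto one tab-separated line, sort by the name field, then restore the layout (run on both mate files):
zcat R1.fq.gz | paste - - - - | sort -k1,1 | tr '\t' '\n' | gzip > R1.sorted.fq.gz
zcat R2.fq.gz | paste - - - - | sort -k1,1 | tr '\t' '\n' | gzip > R2.sorted.fq.gz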
colinwxl commented on Apr 12, 2018
@alexdobin @bostanict I have encountered this situation too. I am wondering whether it is reasonable to set --outFilterScoreMinOverLread 0 and --outFilterMatchNminOverLread 0 while setting --outFilterMismatchNmax 2. Hope you can offer some explanation. Thanks
alexdobin commented on Apr 17, 2018
Hi Colin
Could you please open a new issue and explain the problem you are seeing from the beginning?
This issue is too old to reopen.
Cheers
Alex
skchronicles commented on Jun 8, 2023
I just helped someone with this issue, but it was due to an unrelated reason. A user ran our RNA-seq pipeline with the wrong reference genome selected. They had human samples but they ran the pipeline with a reference genome built using mouse, GRCm38.
I just wanted to comment on this issue as it may save you some headaches later. Before you lose some time debugging/digging into this problem, please make sure you used the correct STAR index for your samples.
Just as a reference, here are some statistics from MultiQC:
[screenshot: MultiQC STAR alignment statistics]
TLDR:
Make sure you are using the correct STAR index for your samples! This is what happens when you map human samples against a mouse reference genome.
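If you want a quick sanity check, the index directory itself records what it was built from (the index path here is a placeholder):
# chrName.txt lists the reference sequence names in the index:
head /path/to/star_index/chrName.txt
# genomeParameters.txt records the FASTA (and GTF, if any) used at build time:
grep -E 'genomeFastaFiles|sjdbGTFfile' /path/to/star_index/genomeParameters.txt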
katlande commented on Nov 7, 2023
Another potential source of this problem is one I just ran into. I got a similar result when I concatenated my R1 and R2 FASTQs with the lanes in mismatched order prior to alignment, i.e.:
cat lane1_R1.fq.gz lane2_R1.fq.gz > R1.fq.gz
cat lane2_R2.fq.gz lane1_R2.fq.gz > R2.fq.gz
Re-concatenating my R2 as cat lane1_R2.fq.gz lane2_R2.fq.gz > R2.fq.gz brought the "% of reads unmapped: too short" down from ~96% to ~2%.