
% of reads unmapped: too short is HUGE #169

Closed
@bostanict


I am mapping several collections with STAR and I am extremely satisfied with the speed and results, but for some libraries the number of reads that are unmapped and flagged as too short is huge. As an example, this is one of the log files:

                                 Started job on |   Jul 07 14:50:01
                             Started mapping on |   Jul 07 14:52:00
                                    Finished on |   Jul 07 15:00:20
       Mapping speed, Million of reads per hour |   774.76

                          Number of input reads |   107605440
                      Average input read length |   98
                                    UNIQUE READS:
                   Uniquely mapped reads number |   10642081
                        Uniquely mapped reads % |   9.89%
                          Average mapped length |   97.98
                       Number of splices: Total |   1652808
            Number of splices: Annotated (sjdb) |   0
                       Number of splices: GT/AG |   175313
                       Number of splices: GC/AG |   24635
                       Number of splices: AT/AC |   14999
               Number of splices: Non-canonical |   1437861
                      Mismatch rate per base, % |   2.07%
                         Deletion rate per base |   0.00%
                        Deletion average length |   1.62
                        Insertion rate per base |   0.02%
                       Insertion average length |   1.90
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   7076476
             % of reads mapped to multiple loci |   6.58%
        Number of reads mapped to too many loci |   1614447
             % of reads mapped to too many loci |   1.50%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   12.33%
                 % of reads unmapped: too short |   62.64%
                     % of reads unmapped: other |   7.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

Here, more than half of the collection is considered too short and filtered out. I checked several forums, and in some cases they suggested setting the --sjdbOverhang parameter according to the read length, but I have no annotation for this genome, and STAR does not accept this parameter without the GTF file.
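
When several libraries are affected, it can help to pull the unmapped percentages out of each Log.final.out programmatically. The following is a minimal sketch, assuming the log keeps the `name | value` layout shown above; the function names `parse_final_log` and `percent` are hypothetical helpers, not part of STAR.

```python
def parse_final_log(text):
    """Return a dict mapping each 'name | value' row of Log.final.out to its value string."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNMAPPED READS:' carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '62.64%' to the float 62.64."""
    return float(stats[key].rstrip("%"))


# A few rows copied from the log above.
sample = """\
                          Number of input reads |   107605440
                        Uniquely mapped reads % |   9.89%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   12.33%
                 % of reads unmapped: too short |   62.64%
                     % of reads unmapped: other |   7.06%
"""

stats = parse_final_log(sample)
too_short = percent(stats, "% of reads unmapped: too short")
print(f"unmapped, too short: {too_short:.2f}%")
```

For a real run, the text would come from reading the Log.final.out file that STAR writes next to its other outputs.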

My reads are paired-end. The library with the shortest reads is 2x50 bp with a 4 kb insert size, and the collection above is 2x98 bp with a 20 kb insert size.

I also tried mapping with a lower quality score threshold and with a higher number of allowed mismatches, but the reads seem to be filtered out early on because of their length, possibly before mapping.

Is there any solution to this problem? I have seen many users reporting the same problem with no concrete workaround.

I should add that the library is clean; I also checked for contamination and low-quality sequences.

Thanks a lot

alexdobin (Owner) commented on Jul 12, 2016

Hi @bostanict
the results above seem to contain short reads, with an average sum of the two mates of 98 b.
Are you trimming the reads?
You can increase the number of mapped reads by relaxing the requirements on the mapped length, e.g.: --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3
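
To see what relaxing these two options means in bases, here is a rough sketch of the arithmetic. It assumes what the STAR manual states: both options default to 0.66 and are normalized to the read length, which for paired-end reads is the sum of the mates' lengths. The helper `min_required` is hypothetical, and STAR's internal rounding may differ slightly from this approximation.

```python
def min_required(read_length, fraction):
    """Approximate minimum score or matched bases for a given fraction of the read length."""
    return fraction * read_length


# Hypothetical untrimmed 2x98 pair: the filter sees the sum of both mates.
pair_length = 2 * 98

for fraction in (0.66, 0.3):
    needed = min_required(pair_length, fraction)
    print(f"fraction {fraction}: about {needed:.1f} of {pair_length} bases")
```

With the default fraction, a pair in which only one mate aligns can fall below the threshold even if that mate aligns end to end, which is one way a pair ends up counted as "too short".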

Cheers
Alex

bostanict (Author) commented on Jul 28, 2016

Dear Alex,

Thanks for all the support, here and also via email.

Since I saw this issue in many different places, here are the parameters I used, following Alex's advice, which let me overcome the issue:

--outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 0

I controlled the number of mismatches with --outFilterMismatchNmax 2 and got my preferred results.

I do all my filtering afterwards on the SAM file :)

Thanks a lot~
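
The thread does not say how the post-filtering on the SAM file was done. One possible approach, sketched here under that caveat, is to keep only alignments whose CIGAR string covers a minimum number of aligned bases. The helpers `matched_bases` and `keep_alignment` and the threshold are hypothetical, and this sketch works per SAM record, whereas STAR's own filters apply to the whole read pair.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def matched_bases(cigar):
    """Sum the lengths of M, = and X operations (aligned read bases) in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")


def keep_alignment(sam_line, min_matched):
    """Keep header lines and alignments with at least min_matched aligned bases."""
    if sam_line.startswith("@"):
        return True
    fields = sam_line.rstrip("\n").split("\t")
    cigar = fields[5]  # column 6 of a SAM record is the CIGAR string
    return cigar != "*" and matched_bases(cigar) >= min_matched


# Made-up records for illustration; SEQ and QUAL are left as '*'.
records = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t255\t20M78S\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t500\t255\t40M1000N58M\t*\t0\t0\t*\t*",
]
kept = [line for line in records if keep_alignment(line, min_matched=60)]
print(len(kept), "of", len(records), "lines kept")
```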

Gig77 commented on Dec 15, 2016

I encountered this issue when the mates in the two paired-end input FASTQ files were out of order, i.e. the two mates of a pair were not at the same line in the two files. This leads to many improperly mapped read pairs, which STAR puts into the "too short" bucket.
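
A quick way to test for this is to walk both files in parallel and compare read names. The sketch below assumes plain 4-line FASTQ records with no blank lines, and names that are either identical between mates or differ only by a trailing /1 or /2; `read_name` and `first_mismatch` are hypothetical helpers.

```python
import itertools


def read_name(header):
    """Strip the leading '@', any comment after whitespace, and a trailing /1 or /2."""
    name = header.strip().split()[0][1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(r1_lines, r2_lines):
    """Return the 0-based record index of the first pair whose names differ, or None."""
    r1_headers = itertools.islice(r1_lines, 0, None, 4)  # every 4th line is a header
    r2_headers = itertools.islice(r2_lines, 0, None, 4)
    pairs = itertools.zip_longest(r1_headers, r2_headers)
    for index, (h1, h2) in enumerate(pairs):
        # A missing header means one file has more records than the other.
        if h1 is None or h2 is None or read_name(h1) != read_name(h2):
            return index
    return None


# Tiny made-up example: R2 has the same reads as R1 but in swapped order.
r1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "TTGA", "+", "IIII"]
r2_swapped = ["@b/2", "TCAA", "+", "IIII", "@a/2", "ACGT", "+", "IIII"]
print(first_mismatch(r1, r2_swapped))
```

For real data, the two arguments can be open file handles, for example from `gzip.open(path, "rt")` for compressed files.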

koenvandenberge commented on Mar 23, 2017

@Gig77 Thank you for your comment.
May I ask how you repaired the ordering of the mates in the FASTQ files?
Thanks,
Koen

Gig77 commented on Mar 23, 2017

@koenvandenberge I sorted both FASTQ files by read name, which solved the issue for me. I also reported the issue that caused the unsorted FASTQ files in the first place, but I don't think it is fixed yet (ticket #222).
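
The comment does not name the tool used for sorting. As an illustration only, sorting by read name can be sketched as below; it holds all records in memory, so it suits small files rather than full sequencing runs. `sort_fastq_records` and `sort_key` are hypothetical helpers, and the same 4-line record assumption applies. Running it on both files with the same key puts the mates back on matching lines.

```python
def sort_key(record):
    """Read name without the leading '@', header comment, or trailing /1 or /2."""
    name = record[0].split()[0][1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def sort_fastq_records(lines):
    """Return FASTQ lines with their 4-line records reordered by read name."""
    lines = [line.rstrip("\n") for line in lines]
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    records.sort(key=sort_key)
    return [line for record in records for line in record]


# Made-up records, deliberately out of order.
unsorted_r2 = ["@b/2", "TCAA", "+", "IIII", "@a/2", "ACGT", "+", "IIII"]
sorted_r2 = sort_fastq_records(unsorted_r2)
print(sorted_r2[0], sorted_r2[4])
```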

colinwxl commented on Apr 12, 2018

@alexdobin @bostanict I've encountered this situation too. I am wondering whether it is reasonable to set --outFilterScoreMinOverLread 0 and --outFilterMatchNminOverLread 0 while setting --outFilterMismatchNmax 2. I hope you can provide some explanation. Thanks

alexdobin (Owner) commented on Apr 17, 2018

Hi Colin

could you please open a new issue and explain the problem you are seeing from the beginning?
This issue is too old to re-open it.

Cheers
Alex

skchronicles commented on Jun 8, 2023

I just helped someone with this issue, but it was due to an unrelated reason. A user ran our RNA-seq pipeline with the wrong reference genome selected. They had human samples but they ran the pipeline with a reference genome built using mouse, GRCm38.

I just wanted to comment on this issue as it may save you some headaches later. Before you lose some time debugging/digging into this problem, please make sure you used the correct STAR index for your samples.

Just as a reference, here are some statistics from MultiQC:
[two images: MultiQC statistics]

TLDR:
Make sure you are using the correct STAR index for your samples! This is what happens when you map human samples against a mouse reference genome.

katlande commented on Nov 7, 2023

Another potential source of this problem is one I just ran into: I got a similar result when I concatenated the R1 and R2 FASTQs in the wrong order prior to alignment:

i.e.,
cat lane1_R1.fq.gz lane2_R1.fq.gz > R1.fq.gz
cat lane2_R2.fq.gz lane1_R2.fq.gz > R2.fq.gz

Re-concatenating my R2 as cat lane1_R2.fq.gz lane2_R2.fq.gz > R2.fq.gz brought the % of reads unmapped: too short down from ~96% to ~2%.
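
A small guard before concatenating can catch this kind of mix-up. The sketch below assumes the mates' file names differ only by an `_R1` / `_R2` tag, as in the example above; `same_lane_order` is a hypothetical helper.

```python
def same_lane_order(r1_files, r2_files, r1_tag="_R1", r2_tag="_R2"):
    """True if each R2 file is the mate of the R1 file at the same list position."""
    expected = [name.replace(r1_tag, r2_tag) for name in r1_files]
    return expected == list(r2_files)


r1_files = ["lane1_R1.fq.gz", "lane2_R1.fq.gz"]
wrong_r2 = ["lane2_R2.fq.gz", "lane1_R2.fq.gz"]  # the order that caused the problem
right_r2 = ["lane1_R2.fq.gz", "lane2_R2.fq.gz"]

print(same_lane_order(r1_files, wrong_r2))
print(same_lane_order(r1_files, right_r2))
```

Building both input lists from one sorted list of lane names avoids the problem altogether, since R1 and R2 then share the same order by construction.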

          % of reads unmapped: too short is HUGE · Issue #169 · alexdobin/STAR