signatureEnrichment not assigning all samples to signatures #607

@alexyfyf

Describe the issue
Hi, I found that the signatureEnrichment function doesn't assign every sample in the MAF file to a signature.
I noticed it when running on custom data, and the same problem appears in your vignette as well: https://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html

Command

> laml@summary
                   ID          summary  Mean Median
 1:        NCBI_Build               37    NA     NA
 2:            Center genome.wustl.edu    NA     NA
 3:           Samples              193    NA     NA
 4:            nGenes             1241    NA     NA
 5:   Frame_Shift_Del               52 0.269      0
 6:   Frame_Shift_Ins               91 0.472      0
 7:      In_Frame_Del               10 0.052      0
 8:      In_Frame_Ins               42 0.218      0
 9: Missense_Mutation             1342 6.953      7
10: Nonsense_Mutation              103 0.534      0
11:       Splice_Site               92 0.477      0
12:             total             1732 8.974      9

laml.se = signatureEnrichment(maf = laml, sig_res = laml.sig)
## 
## Signature_1 Signature_2 Signature_3 
##          60          65          63

Note that laml contains 193 samples, but signatureEnrichment assigned only 188 of them (60 + 65 + 63).
I was wondering if I missed something.
Could you help explain this?
My own dataset has 12 samples, but only 6 were assigned, which I assume is the same problem.

Thank you.

Session info

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] maftools_2.4.15

loaded via a namespace (and not attached):
 [1] compiler_4.0.2     Matrix_1.2-18      tools_4.0.2        RColorBrewer_1.1-2
 [5] survival_3.2-3     R.methodsS3_1.8.1  splines_4.0.2      grid_4.0.2        
 [9] data.table_1.13.0  R.utils_2.10.1     R.oo_1.24.0        lattice_0.20-41 

ShixiangWang (Contributor) commented on Sep 11, 2020

@alexyfyf This is because the samples without any SBS mutations are dropped.

> dim(laml.tnm$nmf_matrix)
[1] 188  96

You should check your data with this and report the result.
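
A minimal sketch of how the dropped samples could be listed, assuming the sample barcodes are available as plain character vectors. The helper `dropped_samples` and the toy barcodes are hypothetical and only illustrate the comparison:

```r
# Report samples present in the MAF but absent from the mutation matrix.
# Both arguments are plain character vectors of sample barcodes.
dropped_samples <- function(all_samples, matrix_samples) {
  setdiff(all_samples, matrix_samples)
}

# Toy stand-in data (hypothetical barcodes, not taken from laml)
all_samples    <- c("S1", "S2", "S3", "S4", "S5")
matrix_samples <- c("S1", "S2", "S4")

dropped <- dropped_samples(all_samples, matrix_samples)
print(dropped)
```

With the vignette objects, the two inputs would presumably be `as.character(getSampleSummary(laml)$Tumor_Sample_Barcode)` and `rownames(laml.tnm$nmf_matrix)`, which should account for the 193 - 188 = 5 missing samples.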

PoisonAlien (Owner) commented on Sep 11, 2020

Hi,
I am guessing the mutation load is quite low for your samples? That could lead to samples being excluded.

alexyfyf (Author) commented on Sep 11, 2020

Thank you for the reply. It looks like that's not the cause. I'll post the code here:

Here's my MAF object with 12 samples:

> maf_sub
An object of class  MAF 
                        ID summary     Mean Median
 1:             NCBI_Build  GRCm38       NA     NA
 2:                 Center       .       NA     NA
 3:                Samples      12       NA     NA
 4:                 nGenes    3349       NA     NA
 5:        Frame_Shift_Del      82    6.833    7.0
 6:        Frame_Shift_Ins    4402  366.833  364.5
 7:           In_Frame_Del      20    1.667    2.0
 8:           In_Frame_Ins      35    2.917    3.0
 9:      Missense_Mutation    2337  194.750  196.5
10:      Nonsense_Mutation      18    1.500    1.0
11:       Nonstop_Mutation       6    0.500    0.5
12:            Splice_Site   16964 1413.667 1428.0
13: Translation_Start_Site       3    0.250    0.0
14:                  total   23867 1988.917 1995.0

Here's the tnm, which is also 12 x 96:

tnm <- trinucleotideMatrix(maf_sub, prefix = "chr", add = TRUE,
                           ref_genome = "BSgenome.Mmusculus.UCSC.mm10")
-Extracting 5' and 3' adjacent bases
-Extracting +/- 20bp around mutated bases for background C>T estimation
-Estimating APOBEC enrichment scores
--Performing one-way Fisher's test for APOBEC enrichment
---APOBEC related mutations are enriched in  0 % of samples (APOBEC enrichment score > 2 ;  0  of  12  samples)
-Creating mutation matrix
--matrix of dimension 12x96

> dim(tnm$nmf_matrix)
[1] 12 96

Then I extract 3 signatures:

> signature <- extractSignatures(tnm, n=3)
-Running NMF for factorization rank: 3
-Finished in4.987s elapsed (2.161s cpu)

> dim(signature$contributions)
[1]  3 12

Finally, the enrichment step, where only 6 samples are assigned:

> sigenrich <- signatureEnrichment(maf_sub, signature, minMut = 2)
Running k-means for signature assignment..
Performing pairwise and groupwise comparisions..
Sample size per factor in Signature:

Signature_1 Signature_2 Signature_3 
          2           2           2 

> sigenrich$Signature_Assignment
   Tumor_Sample_Barcode   Signature
1:                   L5 Signature_1
2:                   L6 Signature_1
3:                   L2 Signature_2
4:                   L2 Signature_3
5:                   w3 Signature_2
6:                   w3 Signature_3
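
A quick check on the table above, sketched in base R. The data frame is rebuilt by hand from the printed output, so it is an illustration rather than the real `sigenrich` object:

```r
# Assignment table rebuilt by hand from the output printed above
assignment <- data.frame(
  Tumor_Sample_Barcode = c("L5", "L6", "L2", "L2", "w3", "w3"),
  Signature = c("Signature_1", "Signature_1", "Signature_2",
                "Signature_3", "Signature_2", "Signature_3"),
  stringsAsFactors = FALSE
)

# Number of rows versus number of distinct samples
n_rows   <- nrow(assignment)
n_unique <- length(unique(assignment$Tumor_Sample_Barcode))

# Samples that appear under more than one signature
barcodes <- assignment$Tumor_Sample_Barcode
multi    <- unique(barcodes[duplicated(barcodes)])

print(c(rows = n_rows, distinct_samples = n_unique))
print(multi)
```

Going by the printed table, the 6 rows cover only 4 distinct samples, because L2 and w3 each appear under two signatures. The same kind of check would also catch a result with more rows than samples.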

I also found that if I change the number of signatures to 2, signatureEnrichment works well and assigns 4 and 8 samples to the two signatures, which is correct (12 in total). With 3 signatures, as in the example above, there seems to be something wrong.

Thank you if you could help.

ShixiangWang (Contributor) commented on Sep 11, 2020

This looks like a bug. Could you post your data for debugging?

PoisonAlien (Owner) commented on Sep 11, 2020

I also see that you are using the mouse genome. Are these data from mice? I don't think this should affect the result, but let me see if I can reproduce the issue. As @ShixiangWang suggested, it would help a great deal if you could share your tnm object (as an RDS or RData file).

alexyfyf (Author) commented on Sep 12, 2020

Yes, it's mouse data. I'm not sure how to share the data here.
I put it on Google Drive, and here's the link. Somehow the extension got lost, so you need to download the file, gunzip it, and read it as an RDS file.
Please let me know if you can load the data. Thank you so much.

PoisonAlien (Owner) commented on Sep 14, 2020

Hi @alexyfyf
Thanks for sharing the file. I can reproduce your issue. I will have a look and let you know soon. Sorry, things are quite busy.

ShixiangWang (Contributor) commented on Sep 15, 2020

@PoisonAlien If you need help, @me at any time

PoisonAlien (Owner) commented on Sep 15, 2020

Thanks @ShixiangWang for the offer to help. I will definitely keep it in mind.

Hi @alexyfyf ,
Two points:

  1. Going through your data, I realized that the number of signatures (N = 3) is a bit too low. You should always run estimateSignatures first to find a suitable number of signatures.
sig_est <- estimateSignatures(mat = tnm, nTry = 8,parallel = 6)
-Running NMF for 8 ranks
Compute NMF rank= 2  ... + measures ... OK
Compute NMF rank= 3  ... + measures ... OK
Compute NMF rank= 4  ... + measures ... OK
Compute NMF rank= 5  ... + measures ... OK
Compute NMF rank= 6  ... + measures ... OK
Compute NMF rank= 7  ... + measures ... OK
Compute NMF rank= 8  ... + measures ... OK
-Finished in 54.2s elapsed (12.0s cpu) 

plotCophenetic(res = sig_est)

[Figure: cophenetic correlation plot produced by plotCophenetic]

The plot above suggests 6 would be a good number, since the cophenetic correlation reaches its maximum there.

  2. Regarding the missing samples, it seems k-means is struggling to classify the samples. It may be best not to use the function for now. I will figure out the details. Apologies for the inconvenience.

alexyfyf (Author) commented on Sep 16, 2020

@PoisonAlien
Thank you for explaining this. I did run plotCophenetic, and I initially thought I should choose the smallest number that gives a high score, since I thought a large n would introduce more noise. So I picked 3, which has the second-highest score here. Maybe I'm not understanding this metric very well.
Do you suggest choosing the n with the highest score?

PoisonAlien (Owner) commented on Sep 16, 2020

The idea is to look for the point at which the value reaches its maximum and then drops significantly. Here it could be 4 or 6. You can run both and decide on the number; if you think 6 is overkill, use 4. This is always tricky and never black and white. Hope this helps.
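
As a rough sketch of that rule of thumb in base R: flag the ranks after which the cophenetic correlation drops by more than some threshold. The correlation values and the 0.01 threshold below are made up for illustration and are not the values from this dataset:

```r
# Hypothetical cophenetic correlations for ranks 2..8
# (illustrative numbers only, NOT read from the plot in this thread)
coph <- c(`2` = 0.985, `3` = 0.990, `4` = 0.992, `5` = 0.970,
          `6` = 0.994, `7` = 0.940, `8` = 0.935)

# Decrease in correlation when moving from each rank to the next one
drop_after <- -diff(coph)
names(drop_after) <- head(names(coph), -1)

# Ranks followed by a drop larger than an arbitrary, assumed threshold
threshold  <- 0.01
candidates <- names(drop_after)[drop_after > threshold]
print(candidates)
```

With these made-up numbers, the candidates are ranks 4 and 6. The threshold is a judgment call, so the result is a shortlist to inspect rather than a definitive answer.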

alexyfyf (Author) commented on Sep 17, 2020

Thank you for the explanation.
One more related issue: when I used n = 4 for signature extraction, as suggested, and then ran signatureEnrichment, it returned 13 samples, even though the maf_sub object in the link has only 12.
You can probably reproduce this if you run my data.
The two bugs are likely related, and both point to the k-means step.

PoisonAlien (Owner) commented on Sep 17, 2020

Yes, that is definitely the case. It's better to skip the function for now.
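
One possible stopgap, sketched here as an assumption rather than as what signatureEnrichment does internally: assign each sample to the signature with the largest contribution. This is a different technique from k-means, but it yields exactly one signature per sample. The toy matrix below is hypothetical and only mimics the signatures-by-samples layout of the contributions matrix shown earlier:

```r
# Hypothetical contributions matrix: 3 signatures (rows) x 4 samples (columns)
contrib <- matrix(
  c(0.7, 0.2, 0.1,   # sample A
    0.1, 0.6, 0.3,   # sample B
    0.2, 0.3, 0.5,   # sample C
    0.5, 0.4, 0.1),  # sample D
  nrow = 3,
  dimnames = list(paste0("Signature_", 1:3), c("A", "B", "C", "D"))
)

# For each sample (column), pick the row with the largest contribution
dominant <- rownames(contrib)[apply(contrib, 2, which.max)]

assignment <- data.frame(
  Tumor_Sample_Barcode = colnames(contrib),
  Signature = dominant,
  stringsAsFactors = FALSE
)
print(assignment)
```

Every column produces exactly one row, so the number of assigned samples always matches the number of samples in the matrix. Ties go to the first signature, and samples with near-equal contributions are assigned somewhat arbitrarily, so the labels should be read with care.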

PoisonAlien (Owner) commented on Oct 6, 2020

Hello @alexyfyf
Thanks for reporting the issue. I have decided to drop the function entirely, since I do not have enough time to fix it. The function will now output a warning message advising against using it for any interpretation, and I will gradually remove it from the package. I apologize for the inconvenience. I am closing the issue for now; please feel free to reopen it if necessary.

alexyfyf (Author) commented on Oct 11, 2020

Thank you for letting me know.

signatureEnrichment not assigning all samples to signatures · Issue #607 · PoisonAlien/maftools