Description
Hi Mike,
Hope all is well with you
I'm revisiting our GenomeScope analyses with cleaner reads and was curious about a trend I see in many papers. As well as looking at other k-mer based assembly evaluation tools such as Merqury
It seems that in lieu of knowing what size k-mer to use, many people (self included) run Jellyfish with different k's and after using GenomeScope to estimate their genome parameters use the Model fit reported by GS to select the optimal value for k.
The fit values usually range within the 90's so I wonder how useful/accurate this approach is...
Furthermore, it seems that in some cases people report moderately different results from GS using different k's.
In my case its rather marginal to the final result (704 vs 711Mb size; 1.4 vs 1.33% het) , but using different ks results in visibly different histograms.
k=21 - http://qb.cshl.edu/genomescope/analysis.php?code=IKX3YGi4tZUIurkadcZO
k=31 - http://qb.cshl.edu/genomescope/analysis.php?code=Dzsv8ydHbwFiRgfkmyOe
Which makes me wonder if there is a "correct" value for k? I've done some searching but answers are confusing.
The tool I mention above (Merqury) has bash script to calculate best k which seems a rather simple equation based on Fofanov et al (2004).
Software like kmergenie estimate the best kmer length for assembly by finding the max distinct kmers. Is this appropriate for size and heterzygosity estimation?
In any case any advice would be welcome. How should one find the best k for running GS?
Thanks in advance,
Ben
In general do you have best practice re
Activity
mschatz commentedon May 7, 2020
bmansfeld commentedon May 8, 2020
Mike as usual thanks for the quick and informative reply. The example helps understand a lot.
Oh and I agree I think these numbers are all fair estimates in both values of k.
I'm not sure I understand your meaning about truncated peaks. I've included an upper count of 1e6 in jellyfish and set that as max kmer coverage in GS as well. Am I confusing these terms?
Yes, Merqury suggested a k-mer of 20 as optimal for our genome size whether diploid or haploid.
But looking at the distributions I noticed that the 1x-het-peak looks shorter with that value, and was wondering if we were excluding potential hets. As you say, larger k more unique kmers.
So, where does the model fit... um fit? It seems arbitrary to me to use this as a "kmer selection method"? What are your thoughts?
Thanks again,
Ben
mschatz commentedon May 11, 2020
bmansfeld commentedon May 13, 2020
Thanks mike! I'll re-read the supplement.
ayerdk commentedon Mar 26, 2025
Thank you all the experts! When experienced persons talk, listeners or readers get the benefit of knowledge at no cost! Please keep sharing the knowledge infinitely.