Skip to content

Model Fit vs k-mer size #32

Open
Open
@bmansfeld

Description

@bmansfeld

Hi Mike,
Hope all is well with you
I'm revisiting our GenomeScope analyses with cleaner reads and was curious about a trend I see in many papers. As well as looking at other k-mer based assembly evaluation tools such as Merqury

It seems that in lieu of knowing what size k-mer to use, many people (self included) run Jellyfish with different k's and after using GenomeScope to estimate their genome parameters use the Model fit reported by GS to select the optimal value for k.

The fit values usually range within the 90's so I wonder how useful/accurate this approach is...

Furthermore, it seems that in some cases people report moderately different results from GS using different k's.

In my case its rather marginal to the final result (704 vs 711Mb size; 1.4 vs 1.33% het) , but using different ks results in visibly different histograms.
k=21 - http://qb.cshl.edu/genomescope/analysis.php?code=IKX3YGi4tZUIurkadcZO
k=31 - http://qb.cshl.edu/genomescope/analysis.php?code=Dzsv8ydHbwFiRgfkmyOe

Which makes me wonder if there is a "correct" value for k? I've done some searching but answers are confusing.

The tool I mention above (Merqury) has bash script to calculate best k which seems a rather simple equation based on Fofanov et al (2004).

Software like kmergenie estimate the best kmer length for assembly by finding the max distinct kmers. Is this appropriate for size and heterzygosity estimation?
In any case any advice would be welcome. How should one find the best k for running GS?
Thanks in advance,
Ben

In general do you have best practice re

Activity

mschatz

mschatz commented on May 7, 2020

@mschatz
Contributor
bmansfeld

bmansfeld commented on May 8, 2020

@bmansfeld
Author

Mike as usual thanks for the quick and informative reply. The example helps understand a lot.
Oh and I agree I think these numbers are all fair estimates in both values of k.

I noticed your kmer
peaks are truncated at 100,000 which probably contributes to some of this
variance, especially for the genome size. If you increase this to 1,000,000
I would be the values to be even closer.

I'm not sure I understand your meaning about truncated peaks. I've included an upper count of 1e6 in jellyfish and set that as max kmer coverage in GS as well. Am I confusing these terms?

Yes, Merqury suggested a k-mer of 20 as optimal for our genome size whether diploid or haploid.
But looking at the distributions I noticed that the 1x-het-peak looks shorter with that value, and was wondering if we were excluding potential hets. As you say, larger k more unique kmers.

So, where does the model fit... um fit? It seems arbitrary to me to use this as a "kmer selection method"? What are your thoughts?

Thanks again,
Ben

mschatz

mschatz commented on May 11, 2020

@mschatz
Contributor
bmansfeld

bmansfeld commented on May 13, 2020

@bmansfeld
Author

Thanks mike! I'll re-read the supplement.

ayerdk

ayerdk commented on Mar 26, 2025

@ayerdk

Thank you all the experts! When experienced persons talk, listeners or readers get the benefit of knowledge at no cost! Please keep sharing the knowledge infinitely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @mschatz@bmansfeld@ayerdk

        Issue actions

          Model Fit vs k-mer size · Issue #32 · schatzlab/genomescope