Parameter tweaking for assembling heterozygous sample #201
The drop in coverage from correction is expected, since correction will only correct the longest 40X of data, which is being reduced to 26X. I think what you'd want to do is use the best overlaps for correction and correct as much data as possible. To do that I'd set `corOutCoverage=100 corMaxEvidenceErate=0.15 corMhapSensitivity=normal`. What do the sizes look like in your unitig outputs (005log and 009log; you can run tail on both files)? Based on that, it's possible to adjust the unitigging parameters to merge more aggressively.
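A minimal sketch of re-running correction with those options, assuming a standard canu invocation on raw PacBio reads; the output prefix, directory, genome size, and read file below are placeholders, and the exact log file names under 4-unitigger vary by run and version:

```sh
# Hypothetical correction re-run with more permissive evidence settings;
# prefix, directory, genome size, and input file are placeholders.
canu -correct \
  -p asm -d asm-recorrect \
  genomeSize=300m \
  corOutCoverage=100 \
  corMaxEvidenceErate=0.15 \
  corMhapSensitivity=normal \
  -pacbio-raw raw_reads.fastq

# Inspect the unitig size summaries mentioned above (exact names vary by run):
tail 4-unitigger/*005* 4-unitigger/*009*
```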
Thanks for the quick reply @skoren!
I don't have a 009*log. I used
If you're more-or-less up to date with the GitHub code, the sizes are reported in *.sizes files.
Do you have the other sizes files (004 and 008)? It would also be interesting to see 001.filterOverlaps.thr000.num000.log. The next question is whether you want to try to separate out the variation, or smash it and assemble the consensus of all the heterozygosity.
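If it helps to locate those files, a quick way to find them is sketched below; the directory layout and file name patterns are assumptions and depend on the assembly prefix and canu version:

```sh
# Find the per-stage size summaries and the best-overlap filtering log
# (names and locations are assumptions; adjust for your output directory).
find . -name "*.sizes"
find . -name "*filterOverlaps*"
tail *004*.sizes *008*.sizes   # run from inside the unitigger directory
```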
I would like to try to tease out the separate individuals in this population, but that might not produce a good assembly for any one of them. It's definitely worth trying, though. If that does not work, we would just have to take the other route and smash it all into one consensus assembly.
First, you can try improving the current unitigging, which should be fast. Copy the 4-unitigger/unitigger.sh file to a new folder (say 4-test) and edit the -T and -o options to point to 4-test instead of 4-unitigger. Then lower the standard deviation allowed for assembly and drop the repeat-breaking thresholds, along the lines of the sketch below.
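A rough illustration of the kind of edit meant here, assuming the script wraps a single bogart call; the stores, prefix, and values shown are assumptions and will differ in your run, so copy your own unitigger.sh and change only the options noted in the comments:

```sh
#!/bin/sh
# 4-test/unitigger.sh -- hypothetical edited copy of 4-unitigger/unitigger.sh.
# Stores, prefix, and values are placeholders; the relevant edits are:
#   * -T and -o point at 4-test instead of 4-unitigger
#   * -dg/-db (deviation thresholds) are lowered to split out the individuals
#   * -eg/-eM stay at the overlap error ceiling (0.039 in this run)
# The repeat-breaking thresholds would also be lowered here; the exact flags
# vary by canu version, so they are left out of this sketch.
bogart \
  -G ../acp.gkpStore \
  -O ../acp.ovlStore \
  -T 4-test/acp.tigStore \
  -o 4-test/acp \
  -eg 0.039 -eM 0.039 \
  -dg 3 -db 3 \
  > 4-test/unitigger.err 2>&1
```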
This will do a better job of separating the individuals, and you'll likely end up with 1.5-2x your expected genome size. If you want to smash instead, increase -dg and -db. Trying this on your current assembly is quick and should increase your NG50. However, I think you want to run the correction to get more output coverage, so you don't have coverage gaps when assembling, as I suggested in an earlier post.
What's at the start of acp.err0.013.001.filterOverlaps.thr000.num000.log? It's a small file. The median error rate shows, more or less, which overlaps are being used. You might get a small gain in separation by dropping -eg and -eM; these will prevent bogart from even seeing the high-error overlaps at all. Comparing the same file across later attempts will give an indication of when you've gone too far: I expect the number of reads with two best edges, in particular two mutual best edges, will plummet. For smashy-smashy, you'd want to increase -eg and -eM, but your overlaps might not be computed that high to begin with. I think they're at 3.9% == 0.039; check the script in 1-overlapper. In that case, you'll be computing overlaps again.
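A quick way to check both of those things; the log name comes from the thread, while the grep pattern and the overlapper script location are guesses that may differ by canu version:

```sh
# Peek at the start of the overlap-filtering log named above
# (run from the directory that contains it; the path may differ in your run).
head -n 20 acp.err0.013.001.filterOverlaps.thr000.num000.log

# Look for the error threshold the overlaps were computed at; the pattern is a
# guess at how it appears in the 1-overlapper script.
grep -nE "erate|0\.039|-e " 1-overlapper/*.sh
```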
Hi @brianwalenz. Yes, -eg and -eM are at 0.039. I will try to improve the correction first and then optimize -dg and -db as well as -eg and -eM unless
Moving from active issue to feature request. https://github.com/marbl/canu/projects/1
I am assembling a heterogeneous sample (a pool of sexually reproducing insects) using the latest version of canu. Raw coverage was 80X (average read length 7.2 kb), which is reduced to 26X after correction with canu.
I have only tried the defaults, except for decreasing the error rate to 0.015 and 0.013 as advised for high-coverage samples. Would you recommend any alternative parameter choices when handling such a sample? Thanks
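For reference, a run matching that description might have looked roughly like the sketch below. The prefix acp and the error rate 0.013 are inferred from the log file name mentioned above; the genome size and read file are placeholders, and errorRate is assumed to be the parameter meant by "decreasing the error rate":

```sh
# Hypothetical reconstruction of the run described above; genome size and
# input reads are placeholders, and errorRate is an assumption.
canu -p acp -d acp-assembly \
  genomeSize=300m \
  errorRate=0.013 \
  -pacbio-raw raw_reads.fastq
```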