Parameter tweaking for assembling heterozygous sample #201


Closed
suryasaha opened this issue Jul 19, 2016 · 10 comments

@suryasaha
Contributor

I am assembling a heterogeneous sample (a pool of sexually reproducing insects) using the latest version of canu. Raw coverage was 80X (average read length 7.2 kbp), which drops to 26X after correction by canu.

I have only tried the defaults, except for decreasing the error rate to 0.015 and 0.013 as advised for high-coverage samples. Would you recommend any alternative parameter choices for handling such a sample? Thanks

@skoren
Member

skoren commented Jul 19, 2016

The drop in coverage after correction is expected, since canu corrects only the longest 40X of data, which is being reduced to 26X here.

I think what you'd want to do is use the best overlaps for correction and correct as much data as possible. To do that, I'd set `corOutCoverage=100 corMaxEvidenceErate=0.15 corMhapSensitivity=normal`.
This will keep only the best overlaps, to avoid mixing reads as much as possible, and use them to correct all the data it can, giving more coverage for assembly. Canu already computes all-pairs overlaps, so you could re-use your previous run as in issue #197, or re-start from scratch.
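
For reference, those options would slot into a full canu command roughly like this (a sketch only: the output directory, genome size, and read file are placeholders, not values from this thread):

```shell
# Hypothetical re-run of correction with the suggested options.
# The -p/-d names, genomeSize, and the input file are placeholders.
canu -p acp -d acp-recorrected \
  genomeSize=<your-estimate> \
  corOutCoverage=100 \
  corMaxEvidenceErate=0.15 \
  corMhapSensitivity=normal \
  -pacbio-raw raw_reads.fastq
```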

What do the sizes look like in your unitig outputs (the 005 and 009 logs; you can `tail` both files)? Based on those, it may be possible to adjust the unitigging parameters to merge more aggressively.

@suryasaha
Contributor Author

Thanks for the quick reply @skoren !

tail  41cells.acp-corrected-trim-err0.013/unitigging/4-unitigger/acp.err0.013.005.popBubbles.thr000.num000.log
potential bubble 25419 had no valid placements (all were not contained in target tig)
potential bubble 25471 had no valid placements (all were not contained in target tig)
potential bubble 25490 had no valid placements (all were not contained in target tig)
potential bubble 25496 had no valid placements (all were not contained in target tig)
potential bubble 25526 had no valid placements (all were not contained in target tig)
potential bubble 25595 had no valid placements (all were not contained in target tig)
potential bubble 25632 had no valid placements (all were not contained in target tig)
potential bubble 25642 had no valid placements (all were not contained in target tig)
potential bubble 25652 had no valid placements (all were not contained in target tig)

I don't have a 009*log. I used errorRate but I think I should have used utgOvlErrorRate. Rerun?

@brianwalenz
Member

If you're more-or-less up to date with GitHub, the sizes are reported in *.sizes files.

@suryasaha
Contributor Author

tail 41cells.acp-corrected-trim-err0.013/unitigging/4-unitigger/acp.err0.013.005.popBubbles.sizes 
ng050    316448   lg050      280   sum   225302223  (CONTIGS)
ng060    211773   lg060      455   sum   270039057  (CONTIGS)
ng070    139082   lg070      722   sum   315031409  (CONTIGS)
ng080     95134   lg080     1116   sum   360080370  (CONTIGS)
ng090     69348   lg090     1675   sum   405010046  (CONTIGS)
ng100     51039   lg100     2437   sum   450000458  (CONTIGS)
ng110     37058   lg110     3475   sum   495008941  (CONTIGS)
ng120     26336   lg120     4920   sum   540022412  (CONTIGS)
ng130     18230   lg130     6969   sum   585005345  (CONTIGS)
ng140      8766   lg140    10267   sum   630004066  (CONTIGS)
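
(If I'm reading canu's sizes output right: ngXXX is the tig length at which XXX% of the assumed genome size is reached, lgXXX is the number of tigs needed to get there, and sum is the cumulative bases. The ng100 row crossing ~450 Mbp suggests that's the genome-size estimate in use, so each row's sum can be sanity-checked against it:)

```shell
# Sanity check: the ng050 cumulative sum should be just over 50% of the
# genome-size estimate (~450 Mbp, inferred from the ng100 row above).
awk 'BEGIN { printf "%.3f\n", 225302223 / 450000458 }'
# prints 0.501
```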

@skoren
Member

skoren commented Jul 19, 2016

Do you have the other sizes files (004 and 008)? It would also be interesting to see 001.filterOverlaps.thr000.num000.log

The next question is whether you want to try to separate out the variation, or smash it together and assemble a consensus across all the heterozygosity.

@suryasaha
Contributor Author

I would like to try to tease apart the separate individuals in this population, though that might not produce a good assembly for any one of them. It's definitely worth trying. If that doesn't work, we'll just have to take the other route and smash it all into one consensus assembly.

tail acp.err0.013.001.filterOverlaps.thr000.num000.log
FINAL EDGES
-------- ----------------------------------------
 1176542 reads are contained
   28964 reads have no best edges (singleton)
   37584 reads have only one best edge (spur) 
             9805 are mutual best
  181111 reads have two best edges 
             9409 have one mutual best edge
           153914 have two mutual best edges
tail acp.err0.013.004.placeContains.sizes
ng050    316448   lg050      280   sum   225302223  (CONTIGS)
ng060    211773   lg060      455   sum   270039057  (CONTIGS)
ng070    139082   lg070      722   sum   315031409  (CONTIGS)
ng080     95134   lg080     1116   sum   360080370  (CONTIGS)
ng090     69348   lg090     1675   sum   405010046  (CONTIGS)
ng100     51048   lg100     2437   sum   450006634  (CONTIGS)
ng110     37085   lg110     3475   sum   495029098  (CONTIGS)
ng120     26344   lg120     4918   sum   540001280  (CONTIGS)
ng130     18324   lg130     6962   sum   585004130  (CONTIGS)
ng140      9271   lg140    10207   sum   630001704  (CONTIGS)
tail acp.err0.013.008.generateOutputs.sizes
ng030    314709   lg030      240   sum   135223559  (CONTIGS)
ng040    226133   lg040      408   sum   180150489  (CONTIGS)
ng050    162254   lg050      646   sum   225050948  (CONTIGS)
ng060    121695   lg060      968   sum   270086466  (CONTIGS)
ng070     90000   lg070     1399   sum   315042039  (CONTIGS)
ng080     69137   lg080     1972   sum   360006092  (CONTIGS)
ng090     53284   lg090     2716   sum   405014332  (CONTIGS)
ng100     41088   lg100     3682   sum   450014099  (CONTIGS)
ng110     31167   lg110     4945   sum   495022065  (CONTIGS)
ng120     22122   lg120     6645   sum   540021100  (CONTIGS)

@skoren
Member

skoren commented Jul 19, 2016

First, you can try improving the current unitigging which should be fast. You can copy the 4-unitigger/unitigger.sh which should look like:

#!/bin/sh

if [ -e chm1/unitigging/asm.tigStore/seqDB.v001.tig ] ; then
  exit 0
fi

syst=`uname -s`
arch=`uname -m`
name=`uname -n`

if [ "$arch" = "x86_64" ] ; then
  arch="amd64"
fi
if [ "$arch" = "Power Macintosh" ] ; then
  arch="ppc"
fi

bin="canu-1.3/$syst-$arch/bin"

if [ ! -d "$bin" ] ; then
  bin="canu-1.3"
fi


$bin/bogart \
 -G chm1/unitigging/asm.gkpStore \
 -O chm1/unitigging/asm.ovlStore \
 -T chm1/unitigging/asm.tigStore.WORKING \
 -o chm1/unitigging/4-unitigger/asm \
 -B 21237 \
 -gs 3100000000 \
 -eg 0.039 \
 -eM 0.039 \
 -el 500 \
 -dg 6 \
 -db 6 \
 -dr 3 \
 -ca 5000 \
 -cp 500 \
 -threads 16 \
 -M 168 \
 -unassembled 2 1000 0.75 0.75 2 \
 > chm1/unitigging/4-unitigger/unitigger.err 2>&1 \
&& \
mv chm1/unitigging/asm.tigStore.WORKING chm1/unitigging/asm.tigStore.FINISHED

exit 0

Copy this file to a new folder (say 4-test) and edit -T and -o to point to 4-test instead of 4-unitigger. Lower the standard deviations allowed during assembly and drop the repeat-breaking thresholds, so the file looks like:

#!/bin/sh

syst=`uname -s`
arch=`uname -m`
name=`uname -n`

if [ "$arch" = "x86_64" ] ; then
  arch="amd64"
fi
if [ "$arch" = "Power Macintosh" ] ; then
  arch="ppc"
fi

bin="canu-1.3/$syst-$arch/bin"

if [ ! -d "$bin" ] ; then
  bin="canu-1.3"
fi


$bin/bogart \
 -G chm1/unitigging/asm.gkpStore \
 -O chm1/unitigging/asm.ovlStore \
 -T chm1/unitigging/4-test/asm.tigStore \
 -o chm1/unitigging/4-test/asm \
 -B 21237 \
 -gs 3100000000 \
 -eg 0.039 \
 -eM 0.039 \
 -el 500 \
 -dg 2 \
 -db 2 \
 -dr 1 \
 -ca 500 \
 -cp 50 \
 -threads 16 \
 -M 168 \
 -unassembled 2 1000 0.75 0.75 2 \
 > unitigger.err 2>&1

exit 0

This will do a better job of separating the individuals, and you'll likely end up with 1.5-2x your expected genome size. If you want to smash instead, increase dg and db. Trying this on your current assembly is quick and should increase your NG50. However, I still think you want to re-run the correction to get more output coverage, as I suggested in an earlier comment, so you don't have coverage gaps when assembling.

@brianwalenz
Member

What does the start of acp.err0.013.001.filterOverlaps.thr000.num000.log have? It's a small file. The median error rate more or less reflects which overlaps are used. You might get a small gain in separation by dropping -eg and -eM; these will prevent bogart from even seeing the high-error overlaps at all.

Comparing the same file across later attempts will indicate when you've gone too far: I expect the number of reads with two best edges, in particular two mutual best edges, to plummet.

For smashy-smashy, you'd want to increase -eg and -eM, but your overlaps might not have been computed at that high an error rate to begin with. I think they're at 3.9% == 0.039; check the script in 1-overlapper. If so, you'll be computing overlaps again.

@suryasaha
Contributor Author

suryasaha commented Jul 19, 2016

Hi @brianwalenz Yes, -eg and -eM are at 0.039.

I will try to improve the correction first and then optimize -dg and -db as well as -eg and -eM unless acp.err0.013.001.filterOverlaps.thr000.num000.log suggests otherwise.

cat acp.err0.013.001.filterOverlaps.thr000.num000.log

INITIAL EDGES
-------- ----------------------------------------
 1176542 reads are contained
   21389 reads have no best edges (singleton)
   25546 reads have only one best edge (spur) 
            17517 are mutual best
  200724 reads have two best edges 
            19813 have one mutual best edge
           177801 have two mutual best edges


ERROR RATES (2433306 samples)
-----------
mean   0.01146215 stddev 0.00897723 -> 0.06532555 fraction error =   6.532555% error
median 0.00880000 mad    0.00510000 -> 0.05416756 fraction error =   5.416756% error


EDGE FILTERING
-------- ------------------------------------------
       0 reads have a suspicious overlap pattern
       0 reads had edges filtered
                0 had one
                0 had two
   20189 reads have length incompatible edges
            17883 have one
             2306 have two

FINAL EDGES
-------- ----------------------------------------
 1176542 reads are contained
   28964 reads have no best edges (singleton)
   37584 reads have only one best edge (spur) 
             9805 are mutual best
  181111 reads have two best edges 
             9409 have one mutual best edge
           153914 have two mutual best edges
tail -20 1-overlapper/overlap.sh 
  bin="/data/.../canu"
fi


$bin/overlapInCore \
  -t 6 \
  -k 22 \
  -k canu/41cells.acp-corrected-trim-err0.013/unitigging/0-mercounts/acp.err0.013.ms22.frequentMers.fasta \
  --hashbits 23 \
  --hashload 0.75 \
  --maxerate  0.039 \
  --minlength 500 \
  $opt \
  -o canu/41cells.acp-corrected-trim-err0.013/unitigging/1-overlapper/$job.ovb.WORKING.gz \
  -s canu/41cells.acp-corrected-trim-err0.013/unitigging/1-overlapper/$job.stats \
 canu/41cells.acp-corrected-trim-err0.013/unitigging/acp.err0.013.gkpStore \
&& \
mv canu/41cells.acp-corrected-trim-err0.013/unitigging/1-overlapper/$job.ovb.WORKING.gz canu/41cells.acp-corrected-trim-err0.013/unitigging/1-overlapper/$job.ovb.gz

exit 0
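
As an aside, the "fraction error" cutoffs in the 001 log above are consistent with cutoff = center + dg × spread, using the default -dg 6 and scaling the MAD by 1.4826 to estimate a standard deviation. That is a back-of-envelope reconstruction, not necessarily bogart's exact formula, but it gives a feel for how much dropping -dg from 6 to 2 tightens the overlap-error cutoff:

```shell
# median 0.0088 and mad 0.0051 are taken from the filterOverlaps log above;
# cutoff ~= median + dg * 1.4826 * mad (assumed form; 1.4826 = MAD-to-sigma).
awk 'BEGIN {
  median = 0.0088; mad = 0.0051
  printf "dg=6: %.8f\n", median + 6 * 1.4826 * mad
  printf "dg=2: %.8f\n", median + 2 * 1.4826 * mad
}'
# dg=6 reproduces the 0.05416756 reported in the log; dg=2 gives ~0.024
```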

@brianwalenz
Member

Moving from active issue to feature request. https://github.com/marbl/canu/projects/1
