Parameter tweaking for assembling heterozygous sample #201


Closed
suryasaha opened this issue Jul 19, 2016 · 10 comments

@suryasaha
Contributor

I am assembling a heterogeneous sample (a pool of sexually reproducing insects) using the latest version of canu. Raw coverage was 80X (average read length 7.2 kbp), which drops to 26X after correction by canu.

I have only tried the defaults, except for decreasing the error rate to 0.015 and 0.013 as advised for high-coverage samples. Would you recommend any alternative parameter choices for handling such a sample? Thanks

@skoren
Member

skoren commented Jul 19, 2016

The drop in coverage after correction is expected, since canu corrects only the longest 40X of data, which is being reduced to 26X here.

I think what you'd want to do is use the best overlaps for correction and correct as much data as possible. To do that, I'd set `corOutCoverage=100 corMaxEvidenceErate=0.15 corMhapSensitivity=normal`.
This will keep only the best overlaps, to avoid mixing reads as much as possible, and use them to correct all the data it can, giving more coverage for assembly. Canu already computes all-pairs overlaps, so you could re-use your previous run as in issue #197, or re-start from scratch.
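
For reference, those options would slot into a full canu command roughly like this (a sketch only: the output directory, genome size, and read file are placeholders, not values from this thread):

```shell
# Hypothetical re-run of correction with the suggested options.
# The -p/-d names, genomeSize, and the input file are placeholders.
canu -p acp -d acp-recorrected \
  genomeSize=<your-estimate> \
  corOutCoverage=100 \
  corMaxEvidenceErate=0.15 \
  corMhapSensitivity=normal \
  -pacbio-raw raw_reads.fastq
```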

What do the sizes look like in your unitig outputs (the 005 and 009 logs; you can `tail` both files)? Based on those, it may be possible to adjust the unitigging parameters to merge more aggressively.

@suryasaha
Contributor Author

Thanks for the quick reply @skoren !

tail  41cells.acp-corrected-trim-err0.013/unitigging/4-unitigger/acp.err0.013.005.popBubbles.thr000.num000.log
potential bubble 25419 had no valid placements (all were not contained in target tig)
potential bubble 25471 had no valid placements (all were not contained in target tig)
potential bubble 25490 had no valid placements (all were not contained in target tig)
potential bubble 25496 had no valid placements (all were not contained in target tig)
potential bubble 25526 had no valid placements (all were not contained in target tig)
potential bubble 25595 had no valid placements (all were not contained in target tig)
potential bubble 25632 had no valid placements (all were not contained in target tig)
potential bubble 25642 had no valid placements (all were not contained in target tig)
potential bubble 25652 had no valid placements (all were not contained in target tig)

I don't have a 009*log. I used errorRate but I think I should have used utgOvlErrorRate. Rerun?

@brianwalenz
Member

If you're more-or-less up to date with GitHub, the sizes are reported in *.sizes files.

@suryasaha
Contributor Author

tail 41cells.acp-corrected-trim-err0.013/unitigging/4-unitigger/acp.err0.013.005.popBubbles.sizes 
ng050    316448   lg050      280   sum   225302223  (CONTIGS)
ng060    211773   lg060      455   sum   270039057  (CONTIGS)
ng070    139082   lg070      722   sum   315031409  (CONTIGS)
ng080     95134   lg080     1116   sum   360080370  (CONTIGS)
ng090     69348   lg090     1675   sum   405010046  (CONTIGS)
ng100     51039   lg100     2437   sum   450000458  (CONTIGS)
ng110     37058   lg110     3475   sum   495008941  (CONTIGS)
ng120     26336   lg120     4920   sum   540022412  (CONTIGS)
ng130     18230   lg130     6969   sum   585005345  (CONTIGS)
ng140      8766   lg140    10267   sum   630004066  (CONTIGS)
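
(If I'm reading canu's sizes output right: ngXXX is the tig length at which XXX% of the assumed genome size is reached, lgXXX is the number of tigs needed to get there, and sum is the cumulative bases. The ng100 row crossing ~450 Mbp suggests that's the genome-size estimate in use, so each row's sum can be sanity-checked against it:)

```shell
# Sanity check: the ng050 cumulative sum should be just over 50% of the
# genome-size estimate (~450 Mbp, inferred from the ng100 row above).
awk 'BEGIN { printf "%.3f\n", 225302223 / 450000458 }'
# prints 0.501
```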

@skoren
Member

skoren commented Jul 19, 2016

Do you have the other sizes files (004 and 008)? It would also be interesting to see 001.filterOverlaps.thr000.num000.log

The next question is whether you want to try to separate out the variation, or smash it together and assemble a consensus across all the heterozygosity.

@suryasaha
Contributor Author

I would like to try to tease apart the separate individuals in this population, though that might not produce a good assembly for any one of them. It's definitely worth trying. If that doesn't work, we'll just have to take the other route and smash it all into one consensus assembly.

tail acp.err0.013.001.filterOverlaps.thr000.num000.log
FINAL EDGES
-------- ----------------------------------------
 1176542 reads are contained
   28964 reads have no best edges (singleton)
   37584 reads have only one best edge (spur) 
             9805 are mutual best
  181111 reads have two best edges 
             9409 have one mutual best edge
           153914 have two mutual best edges
tail acp.err0.013.004.placeContains.sizes
ng050    316448   lg050      280   sum   225302223  (CONTIGS)
ng060    211773   lg060      455   sum   270039057  (CONTIGS)
ng070    139082   lg070      722   sum   315031409  (CONTIGS)
ng080     95134   lg080     1116   sum   360080370  (CONTIGS)
ng090     69348   lg090     1675   sum   405010046  (CONTIGS)
ng100     51048   lg100     2437   sum   450006634  (CONTIGS)
ng110     37085   lg110     3475   sum   495029098  (CONTIGS)
ng120     26344   lg120     4918   sum   540001280  (CONTIGS)
ng130     18324   lg130     6962   sum   585004130  (CONTIGS)
ng140      9271   lg140    10207   sum   630001704  (CONTIGS)
tail acp.err0.013.008.generateOutputs.sizes
ng030    314709   lg030      240   sum   135223559  (CONTIGS)
ng040    226133   lg040      408   sum   180150489  (CONTIGS)
ng050    162254   lg050      646   sum   225050948  (CONTIGS)
ng060    121695   lg060      968   sum   270086466  (CONTIGS)
ng070     90000   lg070     1399   sum   315042039  (CONTIGS)
ng080     69137   lg080     1972   sum   360006092  (CONTIGS)
ng090     53284   lg090     2716   sum   405014332  (CONTIGS)
ng100     41088   lg100     3682   sum   450014099  (CONTIGS)
ng110     31167   lg110     4945   sum   495022065  (CONTIGS)
ng120     22122   lg120     6645   sum   540021100  (CONTIGS)

@skoren
Member

skoren commented Jul 19, 2016

First, you can try improving the current unitigging which should be fast. You can copy the 4-unitigger/unitigger.sh which should look like:

#!/bin/sh

if [ -e chm1/unitigging/asm.tigStore/seqDB.v001.tig ] ; then
  exit 0
fi

syst=`uname -s`
arch=`uname -m`
name=`uname -n`

if [ "$arch" = "x86_64" ] ; then
  arch="amd64"
fi
if [ "$arch" = "Power Macintosh" ] ; then
  arch="ppc"
fi

bin="canu-1.3/$syst-$arch/bin"

if [ ! -d "$bin" ] ; then
  bin="canu-1.3"
fi


$bin/bogart \
 -G chm1/unitigging/asm.gkpStore \
 -O chm1/unitigging/asm.ovlStore \
 -T chm1/unitigging/asm.tigStore.WORKING \
 -o chm1/unitigging/4-unitigger/asm \
 -B 21237 \
 -gs 3100000000 \
 -eg 0.039 \
 -eM 0.039 \
 -el 500 \
 -dg 6 \
 -db 6 \
 -dr 3 \
 -ca 5000 \
 -cp 500 \
 -threads 16 \
 -M 168 \
 -unassembled 2 1000 0.75 0.75 2 \
 > chm1/unitigging/4-unitigger/unitigger.err 2>&1 \
&& \
mv chm1/unitigging/asm.tigStore.WORKING chm1/unitigging/asm.tigStore.FINISHED

exit 0

Copy this file to a new folder (say 4-test) and edit -T and -o to point to 4-test instead of 4-unitigger. Lower the standard deviations allowed during assembly and drop the repeat-breaking thresholds, so the file looks like:

#!/bin/sh

syst=`uname -s`
arch=`uname -m`
name=`uname -n`

if [ "$arch" = "x86_64" ] ; then
  arch="amd64"
fi
if [ "$arch" = "Power Macintosh" ] ; then
  arch="ppc"
fi

bin="canu-1.3/$syst-$arch/bin"

if [ ! -d "$bin" ] ; then
  bin="canu-1.3"
fi


$bin/bogart \
 -G chm1/unitigging/asm.gkpStore \
 -O chm1/unitigging/asm.ovlStore \
 -T chm1/unitigging/4-test/asm.tigStore \
 -o chm1/unitigging/4-test/asm \
 -B 21237 \
 -gs 3100000000 \
 -eg 0.039 \
 -eM 0.039 \
 -el 500 \
 -dg 2 \
 -db 2 \
 -dr 1 \
 -ca 500 \
 -cp 50 \
 -threads 16 \
 -M 168 \
 -unassembled 2 1000 0.75 0.75 2 \
 > unitigger.err 2>&1

exit 0

This will do a better job of separating the individuals, and you'll likely end up with 1.5-2x your expected genome size. If you want to smash instead, increase dg and db. Trying this on your current assembly is quick and should increase your NG50. However, I still think you want to re-run the correction to get more output coverage, as I suggested in an earlier comment, so you don't have coverage gaps when assembling.

@brianwalenz
Member

What does the start of acp.err0.013.001.filterOverlaps.thr000.num000.log have? It's a small file. The median error rate more or less reflects which overlaps are used. You might get a small gain in separation by dropping -eg and -eM; these will prevent bogart from even seeing the high-error overlaps at all.

Comparing the same file across later attempts will indicate when you've gone too far: I expect the number of reads with two best edges, in particular two mutual best edges, to plummet.

For smashy-smashy, you'd want to increase -eg and -eM, but your overlaps might not have been computed at that high an error rate to begin with. I think they're at 3.9% == 0.039; check the script in 1-overlapper. If so, you'll be computing overlaps again.

@suryasaha
Contributor Author

suryasaha commented Jul 19, 2016

Hi @brianwalenz Yes, -eg and -eM are at 0.039.

I will try to improve the correction first and then optimize -dg and -db as well as -eg and -eM unless acp.err0.013.001.filterOverlaps.thr000.num000.log suggests otherwise.

cat acp.err0.013.001.filterOverlaps.thr000.num000.log

INITIAL EDGES
-------- ----------------------------------------
 1176542 reads are contained
   21389 reads have no best edges (singleton)
   25546 reads have only one best edge (spur) 
            17517 are mutual best
  200724 reads have two best edges 
            19813 have one mutual best edge
           177801 have two mutual best edges


ERROR RATES (2433306 samples)
-----------
mean   0.01146215 stddev 0.00897723 -> 0.06532555 fraction error =   6.532555% error
median 0.00880000 mad    0.00510000 -> 0.05416756 fraction error =   5.416756% error


EDGE FILTERING
-------- ------------------------------------------
       0 reads have a suspicious overlap pattern
       0 reads had edges filtered
                0 had one
                0 had two
   20189 reads have length incompatible edges
            17883 have one
             2306 have two

FINAL EDGES
-------- ----------------------------------------
 1176542 reads are contained
   28964 reads have no best edges (singleton)
   37584 reads have only one best edge (spur) 
             9805 are mutual best
  181111 reads have two best edges 
             9409 have one mutual best edge
           153914 have two mutual best edges
tail -20 1-overlapper/overlap.sh 
  bin="/data/.../canu"
fi


$bin/overlapInCore \
  -t 6 \
  -k 22 \
  -k canu/41cells.acp-corrected-trim-err0.013/unitigging/0-mercounts/acp.err0.013.ms22.frequentMers.fasta \
  --hashbits 23 \
  --hashload 0.75 \
  --maxerate  0.039 \
  --minlength 500 \
  $opt \
  -o canu/41cells.acp-corrected-trim-err0.013/unitigging/1-overlapper/$job.ovb.WORKING.gz \
  -s canu/41cells.acp-corrected-trim-err0.013/unitigging/1-overlapper/$job.stats \
 canu/41cells.acp-corrected-trim-err0.013/unitigging/acp.err0.013.gkpStore \
&& \
mv canu/41cells.acp-corrected-trim-err0.013/unitigging/1-overlapper/$job.ovb.WORKING.gz canu/41cells.acp-corrected-trim-err0.013/unitigging/1-overlapper/$job.ovb.gz

exit 0
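
As an aside, the "fraction error" cutoffs in the 001 log above are consistent with cutoff = center + dg × spread, using the default -dg 6 and scaling the MAD by 1.4826 to estimate a standard deviation. That is a back-of-envelope reconstruction, not necessarily bogart's exact formula, but it gives a feel for how much dropping -dg from 6 to 2 tightens the overlap-error cutoff:

```shell
# median 0.0088 and mad 0.0051 are taken from the filterOverlaps log above;
# cutoff ~= median + dg * 1.4826 * mad (assumed form; 1.4826 = MAD-to-sigma).
awk 'BEGIN {
  median = 0.0088; mad = 0.0051
  printf "dg=6: %.8f\n", median + 6 * 1.4826 * mad
  printf "dg=2: %.8f\n", median + 2 * 1.4826 * mad
}'
# dg=6 reproduces the 0.05416756 reported in the log; dg=2 gives ~0.024
```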

@brianwalenz
Member

Moving from active issue to feature request. https://github.com/marbl/canu/projects/1
