Improved structural variant discovery in accurate long reads using sample-specific strings (SFS)

SVDSS: Structural Variant Discovery from Sample-specific Strings

Note: SVDSS is designed to work with accurate long reads (e.g., PacBio HiFi). It can theoretically work with other technologies (e.g., ONT) but results may be inaccurate.


SVDSS is a method for structural variant discovery from accurate long reads (e.g., PacBio HiFi), based on the notion of sample-specific strings (SFS, or simply specific strings).

SFS are the shortest substrings that are unique to one sample, called the target, with respect to a reference genome. Our method uses SFS for coarse-grained identification (anchoring) of potential SV sites and performs local partial-order assembly (POA) of clusters of SFS from such sites to produce accurate SV predictions. We refer to our manuscript on SFS for more details regarding the concept of SFS.

Download and Installation

You can get SVDSS in three different ways: by compiling it from source, by downloading the static binary, or by installing it from conda.

Compilation from Source

To compile and use SVDSS, you need:

  • a C++14-compliant compiler (GCC 8.2 or newer)
  • make, automake, autoconf
  • cmake (>=3.14)
  • git
  • some other development libraries: zlib, bz2, lzma
  • samtools and bcftools (>=1.9)
  • kanpig (optional, just for genotyping)

To install these dependencies:

# On a deb-based system (tested on ubuntu 20.04 and debian 11):
sudo apt install build-essential autoconf cmake git zlib1g-dev libbz2-dev liblzma-dev samtools bcftools
# On a rpm-based system (tested on fedora 35):
sudo dnf install gcc gcc-c++ make automake autoconf cmake git libstdc++-static zlib-devel bzip2-devel xz-devel samtools bcftools
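Before building, it can help to confirm that the command-line tools from the requirements list are actually on your PATH. The following is a minimal sketch: check_tools is a hypothetical helper (not part of SVDSS), and it only checks that each tool is present, not that it meets the minimum versions listed above.

```shell
#!/bin/sh
# check_tools is a hypothetical helper, not part of SVDSS.
# It prints the names of the given tools that are NOT found on PATH.
check_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space before printing.
    echo "${missing# }"
}

# Tools taken from the requirements list above (versions are not checked).
missing_tools=$(check_tools g++ make automake autoconf cmake git samtools bcftools)
if [ -n "$missing_tools" ]; then
    echo "Missing tools: $missing_tools"
else
    echo "All required tools found"
fi
```

If anything is reported as missing, install it with the apt or dnf command above before running cmake.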

A few additional libraries are needed to build and run SVDSS, but they are downloaded and compiled automatically while compiling SVDSS.

To download and install SVDSS (should take ~10 minutes):

git clone https://github.com/Parsoa/SVDSS.git
cd SVDSS 
mkdir build ; cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

This will create the SVDSS binary in the root of the repo.

Static Binary

For user convenience, we also provide a static binary for x86_64 Linux systems (see Releases); use it at your own risk. If it does not work, please let us know.

Install from Conda

SVDSS is available on bioconda:

conda create -n svdss -c conda-forge -c bioconda svdss

This will create the environment svdss that includes SVDSS and its runtime dependencies (i.e., samtools and bcftools).

Usage Guide

Please refer to the Detailed Usage Guide below or use run_svdss.

Usage: run_svdss <reference.fa> <alignments.bam>

Arguments:
     -w                 output directory (default: .)
     -i                 use this FMD index/store it here (default: build FMD index and store to <reference.fa.fmd>)
     -q                 mapping quality (default: 20)
     -p                 accuracy percentile (default: 0.98)
     -s                 minimum support for calling (default: 2)
     -l                 minimum length for SV (default: 50)
     -t                 do not consider haplotagging information (default: consider it)
     -@                 number of threads (default: 4)
     -x                 path to SVDSS binary (default: SVDSS)
     -r                 path to ropebwt3 binary (default: ropebwt3)
     -k                 path to kanpig binary (default: kanpig)
     -v                 print version
     -h                 print this help and exit

Positional arguments:
     <reference.fa>     reference file in FASTA format
     <alignments.bam>   alignments in BAM format

Detailed Usage Guide

SVDSS requires as input the BAM file of the sample to be genotyped and a reference genome in FASTA format (please use an appropriate reference genome, i.e., if you are not interested in ALT contigs, filter them out or use a reference genome that does not include them). To genotype a sample we need to perform the following steps:

  1. Build FMD index of reference genome (SVDSS index)
  2. Smooth the input BAM file (SVDSS smooth)
  3. Extract SFS from smoothed BAM file (SVDSS search)
  4. Assemble SFS into superstrings (SVDSS assemble)
  5. Call SVs from the assembled superstrings (SVDSS call)
  6. Genotype SVs using kanpig

In the guide below we assume we are using the reference genome file GRCh38.fa and the input BAM file sample.bam.

Note that you can reuse the index from step 1 for any number of samples genotyped against the same reference genome.

We will now explain each step in more detail:

Index reference genome

Build the FMD index of the reference genome:

SVDSS index --reference GRCh38.fa --index GRCh38.fa.fmd

The --index option specifies the output file name.

Smoothing the target sample

Smoothing removes nearly all SNPs, small indels, and sequencing errors from the reads. As a result, fewer SFS are extracted, and the extracted SFS are significantly more relevant to SV discovery. To smooth the sample, run:

SVDSS smooth --reference GRCh38.fa --bam sample.bam --threads 16 > smoothed.bam
samtools index smoothed.bam

This writes the smoothed BAM to stdout. The file is sorted in the same order as the input file; however, it needs to be indexed again with samtools index.
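Because later steps expect an indexed BAM, a small guard can catch a missing or stale index before they run. This is a sketch only: ensure_bam_index and the SAMTOOLS variable are hypothetical names, the helper assumes the default .bai file name written by samtools index, and the -nt file test is a widely supported shell extension rather than strict POSIX.

```shell
#!/bin/sh
# ensure_bam_index is a hypothetical helper, not part of SVDSS or samtools.
# It (re)creates the .bai index when it is missing or older than the BAM file.
SAMTOOLS="${SAMTOOLS:-samtools}"

ensure_bam_index() {
    bam="$1"
    bai="$bam.bai"   # default index name written by `samtools index`
    if [ ! -e "$bai" ] || [ "$bam" -nt "$bai" ]; then
        "$SAMTOOLS" index "$bam"
    else
        echo "index up to date: $bai"
    fi
}

# Example call (commented out so that the snippet can be sourced safely):
# ensure_bam_index smoothed.bam
```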

Extract SFS from target sample

To extract SFS run:

SVDSS search --index GRCh38.fa.fmd --bam smoothed.bam > specifics.txt

This writes to stdout the list of specific strings. The output includes the coordinates of SFS relative to the reads they were extracted from.

Call SVs

We are now ready to call SVs. Note that the input BAM must be the same one used in the search step and must be indexed with samtools index. Run:

SVDSS call --reference GRCh38.fa --bam smoothed.bam --sfs specifics.txt --threads 16 > calls.vcf

You can filter the reported SVs by passing the --min-sv-length and --min-cluster-weight options. These options control the minimum length and minimum number of supporting superstrings for the reported SVs. Higher values for --min-cluster-weight will increase precision at the cost of reducing recall. For a diploid 30x coverage sample, --min-cluster-weight 2 produced the best results in our experiments. For a 30x sample, instead, you can try to increase this to 3 or 4.

This command writes the calls to stdout. Additionally, you can output the alignments of the POA contigs against the reference genome (these POA consensus sequences are used to call SVs) using the --poa option.
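The smoothing, search, and calling commands can be chained in one small script. The sketch below uses only the subcommands and flags shown in this guide and assumes the FMD index has already been built. The names run_to, svdss_steps, and DRY_RUN are hypothetical, and the defaults (16 threads, minimum SV length 50, minimum cluster weight 2) are illustrative choices. With DRY_RUN=1 the commands are printed instead of executed, which is how the snippet is invoked here.

```shell
#!/bin/sh
# Sketch of the manual steps after indexing, using only flags shown above.
# run_to, svdss_steps and DRY_RUN are hypothetical names for illustration.
run_to() {
    # Run a command and redirect its stdout to the file given as first argument.
    outfile="$1"; shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$* > $outfile"
    else
        "$@" > "$outfile"
    fi
}

svdss_steps() {
    ref="$1"; idx="$2"; bam="$3"
    threads="${4:-16}"; min_len="${5:-50}"; min_weight="${6:-2}"

    run_to smoothed.bam SVDSS smooth --reference "$ref" --bam "$bam" --threads "$threads" || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "samtools index smoothed.bam"
    else
        samtools index smoothed.bam || return 1
    fi
    run_to specifics.txt SVDSS search --index "$idx" --bam smoothed.bam || return 1
    run_to calls.vcf SVDSS call --reference "$ref" --bam smoothed.bam --sfs specifics.txt \
        --threads "$threads" --min-sv-length "$min_len" --min-cluster-weight "$min_weight"
}

# Preview the commands without running anything:
DRY_RUN=1 svdss_steps GRCh38.fa GRCh38.fa.fmd sample.bam
```

Each step stops the chain on failure, so a failed smoothing or search step does not lead to a call on incomplete input.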

Example

Note: to run this example, samtools and bcftools must be on your PATH. Once the example data is downloaded, running SVDSS on it should take less than 5 minutes.

cd [svdss-local-repo]

# Download example data from zenodo
wget https://zenodo.org/record/6563662/files/svdss-data.tar.gz
mkdir -p input
tar xvfz svdss-data.tar.gz -C input
# Download SVDSS binary
wget https://github.com/Parsoa/SVDSS/releases/download/v2.1.0/SVDSS_linux_x86-64
chmod +x SVDSS_linux_x86-64

# Run the full pipeline (assuming kanpig is in your path, otherwise SVs won't be genotyped)
./run_svdss -x SVDSS_linux_x86-64 -r ./build/ropebwt-prefix/src/ropebwt/ropebwt3 -w svdss2-output input/22.fa input/22.bam

Authors

SVDSS is developed by Luca Denti, Parsoa Khorsand, and Thomas Krannich.

For inquiries on this software please open an issue.

Citation

SVDSS is published in Nature Methods.

Experiments

Instructions on how to reproduce the experiments described in the manuscript can be found here (also provided as submodule of this repository).