SCycDB

SCycDB: A Curated Functional Gene Database for Metagenomic Profiling of Sulfur Cycling Pathways

Alternative links containing unzipped database files: https://www.alipan.com/s/Gs62Gxax7ii

Yu, X., Zhou, J., Song, W., Xu, M., He, Q., Peng, Y., Tian, Y., Wang, C., Shu, L., Wang, S., Yan, Q., Liu, J., Tu, Q. and He, Z. (2020), SCycDB: A Curated Functional Gene Database for Metagenomic Profiling of Sulfur Cycling Pathways. Mol Ecol Resour. https://doi.org/10.1111/1755-0998.13306

The sulfur (S) cycle driven by microorganisms is an important biogeochemical process in the Earth's biosphere, and it is usually coupled with carbon, nitrogen and metal cycling in natural ecosystems. Shotgun metagenome sequencing has opened a new avenue for understanding S cycling microbial communities. However, accurate metagenomic profiling of these communities remains technically challenging, mainly because of low coverage of S cycling genes/pathways, difficulties in distinguishing homologous genes, and long search times against publicly available orthology databases. A comprehensive and accurate database is therefore essential for characterizing S cycling microbial communities in metagenomic studies. To address these problems, we constructed a manually curated sulfur cycling database (SCycDB) for analyzing metagenome sequencing data from S cycling microbial communities in the environment.

SCycDB contains 207 gene families and 585,055 representative sequences affiliated with 52 phyla and 2,684 genera of bacteria/archaea. It also includes 20,761 homologous orthology groups to reduce false positive sequence assignments.

Four files are included in SCycDB:

1. SCycDB_2020Mar.zip: representative sequences in FASTA format, obtained by clustering curated sequences at 100% sequence identity. This file can be used to search shotgun metagenomes for SCycDB genes with BLAST-like tools.

2. id2gene.2020Mar.map: a mapping file that maps sequence IDs to gene names. Only sequences belonging to SCycDB gene families are included; sequences for SCycDB homologs are not. This file is used to generate SCycDB profiles from BLAST-like results against the SCycDB database.

3. SCycDB_FunctionProfiler.PL: a Perl script for functional profiling of S cycling genes.

4. SCycDB_TaxonomyProfiler.PL: a Perl script for taxonomic profiling of S cycling microbial communities.
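To illustrate how the mapping file relates to BLAST-like results, here is a minimal Python sketch. It is not the logic of SCycDB_FunctionProfiler.PL. It assumes that the map is a two-column, tab-delimited file (sequence ID, gene name), that hits are in BLAST tabular format with the query ID in column 1 and the subject ID in column 2, and that the first hit listed for each query is the one to keep. The function names and the demo IDs are hypothetical.

```python
import io
from collections import Counter


def load_id2gene(handle):
    """Read a tab-delimited map of sequence ID -> gene name (assumed layout)."""
    id2gene = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            id2gene[fields[0]] = fields[1]
    return id2gene


def count_genes(hits_handle, id2gene):
    """Count gene hits from tabular results, keeping the first hit per query."""
    seen_queries = set()
    counts = Counter()
    for line in hits_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        query, subject = fields[0], fields[1]
        if query in seen_queries:
            continue  # only the first hit for each query is considered
        seen_queries.add(query)
        gene = id2gene.get(subject)
        if gene is not None:  # hits to homologs are absent from the map
            counts[gene] += 1
    return counts


# Tiny in-memory demo with made-up IDs.
demo_map = io.StringIO("seqA\tdsrA\nseqB\tsoxB\n")
demo_hits = io.StringIO(
    "read1\tseqA\t98.0\n"
    "read1\tseqB\t90.0\n"
    "read2\tseqB\t95.0\n"
    "read3\thomolog1\t99.0\n"
)
profile = count_genes(demo_hits, load_id2gene(demo_map))
print(dict(profile))
```

In this sketch, read3 hits a sequence that is not in the map (as a homolog would be), so it does not contribute to any gene count.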

DOWNLOAD/INSTALLATION

git clone https://github.com/qichao1984/SCycDB.git

Dependencies and Tools

Perl modules that can be easily installed via cpan:

List::Util

Getopt::Long

Dependencies for SCycDB_FunctionProfiler.PL (currently supported database searching tools):

usearch: https://www.drive5.com/usearch/download.html

diamond: https://github.com/bbuchfink/diamond/releases

blast: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/blast-2.2.26-x64-linux.tar.gz

Dependencies for SCycDB_TaxonomyProfiler.PL:

seqtk: https://github.com/lh3/seqtk.git

kraken2: https://github.com/DerrickWood/kraken2.git

USAGE

Before getting started, edit lines 6-18 of both scripts (SCycDB_FunctionProfiler.PL and SCycDB_TaxonomyProfiler.PL) to specify the locations of the third-party tools and their parameters. If the tools are already in the system path, no changes are needed. By default, basic parameters are used for these tools. Users are encouraged to adjust them for short reads or when stricter or more relaxed results are wanted. We also encourage users to develop useful implementations based on SCycDB.

Note: Kraken2 databases can be downloaded from https://ccb.jhu.edu/software/kraken2/index.shtml?t=downloads, or built locally.
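A quick way to see whether the third-party tools are already in the system path is sketched below in Python. The executable names used here (diamond, usearch, blastall, seqtk, kraken2) are assumptions and may differ on your installation, for example if the usearch binary carries a version suffix. Only the search tool you plan to pass to -m needs to be available.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Executable names are assumptions; adjust them to match your installation.
function_profiler_tools = ["diamond", "usearch", "blastall"]
taxonomy_profiler_tools = ["seqtk", "kraken2"]

for tool, path in find_tools(function_profiler_tools + taxonomy_profiler_tools).items():
    status = path if path else "not found on PATH; set its location in the script"
    print(f"{tool}: {status}")
```

Any tool reported as not found would need its location set in the scripts as described above.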

Example for using SCycDB_FunctionProfiler.PL:

perl SCycDB_FunctionProfiler.PL -d <workdir> -m <diamond|usearch|blast> -f <filetype> -s <seqtype> -si <sample size info file> -rs <random sampling size> -o <outfile>

Detailed explanations:

-d : specify the directory where your fasta/fastq (or gzipped) files are located.

-m : specify the database searching program you plan to use. Currently diamond, usearch and blast are supported.

-f : specify the extension of your sequence files, e.g. fastq, fastq.gz, fasta, fasta.gz, fq, fq.gz, fa, fa.gz

-s : sequence type, nucl or prot

-si: a tab-delimited file containing each sample/file name and its number of sequences. Do not include file extensions in the names.

-rs: specify the number of sequences for random subsampling. If not specified, the lowest number in -si will be used.

-o : the output file for S cycle gene profiles.
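The sample size info file passed to -si can be prepared in many ways. The Python sketch below is one possibility, not part of SCycDB. It assumes that FASTQ records span exactly four lines, that FASTA headers start with ">", and that the file needs no header row (the exact layout the scripts expect beyond "name, tab, count" is not stated here). The function names and the output file name are hypothetical.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(path):
    """Open plain or gzipped text files transparently."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def count_sequences(path, extension):
    """Count records, assuming 4-line FASTQ or '>'-prefixed FASTA headers."""
    is_fastq = extension.split(".")[0] in ("fastq", "fq")
    with open_maybe_gzip(path) as handle:
        if is_fastq:
            return sum(1 for _ in handle) // 4
        return sum(1 for line in handle if line.startswith(">"))


def write_sample_info(workdir, extension, out_path):
    """Write 'sample<TAB>count' lines, with the file extension stripped."""
    suffix = "." + extension
    rows = []
    for name in sorted(os.listdir(workdir)):
        if name.endswith(suffix):
            sample = name[: -len(suffix)]
            n_seqs = count_sequences(os.path.join(workdir, name), extension)
            rows.append((sample, n_seqs))
    with open(out_path, "w") as out:
        for sample, n_seqs in rows:
            out.write(f"{sample}\t{n_seqs}\n")
    return rows


# Demo on two small made-up FASTQ files in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "sampleA.fastq"), "w") as fh:
        fh.write("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
    with open(os.path.join(workdir, "sampleB.fastq"), "w") as fh:
        fh.write("@r1\nGGCC\n+\nIIII\n")
    info_path = os.path.join(workdir, "sample_info.txt")
    demo_rows = write_sample_info(workdir, "fastq", info_path)
    with open(info_path) as fh:
        demo_text = fh.read()

print(demo_text, end="")
```

The extension argument plays the same role as the value given to -f, so the sample names written to the file carry no extension.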

Example for using SCycDB_TaxonomyProfiler.PL:

perl SCycDB_TaxonomyProfiler.PL -d <workdir> -m <diamond|usearch|blast> -f <filetype> -s <seqtype> -si <sample size info file> -rs <random sampling size>

Detailed explanations:

-d : specify the directory where your fasta/fastq (or gzipped) files are located.

-m : specify the database searching program you plan to use. Currently diamond, usearch and blast are supported.

-f : specify the extension of your sequence files, e.g. fastq, fastq.gz, fasta, fasta.gz, fq, fq.gz, fa, fa.gz

-s : sequence type, nucl or prot

-si: a tab-delimited file containing each sample/file name and its number of sequences. Do not include file extensions in the names.

-rs: specify the number of sequences for random subsampling. If not specified, the lowest number in -si will be used.
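The interaction between -si and -rs can be pictured with the short Python sketch below. It only illustrates the idea of subsampling every sample to a common depth and does not reproduce how the scripts perform it. The function names are hypothetical, and the behaviour when a sample has fewer sequences than the requested depth (here, all sequences are kept) is an assumption.

```python
import random


def choose_depth(sample_sizes, requested=None):
    """Return the requested depth, or the smallest sample size if none is given."""
    if requested is not None:
        return requested
    return min(sample_sizes.values())


def subsample(records, depth, seed=0):
    """Draw `depth` records without replacement; keep all if fewer are available."""
    records = list(records)
    if depth >= len(records):
        return records
    return random.Random(seed).sample(records, depth)


# Made-up sample sizes, as they might appear in a -si file.
sizes = {"sampleA": 5, "sampleB": 3}
depth = choose_depth(sizes)  # no -rs given, so the lowest count is used
picked = subsample(["r1", "r2", "r3", "r4", "r5"], depth)
print(depth, len(picked))
```

Using the same depth for every sample keeps gene and taxon counts comparable across samples with different sequencing effort.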
