mocat2/mocat2

Notice

MOCAT is no longer being actively developed and supported
Please consider switching to newer tools such as NGLess and NGLess meta profilers in conjunction with the mOTUs profiler, the GMGC gene catalog and the EggNOG database and mapper.

MOCAT 2

MOCAT2 (metagenomic analysis toolkit) is a package for analyzing metagenomics datasets. Currently MOCAT2 supports Illumina single- and paired-end reads in raw FastQ format. Using MOCAT2 you can generate taxonomic and functional profiles, as well as assemble reads and predict genes in assembled sequences. The official MOCAT2 page is http://mocat.embl.de/

Unfortunately, development of MOCAT2 has stopped. MOCAT2 is described in a Bioinformatics paper (http://bit.ly/1VbJnzi). This GitHub repo contains the latest MOCAT2 version.

HOW TO INSTALL

  1. Clone the repository.
  2. Use the files in stable/2.1.3. Run the setup.MOCAT2.pl script.
  3. To use the functional annotation step, you need to download and extract the required data file into the MOCAT folder: http://vm-lux.embl.de/~kultima/share/MOCAT/v2.0/MOCAT2_data_files.zip (4GB)

OUTPUT FILES

Functional profiles

These are gene coverages summarized at a higher level, such as KEGG KOs, modules or pathways, or eggNOG OGs. Genes are assigned to these categories based on a mapping file in the MOCAT/data folder (.functional.map). This means that, although they are named functional profiles, they can also be summarized at other user-defined levels, such as species, genera or phyla, or even at combined functional and taxonomic levels such as KO.species or KO.genus.
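The summarization described above can be sketched in Python. This is a minimal illustration, not MOCAT2's implementation: the gene names, coverage values and the in-memory mapping are hypothetical, the on-disk format of the .functional.map file is not reproduced here, and the sketch assumes that a gene annotated to several categories contributes its full coverage to each of them.

```python
from collections import defaultdict

# Hypothetical gene coverages (illustrative values, not MOCAT2 output).
gene_coverage = {"gene1": 10.0, "gene2": 4.0, "gene3": 6.0}

# Hypothetical gene-to-category mapping held in memory.
# The real .functional.map file format may differ.
gene_to_categories = {
    "gene1": ["K00001"],
    "gene2": ["K00001", "K00002"],  # one gene annotated to two categories
    "gene3": [],                    # gene without any annotation
}


def summarize(coverage, mapping):
    """Sum gene coverages per category listed in the mapping."""
    totals = defaultdict(float)
    for gene, value in coverage.items():
        for category in mapping.get(gene, []):
            # Assumption: the full coverage is added to every category.
            totals[category] += value
    return dict(totals)


functional_profile = summarize(gene_coverage, gene_to_categories)
print(functional_profile)
```

Replacing the category labels with species or genus names would summarize the same coverages at a taxonomic level, which is the sense in which the mapping is user-defined.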

Taxonomic profiles

Taxonomic profiles come in two flavors: mOTU and NCBI. Each requires a specific set of mapping files and places specific requirements on the database structure. The current version of MOCAT2 ships with a database for each of the two flavors: mOTU.v1 and RefMG.v1, respectively.

mOTU profiles

These are generated by first mapping reads to the mOTU.v1 database and filtering them, and then running the profiling step with the option -mode mOTU. The abundances of 10 marker genes are summarized into (annotated) mOTU linkage groups (mOTU-LGs).

NCBI profiles

After reads have been mapped to the RefMG.v1 database and filtered, the profiling step with the option -mode NCBI summarizes the gene abundances into coverages at NCBI taxonomic levels, from phylum to species, including specI (Mende et al., 2013) coverages.

Different output formats

Both insert and base coverages are calculated in MOCAT. An insert is defined as either a single read or a matching read pair. Each of these two coverage types is calculated as raw counts (raw), gene length normalized coverages (norm), and scaled gene length normalized coverages (scaled).

Scaled files are gene length normalized abundances multiplied by a scaling factor. These files should be used when the -1 fraction (i.e. inserts or bases that do not map to the database) is important. All values are re-scaled, keeping each value's fraction of the total constant, so that the value of the -1 fraction is the same as in the raw files. This makes it possible to use gene length normalized counts together with the -1 fraction.

Finally, as a third layer, bases and inserts from reads mapping to more than one gene with the same alignment score (i.e. multiple mappers) are distributed either evenly or according to the abundance of uniquely mapping bases and inserts (mm.dist.among.unique). MOCAT2 also saves gene abundances based only on reads mapping to one unique location (only.unique). The combinations of these options result in the files listed below.

Which of these files should you use?

Below we list some recommended uses of the different files, but in general we recommend using the mm.dist.among.unique files.

In the scaled files, gene length normalized insert/base abundances are multiplied by the abundance-weighted average gene length. This enables the use of the -1 fraction (as it is constant). If the -1 fraction is ignored and relative abundances are used, the results will be the same for the norm and scaled files.
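As a rough illustration of this scaling, the Python sketch below uses hypothetical counts and gene lengths. It assumes one reading of "abundance-weighted average gene length", namely the mean gene length weighted by the normalized abundances; MOCAT2's exact computation may differ. Under that assumption the scaled values sum to the raw total, so the -1 fraction keeps the same share as in the raw files.

```python
# Hypothetical raw insert counts and gene lengths (not MOCAT2 output).
raw = {"geneA": 300.0, "geneB": 100.0}
length = {"geneA": 3000, "geneB": 500}
unmapped = 600.0  # the -1 fraction: inserts that map to no gene

# Gene length normalized abundances (norm).
norm = {g: raw[g] / length[g] for g in raw}

# Assumed scaling factor: average gene length weighted by the
# normalized abundances. This is an interpretation, not MOCAT2 code.
scaling_factor = sum(norm[g] * length[g] for g in norm) / sum(norm.values())

# Scaled abundances (scaled).
scaled = {g: norm[g] * scaling_factor for g in norm}


def relative(values):
    """Convert values to fractions of their total."""
    total = sum(values.values())
    return {g: v / total for g, v in values.items()}


def unmapped_share(values):
    """Share of the -1 fraction when combined with the given values."""
    return unmapped / (unmapped + sum(values.values()))


print(relative(norm))
print(relative(scaled))
print(unmapped_share(raw), unmapped_share(scaled))
```

When the -1 fraction is left out, the relative abundances from norm and scaled agree, because the two differ only by a constant factor.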

Multiple mappers distributed evenly

  * <base|insert>.raw: insert.raw can be used as input to e.g. DESeq2.
  * <base|insert>.norm: base.norm represents the most commonly used gene length normalized base counts. This was used in (Zeller et al., 2014).
  * <base|insert>.scaled: if relative abundances are used (total row sum is 1), these files yield the same results as the .norm files.

Multiple mappers distributed according to unique

  * <base|insert>.mm.dist.among.unique.raw: these could also be used in DESeq2.
  * <base|insert>.mm.dist.among.unique.norm: for normal use, we recommend these files.
  * <base|insert>.mm.dist.among.unique.scaled: these are the values on which taxonomic profiles are calculated. Use these files if you need the -1 fraction.
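The difference between distributing multiple mappers evenly and distributing them according to unique abundance can be sketched as follows. This is a simplified Python illustration with hypothetical counts, not MOCAT2's code. The fallback to an even split when none of the candidate genes has uniquely mapping inserts is an assumption made here so that the sketch is complete.

```python
def distribute_evenly(unique_counts, multi_reads):
    """Give each candidate gene an equal share of every multiple mapper."""
    counts = dict(unique_counts)
    for genes in multi_reads:
        share = 1.0 / len(genes)
        for gene in genes:
            counts[gene] = counts.get(gene, 0.0) + share
    return counts


def distribute_among_unique(unique_counts, multi_reads):
    """Share every multiple mapper in proportion to unique counts."""
    counts = dict(unique_counts)
    for genes in multi_reads:
        total = sum(unique_counts.get(gene, 0.0) for gene in genes)
        for gene in genes:
            if total > 0:
                share = unique_counts.get(gene, 0.0) / total
            else:
                # Assumption: fall back to an even split when no
                # candidate gene has uniquely mapping inserts.
                share = 1.0 / len(genes)
            counts[gene] = counts.get(gene, 0.0) + share
    return counts


# Hypothetical data: uniquely mapping inserts per gene, plus four
# inserts that map equally well to both genes.
unique_counts = {"geneA": 30.0, "geneB": 10.0}
multi_reads = [("geneA", "geneB")] * 4

even = distribute_evenly(unique_counts, multi_reads)
weighted = distribute_among_unique(unique_counts, multi_reads)
print(even)
print(weighted)
```

In this example the even split adds two inserts to each gene, while the weighted split adds three to geneA and one to geneB, following the 3:1 ratio of their unique counts.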

Coverages based only on uniquely mapping reads

These files would be used in instances where it is important to discard reads mapping to multiple genes.

  * <base|insert>.only.unique.raw
  * <base|insert>.only.unique.norm
  * <base|insert>.only.unique.scaled

Retaining functional abundance tables

To calculate relative abundances for functional profiles, we recommend dividing each feature value by the total number of mapped bases/inserts. Note that the resulting values do not necessarily sum to 1, as each gene can be annotated to multiple functional features.
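A minimal Python sketch of this calculation is shown below. The feature names, values and the total number of mapped inserts are hypothetical and do not come from a MOCAT2 run.

```python
# Hypothetical functional profile: summed insert counts per feature.
feature_values = {"K00001": 14.0, "K00002": 4.0}

# Hypothetical total number of mapped inserts for the same sample.
total_mapped = 20.0

# Divide each feature value by the total number of mapped inserts.
relative_abundance = {
    feature: value / total_mapped
    for feature, value in feature_values.items()
}

print(relative_abundance)
print(sum(relative_abundance.values()))
```

The values here sum to less than 1 because some mapped inserts belong to genes without an annotation. The sum can also exceed 1 when many genes are annotated to more than one feature.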