There are several (but not many) tools available for metagenomic assembly. The choice of tool often depends on the type of sequencing data (short reads vs. long reads), computational resources, and specific project requirements.
For a benchmark, see Goussarov et al. or Zhang et al.
💡 TL;DR: For Illumina datasets we usually recommend MEGAHIT for its reliability. An interesting tool, especially for training, is Minia, which focuses on low memory usage.
Below are some commonly used assemblers for metagenomic data:
Most of these tools can be installed via Bioconda. For example, to create a Conda environment named metadenovo containing MEGAHIT, Minia, metaMDBG, and Flye, you can run:
mamba create -n metadenovo -c conda-forge -c bioconda \
megahit minia metamdbg flye
We can use a subsampled dataset to test Minia.
To make our commands a bit more generic, we will write them using the Bash variables R1 and R2 to point to the input files.
For example, you can set the variables like this:
# Set the absolute paths to your subsampled FASTQ files
export R1=/path/to/ERR2231569_1_subsample.fastq.gz
export R2=/path/to/ERR2231569_2_subsample.fastq.gz
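Before launching an assembler it can help to confirm that the variables really point to readable files. The following is a minimal sketch, not part of the original workflow: `check_reads` is a hypothetical helper name, and it only tests that each path is non-empty and readable.

```shell
# Hypothetical helper (not part of any assembler): stop early if an input
# path is empty or does not point to a readable file.
check_reads() {
    for f in "$@"; do
        if [ -z "$f" ] || [ ! -r "$f" ]; then
            echo "ERROR: input file missing or unreadable: '$f'" >&2
            return 1
        fi
    done
    echo "OK: $# input file(s) found"
}
```

A typical call would be `check_reads "$R1" "$R2"`, run right after setting the variables.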
minia \
-in "$R1" \
-out minia-assembly/T16 \
-kmer-size 41 \
-abundance-min 3 \
-max-memory 24000 \
-nb-cores 8
-in: Input file (for paired-end data, provide both files separated by a comma)
-out: Output “basename” (directory and prefix for output files)
-kmer-size: Size of k-mers to use for assembly (common values are 21, 31, 41)
-abundance-min: Minimum k-mer abundance to consider (helps filter out errors)
-max-memory: Maximum memory to use (in MB)
-nb-cores: Number of CPU cores to use
You will find two FASTA files in the output directory: minia-assembly/T16.contigs.fa and minia-assembly/T16.unitigs.fa.
Unitigs are the maximal unambiguous paths in the de Bruijn graph: stretches where the sequence is forced because there is no branching.
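The command above passes only R1. As a sketch of the comma-separated form described for -in, the block below builds the value from both files. `join_by_comma` and `run_minia_paired` are hypothetical helper names introduced here, and the Minia options are simply copied from the command above; nothing is executed until you call `run_minia_paired`.

```shell
# Hypothetical helper: join all arguments with commas (file1,file2,...)
join_by_comma() {
    out=""
    for f in "$@"; do
        out="${out:+$out,}$f"   # add a comma only between items
    done
    printf '%s\n' "$out"
}

# Same options as the single-file command, with both read files in -in.
# Wrapped in a function so that nothing runs until it is called.
run_minia_paired() {
    minia \
        -in "$(join_by_comma "$R1" "$R2")" \
        -out minia-assembly/T16 \
        -kmer-size 41 \
        -abundance-min 3 \
        -max-memory 24000 \
        -nb-cores 8
}
```

For just two files, writing `-in "$R1,$R2"` directly is equivalent; the helper is only convenient when the list grows.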
Contigs are longer sequences formed by merging unitigs after cleaning the graph to remove bubbles, tips, and ambiguities.
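One quick way to see the difference between the two files is to count sequences and bases in each: we would expect fewer, longer sequences among the contigs. This is a minimal sketch using only awk; `fasta_summary` is a hypothetical helper name, and it assumes plain (uncompressed) FASTA input.

```shell
# Hypothetical helper: print "<file> <number of sequences> <total bases>"
# (tab-separated) for a plain-text FASTA file.
fasta_summary() {
    awk -v name="$1" '
        /^>/ { n++; next }              # header line: one more sequence
             { bases += length($0) }    # sequence line: add its length
        END  { printf "%s\t%d\t%d\n", name, n, bases }
    ' "$1"
}
```

For example, run `fasta_summary minia-assembly/T16.unitigs.fa` and `fasta_summary minia-assembly/T16.contigs.fa` and compare the two rows.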
Fields:
>0, >1, >2 - Contig identifier
LN:i:487 - Length in base pairs (integer)
KC:i:2566 - K-mer count: total number of k-mers in this contig (integer)
km:f:5.741 - K-mer mean abundance: average coverage/depth (float)
L:+:6051:+ - Links to other contigs in the assembly graph, format: “L:orientation_this:target_contig:orientation_target”
MEGAHIT's syntax is similar, but it takes paired-end reads via separate arguments. The output directory will contain all the files produced.
megahit -1 "$R1" -2 "$R2" -o megahit-assembly/T16/ -t 8
megahit-assembly/T16/final.contigs.fa
multi=.... and the connectivity of a contig in the assembly graph. flag=1 means the contig is standalone, flag=2 means a looped path, and flag=0 is used for all other contigs.
We can use QUAST to compare the assemblies.
# Generic syntax
quast -o $OUTDIR FASTA_1 FASTA_2 ...
The bare metrics can be obtained with SeqFu:
seqfu stats -ntb minia-assembly/*.fa
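To make one of these metrics concrete, N50 can be computed by hand: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of all bases. The sketch below is an illustration with awk and sort, not a replacement for SeqFu or QUAST; `fasta_n50` is a hypothetical helper name and it assumes uncompressed FASTA.

```shell
# Hypothetical helper: print the N50 of a plain-text FASTA file.
fasta_n50() {
    # Step 1: one length per sequence (multi-line sequences are summed)
    awk '
        /^>/ { if (len) print len; len = 0; next }
             { len += length($0) }
        END  { if (len) print len }
    ' "$1" |
    # Step 2: longest first
    sort -nr |
    # Step 3: walk down the list until half of the total bases is covered
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= total / 2) { print lens[i]; exit }
            }
        }
    '
}
```

Calling `fasta_n50 minia-assembly/T16.contigs.fa` should agree with the N50 column reported by the dedicated tools, provided they apply no minimum-length filter.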
Each assembler uses a different naming scheme for the contigs it produces. Some downstream tools (especially Anvi’o) are picky about which characters are allowed.
Check the output file to see what information the assembler encoded in the contig names.
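A minimal sketch of such a check, using only awk: `show_headers` prints the first few sequence names, and `count_megahit_flags` tallies the flag= values. Both function names are hypothetical, and the second one assumes that MEGAHIT headers carry space-separated key=value fields after the contig name.

```shell
# Hypothetical helper: print the first N headers (default 3) without the ">"
show_headers() {
    awk -v n="${2:-3}" '
        /^>/ { print substr($0, 2); if (++c >= n + 0) exit }
    ' "$1"
}

# Hypothetical helper: count how many headers carry each flag= value.
# Assumes space-separated key=value fields after the contig name.
count_megahit_flags() {
    awk '
        /^>/ { for (i = 2; i <= NF; i++) if ($i ~ /^flag=/) n[$i]++ }
        END  { for (k in n) print k "\t" n[k] }
    ' "$1" | sort
}
```

For example, `show_headers minia-assembly/T16.contigs.fa` and `count_megahit_flags megahit-assembly/T16/final.contigs.fa`.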
It’s safe to rename them, and using SeqFu we can also keep a conversion table:
# This will produce `rename_report.txt`: a two-column table
seqfu cat --anvio $ORIGINAL_CONTIGS > renamed_contigs.fa
The --anvio option is a shortcut for:
seqfu cat -p c_ -s -z --zero-pad 12 --report rename_report.txt $ORIGINAL_CONTIGS > renamed_contigs.fa
Where:
-p "c_" means use the string c_ as the prefix of all names
-s (or --strip-comments) means remove any comment
-z (or --strip-name) means remove the original name
--zero-pad 12 means pad the progressive number with leading zeroes to 12 digits
--report rename_report.txt means write the conversion table to that file
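After renaming, a quick sanity check is to confirm that every header follows the expected pattern, that is, c_ followed by a 12-digit number and nothing else. The sketch below is an assumption-based check written with awk; `count_bad_names` is a hypothetical helper name and is not part of SeqFu.

```shell
# Hypothetical check: count headers that are NOT exactly "c_" + 12 digits.
# Prints 0 when every sequence name follows the expected pattern.
count_bad_names() {
    awk '
        /^>/ {
            name = substr($0, 2)
            # 14 characters in total: "c_" (2) + zero-padded counter (12)
            if (name !~ /^c_[0-9]+$/ || length(name) != 14) bad++
        }
        END { print bad + 0 }
    ' "$1"
}
```

Running `count_bad_names renamed_contigs.fa` should print 0 if the renaming worked as described.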