Every miner uses a mix of signals to identify a contig as viral. The enormous diversity of viruses, the existence of very short phage genomes, and the lack of universal markers make this process very challenging.

Now that we have some predicted viruses, let’s look at their completeness! CheckV is a tool developed to assess the quality and completeness of viral genomes assembled from viromes/metagenomes.

CheckV, like geNomad, is a pipeline that we will run end-to-end to:

  • Remove host contamination from proviruses (using HMMs)
  • Estimate genome completeness
  • Predict complete genomes (direct terminal repeats, proviruses, inverted terminal repeats)

Installation

We can once again install using mamba/conda as below:

mamba create -n checkv -c conda-forge -c bioconda checkv
conda activate checkv

You will find the pre-downloaded database in $DB/checkv-db-v1.5/

:information_source: How do I download the CheckV database at home?

If you are running this tutorial outside the EBAME VM, you will need to download the CheckV database:

  checkv download_database ./

Don’t forget to update your CheckV command to point to your database location!
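
If you prefer not to pass -d on every call, CheckV can also read the database path from the CHECKVDB environment variable. A minimal sketch, assuming the database was unpacked under a hypothetical $HOME/databases/ directory (adjust the path to your own setup):

```shell
# Hypothetical location: adjust to wherever `checkv download_database` put the files.
export CHECKVDB="$HOME/databases/checkv-db-v1.5"

# Warn early if the directory is missing, rather than failing halfway through a run.
if [ -d "$CHECKVDB" ]; then
    echo "CheckV database found at $CHECKVDB"
else
    echo "Warning: no directory at $CHECKVDB" >&2
fi
```

Adding the export line to your shell profile makes the setting persist across sessions.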

Running CheckV

Now remember to activate the environment and invoke the help message (similar to geNomad above…).

:pencil2: Try to run CheckV on the predicted viruses from geNomad.

checkv end_to_end ~/genomad-out/genomad_votus.fna ~/checkv-out -d $DB/checkv-db-v1.5/ -t 8

The arguments are the input <ASSEMBLY>, the <OUTPUT_DIR>, -d <DB_DIR> for the database, and -t for the number of threads.
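
The end_to_end command chains the steps listed earlier. As a rough sketch, the same steps can also be run one at a time with CheckV's individual subcommands, which helps when only one step needs re-running. The function name run_checkv_steps and the CHECKV_BIN override are hypothetical names introduced here for illustration; the -d option on the first two modules is assumed to behave as it does for end_to_end.

```shell
# Hypothetical helper: run the CheckV modules one by one instead of end_to_end.
# Arguments: <ASSEMBLY> <OUTPUT_DIR> <DB_DIR> [threads, default 8]
# CHECKV_BIN can be set to another program (e.g. "echo") for a dry run.
run_checkv_steps() {
    input="$1"; outdir="$2"; db="$3"; threads="${4:-8}"
    bin="${CHECKV_BIN:-checkv}"

    # Each step runs only if the previous one succeeded.
    "$bin" contamination "$input" "$outdir" -d "$db" -t "$threads" &&
    "$bin" completeness "$input" "$outdir" -d "$db" -t "$threads" &&
    "$bin" complete_genomes "$input" "$outdir" &&
    "$bin" quality_summary "$input" "$outdir"
}
```

For example, `CHECKV_BIN=echo run_checkv_steps ~/genomad-out/genomad_votus.fna ~/checkv-out "$DB/checkv-db-v1.5/"` prints the four commands without running them; dropping the CHECKV_BIN=echo prefix runs them for real.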

Now you can check the output files.

With ls and find you can list them, and if you find a table you can inspect it interactively with vd (visidata, press q to quit).

In the summary columns, AAI stands for “average amino acid identity”.

To list the files:

ls -lh ~/checkv-out/
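
To list only the tables, find can filter by file extension. A small sketch, where list_tables is a hypothetical helper name and the default directory is assumed from the command used in this tutorial:

```shell
# Hypothetical helper: list the TSV tables under a CheckV output directory.
# Argument: output directory (assumed default: ~/checkv-out)
list_tables() {
    find "${1:-$HOME/checkv-out}" -type f -name '*.tsv' | sort
}
```

Calling `list_tables` with no argument searches ~/checkv-out, including its subdirectories.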

To inspect the summary table:

vd ~/checkv-out/quality_summary.tsv

Similarly, you can check the other TSV files.
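
For a quick non-interactive overview, the quality tiers in the summary table can be tallied with awk. This is a minimal sketch: count_quality_tiers is a hypothetical helper name, and it assumes a tab-separated file with a header row containing a checkv_quality column.

```shell
# Hypothetical helper: count contigs per CheckV quality tier.
# Argument: path to quality_summary.tsv
count_quality_tiers() {
    awk -F'\t' '
        # Header row: find the position of the checkv_quality column.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "checkv_quality") col = i; next }
        # Data rows: tally the tier found in that column.
        col { n[$col]++ }
        END { for (q in n) printf "%s\t%d\n", q, n[q] }
    ' "$1" | sort
}
```

Run it as `count_quality_tiers ~/checkv-out/quality_summary.tsv` to get one line per tier with its count.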

Column Descriptions

| Column | Explanation |
| --- | --- |
| contig_id | Unique identifier for your viral sequence |
| contig_length | Total length of the sequence in base pairs |
| provirus | Whether the sequence is an integrated provirus (Yes/No) |
| proviral_length | If provirus=Yes, length of just the viral region (host contamination removed) |
| gene_count | Total number of genes detected on the contig |
| viral_genes | Number of genes identified as viral |
| host_genes | Number of genes identified as host/bacterial |
| checkv_quality | Quality tier: Complete, High-quality (>90%), Medium-quality (50-90%), Low-quality (<50%), or Not-determined |
| miuvig_quality | MIUViG standard classification: High-quality (>90% complete) or Genome-fragment (<90% complete) |
| completeness | Estimated % completeness of the viral genome (0-100%) |
| completeness_method | How completeness was calculated: AAI-based (amino acid identity to reference genomes) or HMM-based (viral protein families) |
| contamination | % of the contig that is non-viral (host contamination) |
| kmer_freq | Average k-mer frequency; values >1 suggest the genome appears multiple times in the contig (concatemers) |
| warnings | Any quality issues detected (e.g., “no viral genes detected”) |

A few notes on interpreting these columns:

  • AAI-based (high-confidence): Most accurate method; compares your sequence to complete viral genomes in the database
  • HMM-based: Used when no close reference exists; less precise but more sensitive
  • kmer_freq > 1: Indicates potential assembly artifacts (genome repeated in contig)
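
These rules of thumb can be applied directly to the summary table. The sketch below keeps contigs whose checkv_quality is Complete or High-quality and marks those whose kmer_freq exceeds a threshold. The helper name select_high_quality and the default threshold of 1 are assumptions made for illustration; pick a cutoff that suits your data.

```shell
# Hypothetical helper: list Complete/High-quality contigs and mark possible concatemers.
# Arguments: path to quality_summary.tsv, kmer_freq threshold (assumed default: 1)
select_high_quality() {
    awk -F'\t' -v max_kmer="${2:-1}" '
        # Header row: look up the columns by name so their order does not matter.
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "contig_id") id = i
                if ($i == "checkv_quality") q = i
                if ($i == "kmer_freq") k = i
            }
            next
        }
        # Data rows: keep the two best tiers and compare kmer_freq to the threshold.
        q && ($q == "Complete" || $q == "High-quality") {
            flag = (($k + 0) > (max_kmer + 0)) ? "check_kmer_freq" : "ok"
            printf "%s\t%s\t%s\n", $id, $q, flag
        }
    ' "$1"
}
```

For example, `select_high_quality ~/checkv-out/quality_summary.tsv 1.5` uses a looser threshold of 1.5 instead of the default.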
