Every miner uses a mix of signals to decide whether a contig is viral. The enormous diversity of viruses, the existence of very short phage genomes, and the lack of universal marker genes make this process very challenging.
Now that we have some predicted viruses, let’s look at their completeness! CheckV is a tool developed to assess the quality and completeness of viral genomes assembled from viromes/metagenomes.
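CheckV, like geNomad, is a pipeline that we will run end-to-end to remove host contamination from proviruses, estimate genome completeness, identify closed genomes, and summarize the quality of each contig.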
We can once again install using mamba/conda as below:
mamba create -n checkv -c conda-forge -c bioconda checkv
conda activate checkv
You will find the pre-downloaded database in $DB/checkv-db-v1.5/
How to download the CheckV database at home?
If you are running this tutorial outside the EBAME VM, you will need to download the CheckV database yourself:
checkv download_database ./
Don’t forget to update your CheckV command so that it points to the location of your database!
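As an alternative to passing `-d` every time, CheckV can read the database location from the `CHECKVDB` environment variable. The lines below are a minimal sketch: the folder name `checkv-db-v1.5` is an assumption based on the version used in this tutorial, so use whatever folder name your download actually produced.

```bash
# Assumed location: "checkv download_database ./" was run in the current
# directory and created a versioned folder there. Adjust the name if needed.
export CHECKVDB="$PWD/checkv-db-v1.5"

# Warn (without failing) if the folder is not where we expect it.
[ -d "$CHECKVDB" ] || echo "No database found at $CHECKVDB; adjust the path." >&2
```

With `CHECKVDB` exported in your shell, the `-d` option can be left out of the CheckV command.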
Now remember to activate the environment and invoke the help message (similar to geNomad above…).
Try to write a command that runs CheckV on the viruses predicted by geNomad.
checkv end_to_end ~/genomad-out/genomad_votus.fna ~/checkv-out -d $DB/checkv-db-v1.5/ -t 8
The parameters are <ASSEMBLY> (the input FASTA), <OUTPUT_DIR>, -d <DB_DIR> for the database directory, and -t for the number of threads.
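To make the call easier to adapt, the same command can be written with shell variables and a few sanity checks. This is only a sketch: the variable names are made up for illustration, and the paths are the ones used in the command above.

```bash
# Illustrative variable names; the values mirror the command shown above.
assembly="$HOME/genomad-out/genomad_votus.fna"   # <ASSEMBLY>
outdir="$HOME/checkv-out"                        # <OUTPUT_DIR>
db="${DB:-.}/checkv-db-v1.5"                     # -d <DB_DIR>
threads=8                                        # -t

# Only launch CheckV if the tool, a non-empty input, and the database exist.
if command -v checkv >/dev/null 2>&1 && [ -s "$assembly" ] && [ -d "$db" ]; then
    checkv end_to_end "$assembly" "$outdir" -d "$db" -t "$threads"
else
    echo "checkv, the input FASTA, or the database is missing; nothing was run." >&2
fi
```

The checks simply avoid starting a run that would fail immediately because the environment is not activated or a path is wrong.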
Now you can check the output files.
With ls and find you can list them, and if you find a table you can inspect it interactively with vd (visidata, press q to quit).
In the summary columns, AAI stands for “average amino acid identity”.
To list the files:
ls -lh ~/checkv-out/
To inspect the summary table:
vd ~/checkv-out/quality_summary.tsv
Similarly, you can check the other TSV files.
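If you prefer a non-interactive overview, the quality tiers can be tallied with standard shell tools. The sketch below is one possible way to do it, assuming `quality_summary.tsv` has a header line containing a `checkv_quality` column; the function name `tally_tiers` is made up for this example.

```bash
# tally_tiers FILE: count contigs per value of the checkv_quality column.
# The column is located by name in the header, so its position does not matter.
tally_tiers() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "checkv_quality") col = i; next }
        col     { count[$col]++ }
        END     { for (tier in count) printf "%s\t%d\n", tier, count[tier] }
    ' "$1" | sort
}

# Path assumed from the CheckV command above; adjust if your output is elsewhere.
summary="$HOME/checkv-out/quality_summary.tsv"
if [ -f "$summary" ]; then
    tally_tiers "$summary"
fi
```

The output is one line per tier with the number of contigs assigned to it.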
| Column | Explanation |
|---|---|
| contig_id | Unique identifier for your viral sequence |
| contig_length | Total length of the sequence in base pairs |
| provirus | Whether the sequence is an integrated provirus (Yes/No) |
| proviral_length | If provirus=Yes, length of just the viral region (host contamination removed) |
| gene_count | Total number of genes detected on the contig |
| viral_genes | Number of genes identified as viral |
| host_genes | Number of genes identified as host/bacterial |
| checkv_quality | Quality tier: Complete, High-quality (>90%), Medium-quality (50-90%), Low-quality (<50%), or Not-determined |
| miuvig_quality | MIUViG standard classification: High-quality (>90% complete) or Genome-fragment (<90% complete) |
| completeness | Estimated % completeness of the viral genome (0-100%) |
| completeness_method | How completeness was calculated: AAI-based (amino acid identity to reference genomes) or HMM-based (viral protein families) |
| contamination | % of the contig that’s non-viral (host contamination) |
| kmer_freq | Average k-mer frequency - values >1 suggest the genome appears multiple times (concatemers) |
| warnings | Any quality issues detected (e.g., “no viral genes detected”) |
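
Using the columns described in the table, a common next step is to keep only the most complete contigs. The following is a hedged sketch rather than a CheckV feature: `select_contigs` is a made-up helper, it assumes the header contains `contig_id` and `completeness`, and the default cutoff of 90 roughly mirrors the High-quality tier (which is defined as >90%).

```bash
# select_contigs FILE [MIN_COMPLETENESS]: print contig_id and completeness for
# contigs whose completeness is at least the threshold (default: 90).
# Rows where completeness is "NA" are skipped.
select_contigs() {
    awk -F'\t' -v min="${2:-90}" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $col["completeness"] != "NA" && $col["completeness"] + 0 >= min + 0 {
            print $col["contig_id"] "\t" $col["completeness"]
        }
    ' "$1"
}

# Path assumed from the CheckV command above; adjust if your output is elsewhere.
summary="$HOME/checkv-out/quality_summary.tsv"
if [ -f "$summary" ]; then
    select_contigs "$summary" 90
fi
```

Passing a different second argument, for example `select_contigs "$summary" 50`, would also include contigs in the Medium-quality range.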