CheckV offers two scripts to perform the dereplication.
For EBAME 10 we packaged the two scripts (and a few more) into a Python package called votuderep, which is organised into subcommands.

mamba install -c conda-forge -c bioconda votuderep
The tool provides two subcommands for downloading the training data and the databases:
votuderep trainingdata -o ~/virome/
votuderep getdbs -o ~/db/
votuderep derep --help
Usage: votuderep derep [OPTIONS]
Dereplicate vOTUs using BLAST and ANI clustering.
This command: 1. Creates a BLAST database from input sequences 2.
Performs all-vs-all BLAST comparison 3. Calculates ANI and coverage for
sequence pairs 4. Clusters sequences by ANI using greedy algorithm 5.
Outputs cluster representatives (longest sequences)
The algorithm selects the longest sequence from each cluster as the
representative, effectively removing shorter redundant sequences.
╭─ Options ──────────────────────────────────────────────────────────────╮
│ * --input -i FILE Input FASTA file containing vOTUs │
│ [required] │
│ --output -o FILE Output FASTA file with dereplicated vOTUs │
│ --threads -t INTEGER Number of threads for BLAST │
│ --tmp TEXT Directory for temporary files (default: │
│ $TEMP or /tmp or ./) │
│ --min-ani FLOAT Minimum ANI to consider two vOTUs as the │
│ same │
│ --min-tcov FLOAT Minimum target coverage to consider two │
│ vOTUs as the same │
│ --keep Keep the temporary directory after │
│ completion │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────╯
The input is a FASTA file with the predicted vOTUs (-i FASTA), and the dereplicated sequences are written to the file given with -o (for example, -o dereplicated.fasta).
As the help shows, you can tweak the clustering thresholds with --min-ani and --min-tcov,
and you can set the number of threads used by BLAST with -t.
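The greedy clustering step listed in the help text (steps 4 and 5) can be illustrated with a short Python sketch. This is not votuderep's actual code: the `Pair` record, the `greedy_cluster` function and its parameter names are hypothetical, and the default thresholds of 95 and 85 are illustrative values, not necessarily the tool's defaults. The sketch assumes ANI and target coverage have already been computed for each sequence pair from the BLAST results.

```python
from collections import namedtuple

# Hypothetical record for one sequence pair: query, target, ANI (%) and
# target coverage (%). Field names are illustrative, not votuderep's.
Pair = namedtuple("Pair", "query target ani tcov")


def greedy_cluster(lengths, pairs, min_ani=95.0, min_tcov=85.0):
    """Return {representative: [members]} using longest-first greedy clustering."""
    # For each query, collect the targets that pass both thresholds.
    hits = {}
    for p in pairs:
        if p.query != p.target and p.ani >= min_ani and p.tcov >= min_tcov:
            hits.setdefault(p.query, set()).add(p.target)

    clusters = {}
    assigned = set()
    # Visit sequences from longest to shortest (ties broken by name).
    for seq in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq in assigned:
            continue
        # An unassigned sequence becomes the representative of a new cluster.
        members = [seq]
        assigned.add(seq)
        for target in sorted(hits.get(seq, ())):
            if target not in assigned:
                members.append(target)
                assigned.add(target)
        clusters[seq] = members
    return clusters


# Toy example: sequence lengths and pre-computed pairwise values.
lengths = {"a": 1000, "b": 800, "c": 500, "d": 900}
pairs = [
    Pair("a", "b", 97.0, 90.0),  # passes both thresholds
    Pair("b", "c", 99.0, 95.0),  # passes, but "b" never becomes a representative
    Pair("a", "d", 90.0, 99.0),  # fails the ANI threshold
]
representatives = sorted(greedy_cluster(lengths, pairs))
```

Because sequences are visited from longest to shortest, each cluster's representative is always its longest member, which matches the behaviour described in the help text. Note that in this sketch a sequence only joins the cluster of a representative it matches directly, so "c" stays on its own even though it matches "b".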
CheckV generates a table with a summary of its predictions.
With the filter subcommand you can use that table to extract from the corresponding FASTA file only the sequences that meet thresholds on CheckV quality metrics, including length, completeness, contamination, and quality classification.
Inputs
From highest to lowest confidence:
Additionally:
-o, --output - Output FASTA file (default: STDOUT)
-m, --min-len - Minimum contig length
--max-len - Maximum contig length (0 = unlimited)
--min-quality [low|medium|high] - Minimum quality threshold (default: low)
    low: Keeps Low-quality and above
    medium: Keeps Medium-quality and above
    high: Keeps High-quality and Complete only
--complete - Keep only Complete genomes
--exclude-undetermined - Exclude Not-determined sequences
-c, --min-completeness - Minimum completeness percentage
--max-contam - Maximum contamination percentage
--provirus - Keep only proviruses
--no-warnings - Exclude contigs with warnings
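To make the effect of these options concrete, here is a minimal Python sketch of how some of them could be applied to the rows of a CheckV quality summary table. It is an illustration, not votuderep's implementation: `select_contigs` and its parameters are hypothetical names that mirror the flags above, and only a subset of the options is covered. The column names (`contig_id`, `contig_length`, `checkv_quality`, `completeness`, `contamination`) are those of CheckV's `quality_summary.tsv`. How Not-determined and "NA" values interact with the thresholds is an assumption, marked in the comments.

```python
import csv
import io

# Rank of CheckV quality tiers; a higher rank means higher confidence.
QUALITY_RANK = {"Low-quality": 1, "Medium-quality": 2, "High-quality": 3, "Complete": 4}
MIN_RANK = {"low": 1, "medium": 2, "high": 3}


def select_contigs(table, min_len=0, max_len=0, min_quality="low",
                   exclude_undetermined=False, min_completeness=None,
                   max_contam=None):
    """Return IDs of contigs in a quality_summary.tsv stream that pass all filters."""
    keep = []
    for row in csv.DictReader(table, delimiter="\t"):
        length = int(row["contig_length"])
        # max_len == 0 means "unlimited", as in the option list.
        if length < min_len or (max_len and length > max_len):
            continue
        quality = row["checkv_quality"]
        if quality == "Not-determined":
            # Assumption: undetermined contigs pass unless explicitly excluded.
            if exclude_undetermined:
                continue
        elif QUALITY_RANK[quality] < MIN_RANK[min_quality]:
            continue
        if min_completeness is not None:
            # Assumption: rows with completeness "NA" are dropped once a
            # minimum completeness is requested.
            comp = row["completeness"]
            if comp == "NA" or float(comp) < min_completeness:
                continue
        if max_contam is not None:
            contam = row["contamination"]
            if contam != "NA" and float(contam) > max_contam:
                continue
        keep.append(row["contig_id"])
    return keep


# Toy table with only the columns the sketch reads.
TABLE = (
    "contig_id\tcontig_length\tcheckv_quality\tcompleteness\tcontamination\n"
    "c1\t45000\tComplete\t100.0\t0.0\n"
    "c2\t12000\tMedium-quality\t62.5\t0.0\n"
    "c3\t3000\tLow-quality\t8.1\t0.0\n"
    "c4\t7000\tNot-determined\tNA\t0.0\n"
    "c5\t30000\tHigh-quality\t93.0\t12.0\n"
)
medium_or_better = select_contigs(io.StringIO(TABLE), min_quality="medium",
                                  exclude_undetermined=True)
```

The returned IDs could then be used to pull the matching records out of the FASTA file; the real subcommand does both steps for you.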