It is commonplace in virome analyses to cluster similar sequences together at the approximate level of species. These clustered sequences are referred to as viral operational taxonomic units (vOTUs). To do this, we’re going to use a combination of BLAST and python scripts that are available alongside CheckV.
To download these scripts you can run the following:
wget -O ~/bin/aniclust.py \
https://bitbucket.org/berkeleylab/checkv/raw/51a5293f75da04c5d9a938c9af9e2b879fa47bd8/scripts/aniclust.py
wget -O ~/bin/anicalc.py \
https://bitbucket.org/berkeleylab/checkv/raw/51a5293f75da04c5d9a938c9af9e2b879fa47bd8/scripts/anicalc.py
# Set them executable with "chmod +x aniclust.py", "chmod +x anicalc.py". And put the scripts in your path.
chmod +x ~/bin/*.py
This might look daunting, but let’s work through it one command at a time:
First, create a blast+ database to be able to align sequences against it:
# DO NOT COPY THIS COMMAND
makeblastdb -in ${my_seqs_fna} -dbtype nucl -out ${output}
# example : makeblastdb -in genomad-out/genomad_votus.fna -dbtype nucl -out genomad_votus.db
Next, use megablast from blast+ package to perform all-vs-all blastn of sequences (as the query and the target are both your vOTUs):
# DO NOT COPY THIS COMMAND
blastn -query ${my_seqs_fna} -db ${output} \
-outfmt '6 std qlen slen' -max_target_seqs 10000 \
-out ${out_tsv_file} -num_threads 8
# example : blastn -query genomad-out/genomad_votus.fna -db genomad_votus.db -outfmt '6 std qlen slen' -max_target_seqs 10000 -out genomad_blast.tsv -num_threads 8
Next, calculate pairwise ANI by combining local alignments between sequence pairs:
# DO NOT COPY THIS COMMAND
anicalc.py -i ${out_tsv_file} -o ${ani_tsv}
# example python scripts/anicalc.py -i genomad_blast.tsv -o genomad_ani.tsv
Finally, perform UCLUST-like clustering using the MIUVIG recommended-parameters (95% ANI + 85% AF):
# DO NOT COPY THIS COMMAND
aniclust.py --fna ${my_seqs_fna} --ani ${ani_tsv} \
--out ${clusters_tsv} \
--min_ani 95 --min_tcov 85 --min_qcov 0
# example : python scripts/aniclust.py --fna genomad-out/genomad_votus.fna --ani genomad_ani.tsv --out vOTU_genomad_c95.tsv --min_ani 95 --min_tcov 85 --min_qcov 0
Take a look at the my_clusters.tsv
file, this contains two tab-delimitied columns (i.e. separated by tabs). The first is a representative sequence for each cluster, this is what we’re interested in. To get them, we’ll use an awk command to get the first column and write it to a new file called my_vOTUs.txt
:
# DO NOT COPY THIS COMMAND
awk '{print $1}' ${clusters_tsv} > VOTUS_FINAL.tsv
example : awk '{print $1}' vOTU_genomad_c95.tsv > vOTU_genomad_representatives.tsv
We can then use a handy utility in seqfu
called seqfu list
, which is already installed! This will take a list of sequence IDs and extract the matching seqs from a fasta file:
# DO NOT COPY THIS COMMAND
seqfu list my_vOTUs.txt <genomad_viruses.fna> > <my_vOTUs.fa>
# seqfu list vOTU_genomad_representatives.tsv genomad-out/genomad_viruses.fna > vOTUs_representatives.fa
These steps can be naturally automated, for example with a Python script.
You would need to download it in the same directory where aniclust.py
and anicalc.py
are, and run from within the
checkV environment, as it will use it’s dependencies (and SeqFu in addition).