Anvi’o on the QIB HPC
Written on June 3rd, 2022 by Andrea Telatin
Anvi’o workflow
Anvi’o is a collection of tools to analyse, integrate and visualise microbial genomics datasets.
Interactive visualisations are best done locally on your own computer, where you will not experience issues related to the connection to the machine hosting Anvi’o.
The non-interactive steps can require computational power that is commonly available on High Performance Computing (HPC) clusters or dedicated servers (VMs, workstations, etc.), especially when they need to run for a (relatively) long time.
Here we describe how to run the non-interactive steps on the HPC available to researchers at the Quadram Institute Bioscience.
Anvi’o on the HPC
You will need appropriate training from Research Computing to use the NBI HPC.
Core Bioinformatics developed a utility called NBI Slurm that simplifies some steps (including creating the Anvi’o package!).
- Activate nbi-slurm
source package nbi-slurm
- Check for Anvio packages
One of the tools available is shelf, which allows for a quick search of packages:
shelf anvio
You will see that - thanks to this Workshop - Core Bioinformatics packaged anvio 8:
# Recommended
anvio-8 source package /nbi/software/testing/bin/anvio-8
# Anvio and its databases
anvio-8-db source package /nbi/software/testing/bin/anvio-8-db
Both packages are experimental: anvio-8-db carries the databases, while anvio-8 lets you use manually downloaded databases.
Example
Following the Vibrio jasicida pangenome workflow, we downloaded the FASTA files and now want to filter them by length.
We can loop over the genomes and use runjob to submit each job to the cluster.
In general, you can run any command while specifying the required resources, for example:
runjob -c 8 -m 12G "spades.py -s reads.fq..."
# You should be in an interactive session!
source package nbi-slurm
source package anvio-8
# Generate a list of genome IDs (cutting on the "_")
ls *fasta | awk 'BEGIN{FS="_"}{print $1}' > genomes.txt
# Loop through the genomes
for g in $(cat genomes.txt)
do
runjob --run -w logs -c 4 -m 16G -t 2h -n anvi-format-$g \
"anvi-script-reformat-fasta ${g}_scaffolds.fasta --min-len 2500 \
--simplify-names -o ${g}_scaffolds_2.5K.fasta"
done
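The genome list above relies on the part of each file name before the first underscore. As a quick way to see what ends up in genomes.txt, the sketch below runs the same awk command on a few made-up file names in a temporary directory. The names `G01_scaffolds.fasta`, `G02_scaffolds.fasta` and `my_strain_scaffolds.fasta` are hypothetical and only serve as illustration.

```shell
# Sketch: reproduce the genome-ID extraction on hypothetical file names.
workdir=$(mktemp -d)            # scratch directory, removed at the end
cd "$workdir"

# Hypothetical inputs following the <genomeID>_scaffolds.fasta pattern
touch G01_scaffolds.fasta G02_scaffolds.fasta my_strain_scaffolds.fasta

# Same command as in the text: keep the field before the first "_"
ls *fasta | awk 'BEGIN{FS="_"}{print $1}' > genomes.txt

# Join the IDs on one line for a quick look
ids=$(paste -sd, genomes.txt)
echo "$ids"

cd - >/dev/null
rm -rf "$workdir"
```

Note that an ID which itself contains an underscore (here `my_strain`) is cut at the first underscore, so `${g}_scaffolds.fasta` would no longer match the original file. It is worth checking genomes.txt before submitting any job.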
If one job depends on a previous one we can chain them like this:
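# Loop through the genomes, chaining two jobs per genome
for g in $(cat genomes.txt)
do
# Submit the contigs database job and capture its job ID
JOB_ID=$(runjob -r -c 4 -m 16G --run -w logs -n db-$g \
"anvi-gen-contigs-database -f ${g}_scaffolds_2.5K.fasta -o V_jascida_${g}.db --num-threads 4 -n Vj_${g}")
# Submit the HMM job so that it starts after the database job
runjob -c 8 -m 16G --run -w logs -n hmms-$g --after $JOB_ID \
"anvi-run-hmms -c V_jascida_${g}.db --num-threads 4"
done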
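For readers who want to see what such a chain looks like without the wrapper, here is a minimal sketch using plain Slurm commands. It assumes that the `--after` option corresponds to a Slurm job dependency; the `afterok` condition used here is my choice and not necessarily what runjob does. The genome IDs `G01` and `G02` are hypothetical, and the small `sbatch` stand-in only exists so the sketch can be dry-run on a machine without Slurm.

```shell
# Sketch: the same two-step chain written with plain Slurm commands.
# Assumption: "afterok" is used as the dependency type (not confirmed for runjob).

# Stand-in for machines without Slurm: prints a fake job ID, submits nothing.
if ! command -v sbatch >/dev/null 2>&1; then
  sbatch() { echo "1000"; }
fi

submitted=0
for g in G01 G02; do   # hypothetical IDs; in practice: $(cat genomes.txt)
  # --parsable makes sbatch print only the job ID (optionally followed by ";cluster")
  db_job=$(sbatch --parsable -J "db-$g" -c 4 --mem=16G \
    --wrap "anvi-gen-contigs-database -f ${g}_scaffolds_2.5K.fasta -o V_jascida_${g}.db --num-threads 4 -n Vj_${g}")
  db_job=${db_job%%;*}   # drop the cluster name if present

  # afterok: start only if the first job finished with exit code 0
  hmm_job=$(sbatch --parsable -J "hmms-$g" -c 4 --mem=16G \
    --dependency="afterok:${db_job}" \
    --wrap "anvi-run-hmms -c V_jascida_${g}.db --num-threads 4")

  echo "$g: contigs db job ${db_job}, HMM job ${hmm_job%%;*}"
  submitted=$((submitted + 2))
done
```

With `afterok`, the HMM job stays pending until the contigs database job completes successfully, and is not started if that job fails.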
Databases
We are downloading the databases to our staging partition:
/qib/platforms/Informatics/transfer/outgoing/databases/anvio-8/