Anvio in the QIB HPC

Anvi’o workflow

Anvi’o is a collection of tools to analyse, integrate and visualise microbial genomics datasets.

Interactive visualisations are better done locally in your computer: you will not eexperience issues related to the connection to the machine hosting Anvi’o.

The non-interactive steps can require some computational power that is commonly available in High Perfomance Computers (HPCs) or dedicated servers (VMs, workstations etc.), especially when they need to run for a (relatively) long time.

Here we describe how to run non-interactive steps on the HPC available to the researchers at the Quadram Institute Bioscience.

Anvi’o on the HPC

You will need appropriate training from Research Computing to use the NBI HPC

Core Bioinformatics developed a utility tool called NBI Slurm that simplifies some steps (including making Anvio package!).

  • Activate nbi-slurm
source package nbi-slurm
  • Check for Anvio packages

One of the tools available is shelf, that allows for a quick search of packages

shelf anvio

You will see that - thanks to this Workshop - Core Bioinformatics packaged anvio 8:

# Recommended
anvio-8     source package /nbi/software/testing/bin/anvio-8
# Anvio and its databases
anvio-8-db  source package /nbi/software/testing/bin/anvio-8-db

Both packages are experimental, one carries dbs the other will allow you to use the manually downloaded databases.

Example

Following the Vibrioi jascida pangenome, we downloaded the FASTA files and we want to filter them by length.

We can loop and use runjob to submit the job to the cluster.

Basically you can run any command specifying the required resource like:

runjob -c 8 -m 12G "spades.py -s reads.fq..."
# You should be in an interactive session!
source package nbi-slurm
source package anvio-8

# Generate a list of genome IDs (cutting on the "_")
ls *fasta | awk 'BEGIN{FS="_"}{print $1}' > genomes.txt

# Loop through the genomes
for g in $(cat genomes.txt)
do
    runjob --run -w logs -c 4 -m 16G -t 2h  -n anvi-format-$g \
      "anvi-script-reformat-fasta ${g}_scaffolds.fasta --min-len 2500 \
      --simplify-names -o ${g}_scaffolds_2.5K.fasta"
done

If one job depends on a previous one we can chain them like this:

for g in `cat genomes.txt`
do
    JOB_ID=$(runjob -r -c 4 -m 16G --run -w logs -n db-$g "anvi-gen-contigs-database -f ${g}_scaffolds_2.5K.fasta -o V_jascida_${g}.db --num-threads 4 -n Vj_${g}")
    runjob -c 8 -m 16G --run -w logs "anvi-run-hmms -c V_jascida_${g}.db --num-threads 4" -n hmms-8 --after $JOB_ID
done

Databases

We are downloading the databases in our staging partition:

/qib/platforms/Informatics/transfer/outgoing/databases/anvio-8/