Removing custom hosts from metagenomics

Hostile is a fantastic tool for removing host reads from whole metagenome shotgun datasets. It’s very user-friendly and has pre-built databases for human and murine reads decontaminating.
In this post we focus on using a custom reference database (in this case, Canis lupus familiaris).
A key step is masking the reference, that is removing from the reference regions similar to bacterial (viral, fungal…) regions, as we do not want to risk discarding prokaryotic reads.
Getting bacterial reads
Hostile shares a table of bacterial reference genomes and we will use it to download the genomes to treat as false positive matches. Of course, each one can add more sequences to mask.
From that page we can download a table like:
Assembly | BioSample | Strain | Taxonomy |
---|---|---|---|
GCA_013267415.1 | SAMN11056500 | FDAARGOS_785 | Abiotrophia defectiva |
GCA_016127315.1 | SAMN16357219 | FDAARGOS_1050 | Achromobacter deleyi |
GCA_013267395.1 | SAMN11056501 | FDAARGOS_786 | Achromobacter denitrificans |
This table contains the accession number as the first field. We can download all the genomes using a dedicated script.
When we have a set of separate FASTA file, we can combine them, for example with SeqFu (especially if we think multiple files can have sequences with the same name: we can preprend a prefix to avoid collisions):
1
seqfu cat --basename *.fasta > all_genomes.fasta
Masking the reference
We need the host genome (e. g. the dog genome) and our non-host genomes (i. e. the “all_genomes.fasta” from the previous step). Using hostile:
1
2
3
4
5
# Let's use some variables
HOST_GENOME=/path/to/dog.fasta
BACTERIAL_GENOMES=./genomes/all_genomes.fasta
hostile mask --threads 8 --out-dir masked_host_genome/ "$HOST_GENOME" "$BACTERIAL_GENOMES"
List the files in the output directory to see what files were generated
Run hostile with the masked reference
We need to specify the right index name depending on the mapping method.
When using bowtie2
(default), we need to feed the basename of the fasta file without its extension
1
2
3
4
hostile clean --index $INDEX \
--threads 16 --offline \
--fastq1 "$FILE_R1" --fastq2 "$FILE_R2" \
--out-dir cleaned-reads/ > removal_stats.json
redirect the standard output of hostile to store statistics in JSON format. See an example below:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
[
{
"version": "1.1.0",
"aligner": "bowtie2",
"index": "ref/masked_canis/masked_canis",
"options": [],
"fastq1_in_name": "S8_R1.fastq.gz",
"fastq1_in_path": "/qib/test-input/S8_R1.fastq.gz",
"fastq1_out_name": "S8_R1.clean_1.fastq.gz",
"fastq1_out_path": "test-host/S8_R1.clean_1.fastq.gz",
"reads_in": 24428576,
"reads_out": 24410474,
"reads_removed": 18102,
"reads_removed_proportion": 0.00074,
"fastq2_in_name": "S8_R2.fastq.gz",
"fastq2_in_path": "/qib/test-input/S8_R2.fastq.gz",
"fastq2_out_name": "S8_R2.clean_2.fastq.gz",
"fastq2_out_path": "test-host/S8_R2.clean_2.fastq.gz"
}
]