Andrea Telatin
Andrea Telatin Head of Bioinformatics at the Quadram Institute Bioscience, Norwich.

Removing custom hosts from metagenomics

Removing custom hosts from metagenomics

Hostile is a fantastic tool for removing host reads from whole metagenome shotgun datasets. It’s very user-friendly and has pre-built databases for human and murine reads decontaminating.

In this post we focus on using a custom reference database (in this case, Canis lupus familiaris).

A key step is masking the reference, that is removing from the reference regions similar to bacterial (viral, fungal…) regions, as we do not want to risk discarding prokaryotic reads.

Getting bacterial reads

Hostile shares a table of bacterial reference genomes and we will use it to download the genomes to treat as false positive matches. Of course, each one can add more sequences to mask.

From that page we can download a table like:

Assembly BioSample Strain Taxonomy
GCA_013267415.1 SAMN11056500 FDAARGOS_785 Abiotrophia defectiva
GCA_016127315.1 SAMN16357219 FDAARGOS_1050 Achromobacter deleyi
GCA_013267395.1 SAMN11056501 FDAARGOS_786 Achromobacter denitrificans

This table contains the accession number as the first field. We can download all the genomes using a dedicated script.

When we have a set of separate FASTA file, we can combine them, for example with SeqFu (especially if we think multiple files can have sequences with the same name: we can preprend a prefix to avoid collisions):

1
seqfu cat --basename *.fasta > all_genomes.fasta

Masking the reference

We need the host genome (e. g. the dog genome) and our non-host genomes (i. e. the “all_genomes.fasta” from the previous step). Using hostile:

1
2
3
4
5
# Let's use some variables
HOST_GENOME=/path/to/dog.fasta
BACTERIAL_GENOMES=./genomes/all_genomes.fasta

hostile mask --threads 8 --out-dir masked_host_genome/ "$HOST_GENOME" "$BACTERIAL_GENOMES"

:bulb: List the files in the output directory to see what files were generated

Run hostile with the masked reference

We need to specify the right index name depending on the mapping method. When using bowtie2 (default), we need to feed the basename of the fasta file without its extension

1
2
3
4
hostile clean --index $INDEX \
  --threads 16 --offline \
  --fastq1 "$FILE_R1" --fastq2 "$FILE_R2" \
  --out-dir cleaned-reads/ > removal_stats.json

:bulb: redirect the standard output of hostile to store statistics in JSON format. See an example below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
[
    {
        "version": "1.1.0",
        "aligner": "bowtie2",
        "index": "ref/masked_canis/masked_canis",
        "options": [],
        "fastq1_in_name": "S8_R1.fastq.gz",
        "fastq1_in_path": "/qib/test-input/S8_R1.fastq.gz",
        "fastq1_out_name": "S8_R1.clean_1.fastq.gz",
        "fastq1_out_path": "test-host/S8_R1.clean_1.fastq.gz",
        "reads_in": 24428576,
        "reads_out": 24410474,
        "reads_removed": 18102,
        "reads_removed_proportion": 0.00074,
        "fastq2_in_name": "S8_R2.fastq.gz",
        "fastq2_in_path": "/qib/test-input/S8_R2.fastq.gz",
        "fastq2_out_name": "S8_R2.clean_2.fastq.gz",
        "fastq2_out_path": "test-host/S8_R2.clean_2.fastq.gz"
    }
]

comments powered by Disqus