Dadaist2: a first tutorial
Updated for version 1.3.0
This tutorial aims at familiarising with the programme, but relies on a very short and noisy dataset. To fully test the pipeline we recommend a well established dataset such as "Mothur SOP", see the full tutorial
Get ready
Install Dadaist2 and activate the Miniconda environment (if needed).
For this tutorial we will analyze three small samples (also present in the repository). This will create a ./data
directory.
wget "https://github.com/quadram-institute-bioscience/dadaist2/releases/download/v1.2.4/data.zip"
unzip data.zip && rm data.zip
Let's start checking the number of reads per sample. The reads are in data/16S
:
seqfu count --basename data/16S/*.gz
This will tell the number of reads, checking that the forward (R1) and reverse (R2) pair have the same amount of reads. This should be the output produced:
F99_S0_L001_R1_001.fastq.gz 4553 Paired
A01_S0_L001_R1_001.fastq.gz 6137 Paired
A02_S0_L001_R1_001.fastq.gz 5414 Paired
Sample names should not begin with numbers (eg: 1_R1.fastq.gz
). This will be prohibited in a future release, and will trigger a warning in the current release.
Download a reference database
Dadaist2 provides a convenient tool to download some pre-formatted reference databases. To have a list of the available to download:
dadaist2-getdb --list
This should produce something like:
╭─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ dadaist2-getdb 1.2.5 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Code ┃ Description ┃ Paper ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dada2-refseq-2020 │ RefSeq+RDP: This database contains 22433 bacterial, │ Alishum 2021 │
│ │ 1055 archaea and 99 eukaryotic full lengths16S │ │
│ │ (19/07/2020) │ │
│ dada2-hitdb │ HITdb is a reference taxonomy for Human Intestinal 16S │ Ritari 2015 │
│ │ rRNA genes │ │
│ decipher-silva-138 │ SILVA release 138 (Decipher) │ Quast 2013 │
│ dada2-unite │ UNITE database for ITS │ Nilsson 2018 │
│ dada2-gtdb-2018 │ GTDB 20486 bacteria and 1073 archaea full 16S rRNA │ Alishum 2019 │
│ │ gene sequences. (20/11/2018) │ │
│ decipher-unite-2020 │ UNITE 2020 (Decipher) │ Nilsson 2018 │
│ dada2-gtdb-2020 │ GTDB 21965 bacteria and 1126 archaea full 16S rRNA │ Alishum 2021 │
│ │ gene sequences. (19/07/2020) │ │
│ testset │ FASTQ input, Small 16S dataset to test the suite │ │
│ dada2-rdp-train-16 │ RDP taxonomic training data formatted for DADA2 (RDP │ Cole 2014 │
│ │ trainset 16/release 11.5) │ │
│ dada2-silva-138 │ SILVA release 138 │ Quast 2013 │
│ decipher-gtdb95 │ GTDB │ Chaumeil 2019 │
└────────────────────────┴────────────────────────────────────────────────────────┴───────────────┘
Some reference files are for DADA2, others are for DECIPHER. Dadaist2 will automatically select the correct classifier based on the extension of the reference database.
The keyword before the column is the dataset name, to download it we need to choose a destination directory, we can for example make a refs
subdirectory:
mkdir -p refs
dadaist2-getdb -d decipher-silva-138 -o ./refs
This will place SILVA_SSU_r138_2019.RData
in the output directory.
Generate a mapping file
A metadata file is not mandatory, but it's easy to generate one to be extended with more columns if needed.
dadaist2-metadata -i data/16S > metadata.tsv
This should generate a file called metadata.tsv with the following content:
#SampleID Files
A01 A01_S0_L001_R1_001.fastq.gz,A01_S0_L001_R2_001.fastq.gz
A02 A02_S0_L001_R1_001.fastq.gz,A02_S0_L001_R2_001.fastq.gz
F99 F99_S0_L001_R1_001.fastq.gz,F99_S0_L001_R2_001.fastq.gz
Run the analysis
Dadaist2 provides options to:
- select the QC strategy (fastp, cutadapt of seqfu)
- select the taxonomy classifier (DECIPHER or DADA2 naive classifier)
- adjust various steps via command line parameters
As a first run, we recommend using the default parameters:
dadaist2 -i data/16S/ -o example-output -d refs/SILVA_SSU_r138_2019.RData -t 8 -m metadata.tsv --verbose
Briefly:
-
-i
points to the input directory containing paired end reads (by default recognised by_R1
and_R2
tags, but this can be customised) -
-o
is the output directory -
-d
is the reference database in DADA2 or DECIPHER format (we downloaded a DECIPHER database) -
-m
link to the metadata file (if not supplied a blank one will be generated and used) -
-t
is the number of processing threads -
--verbose
will print more information about the analysis
Warnings and errors
Dadaist2 will print the following warning:
DADA2 filtered too many reads: 9.0102% from total 16104 to 1451
We can inspect the filtering steps in the output directory (dada2_stats.tsv
) to check the steps with the highest loss (can differs slightly):
Sample name | input | filtered | denoised | merged | non-chimeric |
---|---|---|---|---|---|
A01 | 6137 | 2291 | 2291 | 1924 | 765 |
A02 | 5414 | 2079 | 2079 | 1865 | 413 |
F99 | 4553 | 1770 | 1770 | 1517 | 273 |
If, for example, the highest loss is at "merged", it means we truncated too much and we didn't have overlap between the reads. In this case a good loss is at the first step (filtered), as these sample reads are not of very high quality and are just used to test the pipeline.
Now there is a dadaist2-checkstats
tool to identify the steps causing the biggest loss.
In this datasets the primers were not removed. There are two ways to fix this:
- Use the primer sequences with
--primers CCTACGGGNGGCWGCAGTNG:GACTACNNGGGTATCTAATC
(forward:reverse) - Trim fixed lengths from forward and reverse reads with
--trim-primer-for 20
and--trim-primer-rev 20
(or--s1 20
and--s2 20
in shorter form)
A real dataset
If the pipeline ended as expected, it means you are ready to run it with real samples as described in another tutorial.
The output directory
Notable files:
- rep-seqs.fasta representative sequences (ASVs) in FASTA format
- rep-seqs-tax.fasta representative sequences (ASVs) in FASTA format, with taxonomy labels as comments
- feature-table.tsv table of raw counts (after cross-talk removal if specified)
- taxonomy.tsv a text file with the taxonomy of each ASV (used to add the labels to the rep-seqs-tax.fasta)
- copy of the metadata.tsv file
Subdirectories:
- MicrobiomeAnalyst a set of files formatted to be used with the online (also available offline as R package) software MicrobiomeAnalyst.
- Rhea a directory with files to be used with the Rhea pipeline, as well as some pre-calculated outputs (Normalization and Alpha diversity are done by default, as they don't require knowledge about metadata categories)
- R a directory with the PhyloSeq object
Note that the MicrobiomeAnalyst and Rhea directories are only generated if DADA2 didn't filter too many reads. To try producing them anyway, add --force
(not recommended).