"Mothur SOP" tutorial
The "MiSeq SOP" tutorial by Pat Schloss uses a well-known dataset from the paper:
Kozich J. J. et al. (2013). Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology.
We will use this dataset to show how easy it is to reproduce the main results with Dadaist2.
Get the dataset
The dataset consists of 20 paired-end samples (FASTQ files) taken from the paper's full dataset.
wget "https://mothur.s3.us-east-2.amazonaws.com/wiki/miseqsopdata.zip"
unzip miseqsopdata.zip
We can use SeqFu to check the number of reads:
seqfu stats -n MiSeq_SOP/*R1*.fastq
This will produce the following table:
┌─────────────────────────────────────────┬───────┬──────────┬───────┬─────┬─────┬─────┬─────┬─────┐
│ File │ #Seq │ Total bp │ Avg │ N50 │ N75 │ N90 │ Min │ Max │
├─────────────────────────────────────────┼───────┼──────────┼───────┼─────┼─────┼─────┼─────┼─────┤
│ MiSeq_SOP/F3D0_S188_L001_R1_001.fastq │ 7793 │ 1955914 │ 251.0 │ 251 │ 251 │ 251 │ 249 │ 251 │
│ MiSeq_SOP/F3D141_S207_L001_R1_001.fastq │ 5958 │ 1495347 │ 251.0 │ 251 │ 251 │ 251 │ 248 │ 251 │
│ MiSeq_SOP/F3D142_S208_L001_R1_001.fastq │ 3183 │ 798912 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D143_S209_L001_R1_001.fastq │ 3178 │ 797624 │ 251.0 │ 251 │ 251 │ 251 │ 248 │ 251 │
│ MiSeq_SOP/F3D144_S210_L001_R1_001.fastq │ 4827 │ 1211542 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D145_S211_L001_R1_001.fastq │ 7377 │ 1851563 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D146_S212_L001_R1_001.fastq │ 5021 │ 1260202 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D147_S213_L001_R1_001.fastq │ 17070 │ 4284410 │ 251.0 │ 251 │ 251 │ 251 │ 248 │ 251 │
│ MiSeq_SOP/F3D148_S214_L001_R1_001.fastq │ 12405 │ 3113525 │ 251.0 │ 251 │ 251 │ 251 │ 248 │ 251 │
│ MiSeq_SOP/F3D149_S215_L001_R1_001.fastq │ 13083 │ 3283570 │ 251.0 │ 251 │ 251 │ 251 │ 248 │ 251 │
│ MiSeq_SOP/F3D150_S216_L001_R1_001.fastq │ 5509 │ 1382628 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D1_S189_L001_R1_001.fastq │ 5869 │ 1473024 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D2_S190_L001_R1_001.fastq │ 19620 │ 4924442 │ 251.0 │ 251 │ 251 │ 251 │ 248 │ 251 │
│ MiSeq_SOP/F3D3_S191_L001_R1_001.fastq │ 6758 │ 1696205 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D5_S193_L001_R1_001.fastq │ 4448 │ 1116401 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D6_S194_L001_R1_001.fastq │ 7989 │ 2005171 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D7_S195_L001_R1_001.fastq │ 5129 │ 1287342 │ 251.0 │ 251 │ 251 │ 251 │ 248 │ 251 │
│ MiSeq_SOP/F3D8_S196_L001_R1_001.fastq │ 5294 │ 1328727 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/F3D9_S197_L001_R1_001.fastq │ 7070 │ 1774487 │ 251.0 │ 251 │ 251 │ 251 │ 250 │ 251 │
│ MiSeq_SOP/Mock_S280_L001_R1_001.fastq │ 4779 │ 1199027 │ 250.9 │ 251 │ 251 │ 250 │ 249 │ 251 │
└─────────────────────────────────────────┴───────┴──────────┴───────┴─────┴─────┴─────┴─────┴─────┘
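As a rough cross-check of the #Seq column, read counts can also be obtained with standard shell tools. The sketch below assumes that every FASTQ record spans exactly four lines, which is not guaranteed for FASTQ files in general. The helper name count_reads and the toy file are hypothetical and only serve as an illustration:

```shell
# Hypothetical helper: count the reads in an uncompressed FASTQ file,
# assuming every record spans exactly 4 lines.
count_reads() {
  echo $(( $(wc -l < "$1") / 4 ))
}

# Toy file with two records (illustration only, not part of the dataset)
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo/toy_R1.fastq"

toy_count=$(count_reads "$demo/toy_R1.fastq")
echo "$toy_count"
```

On the real data, looping the same helper over MiSeq_SOP/*R1*.fastq should give numbers comparable to the table, provided the four-line assumption holds.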
Get the reference database
If you have already downloaded a suitable reference database, you can skip this step. We will use SILVA 138, which is available for both DECIPHER and the DADA2 naive Bayesian classifier. In this example we'll use the "DADA2" version.
First, we can check which SILVA databases are available for download:
dadaist2-getdb --list silva
Then we can download one from the list (or all of them). The following command will download the 'dada2-silva-138' database to the refs subdirectory of our home directory.
dadaist2-getdb -d dada2-silva-138 -o ~/refs/
Refactoring the metadata file
The dataset comes with a tabular metadata file called MiSeq_SOP/mouse.time.design. Renaming its header is enough to make it work with Dadaist2 (which expects a #SampleID column):
sed 's/group/#SampleID/g' MiSeq_SOP/mouse.time.design > metadata.tsv
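Because the g flag rewrites every occurrence of "group" anywhere in the file, a slightly more conservative variant restricts the substitution to the start of the first line. The sketch below runs on a toy file; its two-column layout and the "time" column name are assumptions made for illustration, not a copy of the real mouse.time.design:

```shell
# Toy design file (assumed layout, illustration only)
tmp=$(mktemp -d)
printf 'group\ttime\nF3D0\tEarly\nF3D141\tLate\n' > "$tmp/toy.design"

# Rewrite only a leading "group" on line 1; data rows are left untouched
sed '1s/^group/#SampleID/' "$tmp/toy.design" > "$tmp/toy_metadata.tsv"

# Show the resulting header line
header=$(head -n 1 "$tmp/toy_metadata.tsv")
printf '%s\n' "$header"
```

For a file whose only "group" is in the header, both commands give the same result; the anchored form simply avoids surprises if a sample or category name happens to contain that word.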
Note that the metadata is missing an entry for the mock community: either add it to the metadata or remove it from the dataset.
echo -e "Mock\tControl" >> metadata.tsv
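Mismatches like the missing mock community entry can be spotted before a run by comparing file names with the first metadata column. The sketch below is a minimal check on toy data. It assumes that the sample ID is the part of the file name before the first underscore, which matches the names in this dataset but is not taken from the Dadaist2 documentation:

```shell
# Toy stand-ins for MiSeq_SOP/ and metadata.tsv (illustration only)
work=$(mktemp -d)
mkdir "$work/reads"
touch "$work/reads/F3D0_S188_L001_R1_001.fastq" \
      "$work/reads/Mock_S280_L001_R1_001.fastq"
printf '#SampleID\ttime\nF3D0\tEarly\n' > "$work/metadata.tsv"

# Assumption: sample ID = file name up to the first underscore
missing=""
for f in "$work"/reads/*_R1_*.fastq; do
  id=$(basename "$f")
  id=${id%%_*}
  # Exact match against the first tab-separated column of the metadata
  if ! awk -F'\t' -v id="$id" '$1 == id { found = 1 } END { exit !found }' "$work/metadata.tsv"; then
    missing="$missing $id"
  fi
done
echo "Samples without metadata:$missing"
```

Here the toy metadata deliberately lacks the Mock sample, so the check reports it; after appending the missing row the list would be empty.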
Running Dadaist2
The minimum parameters to run Dadaist2 are:
-i: directory with the input reads, with _R1 and _R2 tags
-o: output directory
Recommended parameters are:
-d: reference database (enables taxonomy annotation)
-m: metadata file (if not provided, a blank metadata file is generated)
-t: number of threads
dadaist2 -i MiSeq_SOP/ -o dadaist2-sop -m metadata.tsv -d ~/refs/silva_nr_v138_train_set.fa.gz
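Since the input directory is expected to hold matching _R1 and _R2 files, a quick pre-flight check can catch an incomplete download before a run like the one above. The following is a minimal sketch on toy file names, assuming the _R1_/_R2_ tags appear exactly once in each file name; it is not part of Dadaist2 itself:

```shell
# Toy directory standing in for MiSeq_SOP/ (illustration only)
reads=$(mktemp -d)
touch "$reads/F3D0_S188_L001_R1_001.fastq" \
      "$reads/F3D0_S188_L001_R2_001.fastq" \
      "$reads/Mock_S280_L001_R1_001.fastq"

unpaired=0
for r1 in "$reads"/*_R1_*.fastq; do
  base=$(basename "$r1")
  # Expected mate: same file name with the _R1_ tag swapped for _R2_
  mate=$(printf '%s\n' "$base" | sed 's/_R1_/_R2_/')
  if [ ! -e "$reads/$mate" ]; then
    echo "Missing mate for: $base"
    unpaired=$((unpaired + 1))
  fi
done
echo "Unpaired files: $unpaired"
```

The toy directory intentionally lacks the reverse file for the Mock sample, so one unpaired file is reported.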
Exploring the dataset with MicrobiomeAnalyst
The output folder contains a subdirectory called MicrobiomeAnalyst. We can go to MicrobiomeAnalyst and upload our table, taxonomy and metadata.
After deciding if and how to normalize or rescale the data (we used the default settings here), you'll land on the toolbox page.
We can perform a PCA and check whether the "Early" and "Late" samples separate, as shown in the original paper.