Published metagenomic studies usually deposit the raw shotgun sequencing data (FASTQ) in a public repository, such as the NCBI Sequence Read Archive (SRA) or the EBI European Nucleotide Archive (ENA).

Downloading FASTQ files from SRA

You can install a command-line tool called “SRA Toolkit” to download FASTQ files from SRA. The following command creates a dedicated Conda environment for it:

conda create -n getreads \
   -c bioconda -c conda-forge \
   sra-tools

Activate the environment with conda activate getreads. Then, for example, to download the reads from ERR2231572:

fasterq-dump --verbose --skip-technical --split-files \
    --outdir "coffee-reads/" \
    --threads 8 \
    ERR2231572
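
To fetch several runs with the same settings, the command above can be wrapped in a small loop. This is a minimal sketch, assuming a plain-text file with one accession per line; the function name download_runs and the FASTERQ_DUMP variable are hypothetical helpers introduced here for illustration, not part of SRA Toolkit.

```shell
# Hypothetical helper: download every accession listed in a file (one per line).
# FASTERQ_DUMP is an assumed override hook, e.g. to point at a specific installation.
FASTERQ_DUMP="${FASTERQ_DUMP:-fasterq-dump}"

download_runs() {
    ids_file="$1"   # text file with one run accession per line
    outdir="$2"     # directory that will receive the FASTQ files
    mkdir -p "$outdir"
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines
        [ -z "$acc" ] && continue
        # Same options as the single-accession example (without --verbose);
        # stdin is detached so the tool cannot consume the ID list
        "$FASTERQ_DUMP" --skip-technical --split-files \
            --outdir "$outdir" --threads 8 "$acc" < /dev/null || return 1
    done < "$ids_file"
}

# Usage (not run here): download_runs coffee_ids.csv coffee-reads
```

fasterq-dump writes uncompressed FASTQ files, so you may want to compress them afterwards (for example with gzip) to save disk space.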

Using a Nextflow pipeline

If you don’t have Nextflow, install it first (it is available from Conda, for example).

The nf-core community curates a pipeline called nf-core/fetchngs that makes it easy to download multiple accessions in parallel.

You will need to prepare a list of accession IDs, one per line. For example, let’s call it coffee_ids.csv:

ERR2231567
ERR2231568
ERR2231569
ERR2231570
ERR2231571
ERR2231572

:warning: The file must not contain empty lines or stray spaces.
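
A quick way to guard against these problems is to clean and check the list before launching the pipeline. The sketch below is only an illustration: clean_ids and check_ids are hypothetical helper names, and the check assumes the list holds run accessions only (SRR, ERR or DRR followed by digits), as in the example above.

```shell
# Hypothetical helper: print the ID list without carriage returns,
# spaces/tabs or empty lines.
clean_ids() {
    tr -d '\r' < "$1" | sed 's/[[:space:]]//g' | grep -v '^$'
}

# Hypothetical helper: succeed only if every line looks like a run accession.
# Other identifier types would need a wider pattern.
check_ids() {
    ! grep -qvE '^[SED]RR[0-9]+$' "$1"
}

# Usage (not run here):
#   clean_ids raw_ids.csv > coffee_ids.csv
#   check_ids coffee_ids.csv && echo "ID list looks fine"
```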

To run the pipeline you need a container system, either Docker or Apptainer (formerly Singularity). Assuming we use Docker:

nextflow run nf-core/fetchngs \
  --input coffee_ids.csv \
  --outdir coffee-reads \
  -profile docker

In Nextflow, parameters starting with a double dash (like --input) are passed to the pipeline, while those with a single dash (like -profile) are interpreted by Nextflow itself.

The output directory contains these subdirectories:

  • metadata: contains a metadata TSV file for each accession
  • samplesheet: contains collated tables with data from each sample
  • fastq: contains the FASTQ files
  • md5: contains MD5 checksum files for the FASTQ files. These checksums are used to verify the integrity of the files, ensuring that they were not corrupted or altered during download and processing.
  • pipeline_info: contains information about the pipeline execution (logs)
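
If you move or copy the files later, the checksums can be checked again by hand. The following is a minimal sketch, not the pipeline's own method: verify_md5 is a hypothetical helper, and it assumes that each .md5 file uses the md5sum format ("<checksum>  <file name>") with file names relative to the FASTQ directory, and that GNU md5sum is available.

```shell
# Hypothetical helper: check each FASTQ file against its recorded MD5 checksum.
# Assumes *.md5 files are in md5sum format with names relative to the FASTQ directory.
verify_md5() {
    fastq_dir="$1"   # directory holding the FASTQ files
    md5_dir="$2"     # directory holding the *.md5 files
    status=0
    for f in "$md5_dir"/*.md5; do
        # Nothing to do if the pattern matched no files
        [ -e "$f" ] || continue
        # Run the check from inside the FASTQ directory so relative names resolve
        ( cd "$fastq_dir" && md5sum -c - ) < "$f" || status=1
    done
    return "$status"
}

# Usage (not run here): verify_md5 coffee-reads/fastq coffee-reads/md5
```

Adjust the two paths to wherever the fastq and md5 directories are located in your output directory. The function returns a non-zero status if any file fails the check.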
