Viromics workshop

study design

We have three Illumina shotgun sequencing experiments from VLP enriched samples coming from human guts.

We performed a co-assembly (from this study ).

Get the reads

Skip this part if you are using the EBAME VM.

To download the files to your computer you need to download the raw sequences from NCBI and the assembly file ‘illumina_sample_pool_megahit.fa.gz’ from the Zenodo archive.

You can either go manually download these files or you can run the following script:

# We create a dedicated directory
mkdir virome-data
cd virome-data

# We can use a variable to bookmark the directory
export VIROME=$(pwd)

# Download the files
EBI="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
wget https://zenodo.org/api/records/10650983/files/illumina_sample_pool_megahit.fa.gz/content \
  -O illumina_sample_pool_megahit.fa.gz
curl -L "$EBI"/ERR679/005/ERR6797445/ERR6797445_1.fastq.gz -o ERR6797445_1.fastq.gz
curl -L "$EBI"/ERR679/005/ERR6797445/ERR6797445_2.fastq.gz -o ERR6797445_2.fastq.gz
curl -L "$EBI"/ERR679/004/ERR6797444/ERR6797444_1.fastq.gz -o ERR6797444_1.fastq.gz
curl -L "$EBI"/ERR679/004/ERR6797444/ERR6797444_2.fastq.gz -o ERR6797444_2.fastq.gz
curl -L "$EBI"/ERR679/003/ERR6797443/ERR6797443_1.fastq.gz -o ERR6797443_1.fastq.gz
curl -L "$EBI"/ERR679/003/ERR6797443/ERR6797443_2.fastq.gz -o ERR6797443_2.fastq.gz

On the EBAME VM

Try:

# View the content of the variable
echo $VIROME

# List what's inside it
ls -l $VIROME

You have three paired-end Illumina FASTQ files in this directory:

$VIROME/ERR679744*

And a single co-assembly of them:

$VIROME/human_gut_assembly.fa

Some questions

How many sequences are there in the assembly? What is its N50?
How many reads do we have in our samples?

With bash, you can count the “>” like:

grep -c ">" ${my_assembly}.fa

but since we have a gzip file for our assembly we can use a similar command “zgrep”

zgrep -c ">" $VIROME/human_gut_assembly.fa.gz

Or we can use a dedicated tool, called SeqFu stats :

seqfu stats -bn $VIROME/human_gut_assembly.fa.gz

How many reads in FASTQ files?

A quick way is to count the lines and divide them by four, or count the lines being “+”:

cat $VIROME/ERR6797443_1.fastq.gz | gzip -dc | grep -cw "+"

Previous submodule:

EBAME Setup

Next submodule:

Are you ready to start?