We have three Illumina shotgun sequencing experiments from VLP enriched samples coming from human guts.
We performed a co-assembly (from this study ).
Skip this part if you are using the EBAME VM.
To download the files to your computer you need to download the raw sequences from NCBI and the assembly file ‘illumina_sample_pool_megahit.fa.gz’ from the Zenodo archive.
You can either go manually download these files or you can run the following script:
# We create a dedicated directory
mkdir virome-data
cd virome-data
# We can use a variable to bookmark the directory
export VIROME=$(pwd)
# Download the files
EBI="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
wget https://zenodo.org/api/records/10650983/files/illumina_sample_pool_megahit.fa.gz/content \
-O illumina_sample_pool_megahit.fa.gz
curl -L "$EBI"/ERR679/005/ERR6797445/ERR6797445_1.fastq.gz -o ERR6797445_1.fastq.gz
curl -L "$EBI"/ERR679/005/ERR6797445/ERR6797445_2.fastq.gz -o ERR6797445_2.fastq.gz
curl -L "$EBI"/ERR679/004/ERR6797444/ERR6797444_1.fastq.gz -o ERR6797444_1.fastq.gz
curl -L "$EBI"/ERR679/004/ERR6797444/ERR6797444_2.fastq.gz -o ERR6797444_2.fastq.gz
curl -L "$EBI"/ERR679/003/ERR6797443/ERR6797443_1.fastq.gz -o ERR6797443_1.fastq.gz
curl -L "$EBI"/ERR679/003/ERR6797443/ERR6797443_2.fastq.gz -o ERR6797443_2.fastq.gz
Try:
# View the content of the variable
echo $VIROME
# List what's inside it
ls -l $VIROME
You have three paired-end Illumina FASTQ files in this directory:
$VIROME/ERR679744*
And a single co-assembly of them:
$VIROME/human_gut_assembly.fa
With bash, you can count the “>” like:
grep -c ">" ${my_assembly}.fa
but since we have a gzip file for our assembly we can use a similar command “zgrep”
zgrep -c ">" $VIROME/human_gut_assembly.fa.gz
Or we can use a dedicated tool, called SeqFu stats :
seqfu stats -bn $VIROME/human_gut_assembly.fa.gz
A quick way is to count the lines and divide them by four, or count the lines being “+”:
cat $VIROME/ERR6797443_1.fastq.gz | gzip -dc | grep -cw "+"