We already introduced fastp as a versatile and fast tool to perform quality filtering and trimming of sequencing reads.
A typical command to filter paired-end reads is the following:
fastp -i sample_R1.fastq -I sample_R2.fastq \
-o sample_trimmed_R1.fastq -O sample_trimmed_R2.fastq \
--detect_adapter_for_pe \
--cut_tail -q 20 -u 30 -l 50 -w 8 \
-h fastp_report.html -j fastp_report.json
-i and -I: input files (R1 and R2)
-o and -O: output files (R1 and R2)
--detect_adapter_for_pe: automatically detect adapters for paired-end data
--cut_tail: trim low-quality bases from the 3’ end using a sliding window (the window's mean-quality threshold is set separately with --cut_mean_quality, default 20)
-q 20: minimum Phred quality for a base to count as qualified
-u 30: maximum percentage of unqualified bases (below the -q threshold) allowed in a read (30%); reads above this limit are discarded
-l 50: minimum length of reads to keep after filtering (50 bp)
-w 8: number of threads to use (8)
-h and -j: output HTML and JSON reports

In previous workshops we introduced the basics of Bash scripting, including loops.
Here we can use:
a for loop to process all the samples one by one
an if statement to check whether a file exists

# Set your input and output directory paths/names
RAW_READS=./reads/
OUTDIR=./filt-reads/
mkdir -p "$OUTDIR"/logs
for r1 in "$RAW_READS"/*_1.fastq.gz; do
    # Infer the name of R2
    r2=${r1/_1.fastq.gz/_2.fastq.gz}
    # Extract the "sample name"
    base=$(basename "$r1" _1.fastq.gz)
    # Check if R2 exists
    if [[ ! -e $r2 ]]; then
        echo "Skipping $base: R2 not found in $r2"
        continue
    fi
    # Run fastp only if the report is missing; otherwise assume the sample was already processed
    if [[ ! -e "$OUTDIR/logs/${base}.fastp.html" ]]; then
        fastp -i "$r1" -I "$r2" \
            -o "$OUTDIR/${base}_1.fastq.gz" \
            -O "$OUTDIR/${base}_2.fastq.gz" \
            --detect_adapter_for_pe --cut_tail -q 20 -u 30 -l 50 -w 8 \
            -h "$OUTDIR/logs/${base}.fastp.html" \
            -j "$OUTDIR/logs/${base}.fastp.json"
    else
        echo "WARNING: Skipping $r1 because $OUTDIR/logs/${base}.fastp.html was found"
    fi
done
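
Before running the loop on real data, it can help to check that the file-name logic behaves as expected. The following is a minimal dry-run sketch, not part of the pipeline itself: it creates empty placeholder files in a temporary directory and only reports what would happen, without calling fastp. The sample names (sampleA, sampleB) and the variables DEMO_DIR, paired and orphans are hypothetical and used for illustration only.

```shell
#!/usr/bin/env bash
# Dry run of the name-inference logic on empty placeholder files (fastp is not called).
# DEMO_DIR, sampleA and sampleB are hypothetical names used only for this illustration.

DEMO_DIR=$(mktemp -d)
touch "$DEMO_DIR"/sampleA_1.fastq.gz "$DEMO_DIR"/sampleA_2.fastq.gz
touch "$DEMO_DIR"/sampleB_1.fastq.gz    # sampleB has no R2 on purpose

paired=()    # samples with both R1 and R2
orphans=()   # samples where R2 is missing

for r1 in "$DEMO_DIR"/*_1.fastq.gz; do
    # Replace the first occurrence of the R1 suffix with the R2 suffix
    r2=${r1/_1.fastq.gz/_2.fastq.gz}
    # Strip the directory and the suffix to obtain the sample name
    base=$(basename "$r1" _1.fastq.gz)
    if [[ -e $r2 ]]; then
        paired+=("$base")
        echo "Would run fastp on: $base"
    else
        orphans+=("$base")
        echo "Would skip $base: R2 not found"
    fi
done

# Remove the placeholder files
rm -rf "$DEMO_DIR"
```

Note that the `${r1/_1.fastq.gz/_2.fastq.gz}` substitution replaces the first match anywhere in the path, so a directory name containing `_1.fastq.gz` would confuse it. If that is a concern, `${r1%_1.fastq.gz}_2.fastq.gz` removes the suffix only from the end of the string before appending the R2 suffix.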