We already introduced fastp as a versatile and fast tool to perform quality filtering and trimming of sequencing reads.

A typical command to filter paired-end reads is the following:

fastp -i sample_R1.fastq -I sample_R2.fastq \
    -o sample_trimmed_R1.fastq -O sample_trimmed_R2.fastq \
    --detect_adapter_for_pe \
    --cut_tail -q 20 -u 30 -l 50 -w 8 \
    -h fastp_report.html -j fastp_report.json
  • -i and -I: input files (R1 and R2)
  • -o and -O: output files (R1 and R2)
  • --detect_adapter_for_pe: automatically detect adapters for paired-end data
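  • --cut_tail: trim low-quality bases from the 3’ end with a sliding window; the window's mean-quality threshold is set by --cut_mean_quality (default 20), not by -q
  • -q 20: minimum Phred score for a base to count as "qualified"
  • -u 30: maximum percentage of unqualified bases (below the -q threshold) allowed in a read (30%); reads above this limit are discarded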
  • -l 50: minimum length of reads to keep after filtering (50 bp)
  • -w 8: number of threads to use (8)
  • -h and -j: output HTML and JSON reports
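
The -q and -u options work together as a per-read filter, which can be easier to see with a small worked example. The sketch below is not fastp's own code: it is a hypothetical Bash helper (unqualified_percent, a name made up for illustration) that assumes Phred+33 encoding and reports which percentage of bases in a quality string fall below a threshold.

```shell
#!/usr/bin/env bash
# Illustration only: percentage of bases in a Phred+33 quality string
# whose score is below a threshold (the idea behind -q and -u).
unqualified_percent() {
    local qual=$1 threshold=$2
    local i ascii low=0
    # An empty string has no bases to evaluate
    if (( ${#qual} == 0 )); then
        echo 0
        return
    fi
    for (( i = 0; i < ${#qual}; i++ )); do
        # A leading quote makes printf return the character's ASCII code
        printf -v ascii '%d' "'${qual:i:1}"
        # Phred+33: quality score = ASCII code - 33
        if (( ascii - 33 < threshold )); then
            low=$(( low + 1 ))
        fi
    done
    # Integer percentage (rounded down)
    echo $(( 100 * low / ${#qual} ))
}

# 'I' encodes Q40 and '#' encodes Q2: 4 of 10 bases are below Q20
pct=$(unqualified_percent 'IIIIII####' 20)
echo "${pct}% of bases are below Q20"   # prints: 40% of bases are below Q20
```

With -q 20 -u 30, a read like this one (40% of bases below Q20) would exceed the 30% limit and be discarded.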

A loop

In previous workshops we introduced the basics of Bash scripting, including loops.

Here we can use:

  • variables to store the input and output paths
  • a for loop to process all the samples one by one
  • a basic if to check if a file exists
# Set your input and output directories (no trailing slash: one is added below)
RAW_READS=./reads
OUTDIR=./filt-reads

mkdir -p "$OUTDIR"/logs

for r1 in "$RAW_READS"/*_1.fastq.gz; do
    # Infer the name of R2
    r2=${r1/_1.fastq.gz/_2.fastq.gz}
    # Extract the "sample name"
    base=$(basename "$r1" _1.fastq.gz)
    
    # Check that R2 exists, otherwise skip this sample
    if [[ ! -e "$r2" ]]; then
        echo "Skipping $base: R2 not found in $r2"
        continue
    fi

    # Run fastp only if its HTML report is missing;
    # otherwise assume the sample was already processed
    if [[ ! -e "$OUTDIR"/logs/${base}.fastp.html ]]; then
        fastp -i "$r1" -I "$r2" \
          -o "$OUTDIR"/${base}_1.fastq.gz \
          -O "$OUTDIR"/${base}_2.fastq.gz \
          --detect_adapter_for_pe --cut_tail -q 20 -u 30 -l 50 -w 8 \
          -h "$OUTDIR"/logs/${base}.fastp.html \
          -j "$OUTDIR"/logs/${base}.fastp.json
    else
        echo "WARNING: Skipping $r1 because $OUTDIR/logs/${base}.fastp.html was found"
    fi
done
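
The loop relies on two small Bash idioms to derive file names: pattern substitution (${var/old/new}) and basename with a suffix argument. A minimal sketch of both, using a hypothetical file name (sampleA is invented for illustration) and no call to fastp, so it can be run safely to check the naming logic:

```shell
#!/usr/bin/env bash
# Hypothetical R1 path, used only to illustrate the name handling
r1=./reads/sampleA_1.fastq.gz

# Replace the first occurrence of the R1 suffix with the R2 suffix
r2=${r1/_1.fastq.gz/_2.fastq.gz}

# Strip the directory and the suffix to obtain the sample name
base=$(basename "$r1" _1.fastq.gz)

echo "R1:     $r1"
echo "R2:     $r2"
echo "Sample: $base"
```

If your files follow a different naming scheme (for example _R1.fastq.gz), the suffix has to be changed in the glob, in the substitution, and in the basename call.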
