Tutorial — ReadFX

What are FASTA and FASTQ?

FASTA is the simplest sequence format. Each record is a header line (starting with >) followed by the nucleotide or amino-acid sequence, which may sit on a single line or be wrapped across several.

fasta

>read_001 some optional comment
ACGTACGTACGTACGTACGT
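
To make the record layout concrete, here is a minimal sketch of a FASTA reader written with the Nim standard library only. It is not how ReadFX parses files; the FastaRec type and parseFasta proc are hypothetical names used for illustration. The sketch assumes the name ends at the first space and joins wrapped sequence lines into one string.

```nim
import std/strutils

# Hypothetical record type for this sketch (not ReadFX's FQRecord).
type FastaRec = tuple[name, comment, sequence: string]

proc parseFasta(text: string): seq[FastaRec] =
  ## Starts a new record at every '>' line and appends all following
  ## non-empty lines to that record's sequence.
  for rawLine in text.splitLines():
    let line = rawLine.strip()
    if line.len == 0:
      continue
    if line[0] == '>':
      # Name is everything up to the first space; the rest is the comment.
      let parts = line[1 .. ^1].split(' ', maxsplit = 1)
      let comment = if parts.len > 1: parts[1] else: ""
      result.add((name: parts[0], comment: comment, sequence: ""))
    elif result.len > 0:
      result[^1].sequence.add(line)

let recs = parseFasta(">read_001 some optional comment\nACGTACGTAC\nGTACGTACGT\n")
echo recs[0].name, " (", recs[0].sequence.len, " bp)"
```

In real code you would let ReadFX do this work; the sketch only shows what a parser has to recover from the text.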

FASTQ extends FASTA with per-base quality scores. Every record spans four lines:

fastq

@read_001 some optional comment
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII

Line 3 starts with + (usually a bare +, although some older files repeat the identifier after it). Line 4 encodes Phred quality scores as ASCII characters in the standard Phred+33 encoding: subtract 33 from the character code to get the integer score (e.g., 'I' = ASCII 73 − 33 = Q40, meaning a 1-in-10,000 base-call error probability). ReadFX handles the conversion for you.
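
The arithmetic above can be sketched in a few lines of plain Nim. The proc names phredScores and errorProbability are made up for this illustration and are not part of ReadFX; the offset of 33 is the Phred+33 encoding described above.

```nim
import std/math

proc phredScores(quality: string, offset = 33): seq[int] =
  ## Converts each ASCII quality character to its integer Phred score.
  for ch in quality:
    result.add(ord(ch) - offset)

proc errorProbability(q: int): float =
  ## Phred definition: Q = -10 * log10(p), hence p = 10^(-Q/10).
  pow(10.0, -q.float / 10.0)

echo phredScores("I5!")      # 'I' = 73, '5' = 53, '!' = 33
echo errorProbability(40)    # error probability for Q40
```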

Files are almost always gzip-compressed (.fastq.gz). ReadFX decompresses transparently — you don't need to gunzip first.

Installation

Install the library via Nimble:

shell

nimble install readfx

Or add it to your project's .nimble file:

nimble

requires "readfx >= 0.2.0"

Then in your source:

nim

import readfx

That single import brings in all types, parsers, and utilities.

Reading your first file

The easiest way to iterate over records is readFQ. It is an iterator that yields one FQRecord per sequence, so you consume it with an ordinary Nim for loop.

nim

import readfx

for record in readFQ("sample.fastq.gz"):
  echo record.name, " (", record.sequence.len, " bp)"

The same call works equally well on FASTA files and on uncompressed files. To read from standard input, pass "-" as the path:

nim

for record in readFQ("-"):
  echo record.name

Compile with optimisations: for production use, add --opt:speed --gc:arc to your compile command (recent Nim releases spell the second switch --mm:arc). This can double throughput with no code changes.

Understanding the FQRecord

Every record returned by readFQ is an FQRecord object with four string fields:

Field	Type	Description
`name`	`string`	Sequence identifier (everything up to the first space on the header line)
`comment`	`string`	Optional free-text after the name (may be empty)
`sequence`	`string`	Nucleotide or amino-acid sequence
`quality`	`string`	Phred quality string (empty for FASTA records)

nim

import readfx

for record in readFQ("sample.fastq.gz"):
  echo "Name:     ", record.name
  echo "Comment:  ", record.comment
  echo "Sequence: ", record.sequence
  echo "Length:   ", record.sequence.len
  echo "Quality:  ", record.quality     # empty if FASTA
  echo "---"

Because the fields are plain Nim strings, you can use the entire Nim standard library on them: strutils, sequtils, regular expressions (re), and so on.
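
As a small illustration of that point, the sketch below applies a few strutils and sequtils calls to plain strings standing in for record.name and record.sequence. The literal values are invented for the example; with ReadFX you would pass the record fields instead.

```nim
import std/[strutils, sequtils]

# Plain strings standing in for record.name and record.sequence.
let name = "read_001"
let sequence = "ACGTNNACGTacgt"

let nCount   = sequence.count('N')          # strutils: count occurrences
let upper    = sequence.toUpperAscii()      # strutils: normalise case
let allValid = upper.allIt(it in {'A', 'C', 'G', 'T', 'N'})  # sequtils
let prefixed = name.startsWith("read_")     # strutils: prefix test

echo name, ": N=", nCount, " valid=", allValid, " prefixed=", prefixed
```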

Working with sequences

ReadFX exports a collection of sequence-manipulation procedures that operate directly on FQRecord objects (or plain strings).

Reverse complement

nim

let rc = revCompl("ATGCCC")   # → "GGGCAT"

var rec: FQRecord = ...
revCompl(rec)                  # modify in place (also reverses quality)
let copy = revCompl(rec)       # get a new reverse-complemented copy
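
If you are unsure what the operation does to a record, the following standalone sketch spells it out. It is not ReadFX's implementation; reverseComplement and reversedQuality are hypothetical helpers, and the sketch maps anything other than A, C, G, T (in either case) to N instead of handling IUPAC ambiguity codes.

```nim
proc complement(base: char): char =
  ## Watson-Crick complement; unknown symbols become 'N' in this sketch.
  case base
  of 'A': 'T'
  of 'C': 'G'
  of 'G': 'C'
  of 'T': 'A'
  of 'a': 't'
  of 'c': 'g'
  of 'g': 'c'
  of 't': 'a'
  else: 'N'

proc reverseComplement(sequence: string): string =
  ## Writes the complement of each base at the mirrored position.
  result = newString(sequence.len)
  for i, base in sequence:
    result[sequence.high - i] = complement(base)

proc reversedQuality(quality: string): string =
  ## Quality values stay attached to their base, so they are only reversed.
  result = newString(quality.len)
  for i, ch in quality:
    result[quality.high - i] = ch

echo reverseComplement("ATGCCC")   # GGGCAT
echo reversedQuality("ABC")        # CBA
```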

GC content

nim

let gc = gcContent(record)   # returns a float, 0.0–1.0
echo "GC: ", (gc * 100).int, "%"
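
For reference, GC content is simply the share of G and C bases. A minimal sketch on a plain string follows; gcFraction is a hypothetical name, and counting every character (including N) in the denominator is an assumption that may differ from what gcContent does.

```nim
proc gcFraction(sequence: string): float =
  ## Fraction of characters that are G or C (case-insensitive).
  ## Returns 0.0 for an empty string to avoid dividing by zero.
  if sequence.len == 0:
    return 0.0
  var gc = 0
  for base in sequence:
    if base in {'G', 'C', 'g', 'c'}:
      inc gc
  gc / sequence.len   # int / int yields a float in Nim

echo gcFraction("ACGT")
```

Note that converting a percentage with .int truncates towards zero, so 49.9 is printed as 49.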

Nucleotide composition

nim

let comp = composition(record)
echo "A=", comp.A, " C=", comp.C, " G=", comp.G, " T=", comp.T
echo "N=", comp.N, " GC=", comp.GC
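
A composition is just a per-symbol tally. The sketch below reproduces the idea with CountTable from the standard library; baseCounts is a hypothetical helper, not the ReadFX composition procedure, and it returns a table instead of an object with named fields.

```nim
import std/tables

proc baseCounts(sequence: string): CountTable[char] =
  ## Counts how often each character occurs in the sequence.
  result = initCountTable[char]()
  for base in sequence:
    result.inc(base)

let counts = baseCounts("AACGTN")
# Looking up a symbol that never occurred returns 0.
echo "A=", counts['A'], " N=", counts['N'], " X=", counts['X']
```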

Quality trimming

Remove low-quality bases from the 3′ end. The record's sequence and quality strings are both shortened in place.

nim

var r = record           # make a mutable copy
qualityTrim(r, 20)       # trim bases with Phred score < 20
echo r.sequence.len      # may be shorter now
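
One simple way to think about 3′ trimming is "drop trailing bases until one reaches the threshold". The sketch below implements exactly that rule on two plain strings. Whether qualityTrim uses this rule or a windowed variant is not stated here, so treat trim3Prime (a hypothetical name) as an illustration of the idea only. It assumes Phred+33 and equal-length sequence and quality strings.

```nim
proc trim3Prime(sequence, quality: var string, minQual: int, offset = 33) =
  ## Shortens both strings in place, removing trailing positions whose
  ## Phred score is below minQual. Stops at the first base that passes.
  var keep = quality.len
  while keep > 0 and ord(quality[keep - 1]) - offset < minQual:
    dec keep
  sequence.setLen(keep)
  quality.setLen(keep)

var s = "ACGTACGT"
var q = "IIIII5#!"     # Phred 40, 40, 40, 40, 40, 20, 2, 0
trim3Prime(s, q, 20)
echo s, " ", q         # the last two bases are removed
```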

Masking low-quality bases

nim

maskLowQuality(r, 20)              # replace Q<20 with 'N'
maskLowQuality(r, 15, maskChar='X')  # or any other character
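
Masking keeps the read length unchanged and only overwrites unreliable positions. A minimal standalone sketch, assuming Phred+33; maskBelow is a hypothetical helper that mirrors the idea, not the ReadFX procedure itself.

```nim
proc maskBelow(sequence: var string, quality: string, minQual: int,
               maskChar = 'N', offset = 33) =
  ## Replaces every base whose Phred score is below minQual with maskChar.
  for i in 0 ..< min(sequence.len, quality.len):
    if ord(quality[i]) - offset < minQual:
      sequence[i] = maskChar

var bases = "ACGT"
maskBelow(bases, "I!I#", 20)   # positions 1 and 3 have Phred 0 and 2
echo bases
```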

Subsequence extraction

nim

let first50 = subSequence(record, 0, 50)   # first 50 bases
let fromPos = subSequence(record, 10)      # from base 10 to end
let trimmed = trimStart(record, 5)         # remove 5 bases from 5' end
let clipped = trimEnd(record, 3)           # remove 3 bases from 3' end

Full worked example

Putting it all together: read a file, filter short reads, trim quality, report per-record statistics.

nim

import readfx, strutils

const minLength = 50
const minQual   = 20

for record in readFQ("sample.fastq.gz"):
  # Skip short sequences
  if record.sequence.len < minLength:
    continue

  # Work on a mutable copy
  var r = record
  qualityTrim(r, minQual)

  # Skip reads that became too short after trimming
  if r.sequence.len < minLength:
    continue

  let comp = composition(r)
  let avg  = avgQuality(r)

  echo r.name,
       "\tlen=", r.sequence.len,
       "\tGC=",  (comp.GC * 100).formatFloat(ffDecimal, 1), "%",
       "\tQ=",   avg.formatFloat(ffDecimal, 1)

Paired-end reads

Illumina sequencing typically produces two files per sample: R1 (forward reads) and R2 (reverse reads). The readFQPair iterator reads both files in lockstep, yielding an FQPair containing both mates.

nim

import readfx

for pair in readFQPair("sample_R1.fastq.gz", "sample_R2.fastq.gz"):
  echo "R1: ", pair.read1.name, "  (", pair.read1.sequence.len, " bp)"
  echo "R2: ", pair.read2.name, "  (", pair.read2.sequence.len, " bp)"

If the files have different numbers of records, an IOError is raised. Enable name validation to catch mismatched or scrambled files:

nim

for pair in readFQPair("R1.fastq.gz", "R2.fastq.gz", checkNames = true):
  # raises ValueError if read names don't match
  # strips /1 /2 suffixes automatically before comparing
  discard
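
The comparison described in the comments above can be pictured as follows. This is a sketch of the idea, not the library's code: mateBaseName and matesMatch are hypothetical names, and only the /1 and /2 suffixes mentioned above are handled.

```nim
import std/strutils

proc mateBaseName(name: string): string =
  ## Removes a trailing "/1" or "/2" mate suffix if one is present.
  if name.endsWith("/1") or name.endsWith("/2"):
    name[0 ..< name.len - 2]
  else:
    name

proc matesMatch(name1, name2: string): bool =
  ## Two reads are treated as mates when their base names are identical.
  mateBaseName(name1) == mateBaseName(name2)

echo matesMatch("read_001/1", "read_001/2")
echo matesMatch("read_001/1", "read_002/2")
```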

Choosing the right parser

ReadFX exposes three single-file parsers. Here's the mental model:

Iterator	Returns	When to use
`readFQ`	`FQRecord`	Start here. Yields safe Nim strings; records can be stored in sequences, passed to functions, etc.
`readFQPtr`	`FQRecordPtr`	Processing tens of millions of reads where allocation overhead is measurable. Do not store pointers across iterations.
`readFastx`	fills `FQRecord`	Custom I/O workflows — e.g. interleaving reads from two different sources in a single loop.

Warning (readFQPtr): the C-level buffer backing FQRecordPtr fields is reused on every iteration. Converting to a Nim string with $record.name is safe, but storing the raw pointer is not. If in doubt, use readFQ.
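
The hazard is easy to reproduce without the library. In the sketch below a fixed array stands in for the parser's reused buffer (a deliberately simplified, hypothetical setup): the raw cstring view changes when the buffer is refilled, while the string made with $ keeps its own copy.

```nim
# A fixed buffer standing in for a parser's reused C-level buffer.
var buffer: array[16, char]

proc fill(buf: var array[16, char], text: string) =
  ## Overwrites the buffer with text plus a terminating NUL byte,
  ## the way a parser would on each new record.
  assert text.len < 16
  for i in 0 ..< text.len:
    buf[i] = text[i]
  buf[text.len] = '\0'

fill(buffer, "read_001")
let rawPtr = cast[cstring](addr buffer[0])  # raw view into the buffer, no copy
let copied = $rawPtr                        # `$` copies the bytes into a Nim string

fill(buffer, "read_002")                    # the "next iteration" reuses the buffer

echo copied   # still the first name
echo rawPtr   # now shows the second name
```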

Performance tips

Compile with --opt:speed --gc:arc for best throughput. ARC's deterministic memory management eliminates GC pauses.
Prefer readFQPtr when you only need a few fields (e.g., just the sequence) and don't need to store records.
Avoid re-allocating inside tight loops — compute expensive operations outside of the hot path when possible.
The library auto-detects gzip vs. plain text; no penalty for checking.
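
Format detection of this kind is typically cheap because gzip streams announce themselves in their first two bytes (0x1f 0x8b, per RFC 1952). The sketch below shows such a check on an in-memory header; looksGzipped is a hypothetical helper, and how ReadFX performs its own detection internally is not covered here.

```nim
proc looksGzipped(header: string): bool =
  ## True when the data starts with the gzip magic bytes 0x1f 0x8b.
  header.len >= 2 and header[0] == '\x1f' and header[1] == '\x8b'

echo looksGzipped("\x1f\x8b\x08")   # typical start of a .gz file
echo looksGzipped("@read_001")      # plain FASTQ text
```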

Next steps

Parsing methods

Detailed comparison of all four iterators with benchmarks and use-case guidance.

Data structures

Full type definitions for FQRecord, FQPair, SeqComp, Bufio, and Interval.

Utilities reference

Every sequence and quality procedure with signatures and examples.

API index

Generated symbol index for the full library — searchable by name.