What are FASTA and FASTQ?
FASTA is the simplest sequence format. Each record is a header line
(starting with >) followed by the nucleotide or amino-acid sequence,
which usually sits on one line but may be wrapped across several.
>read_001 some optional comment
ACGTACGTACGTACGTACGT
FASTQ extends FASTA with per-base quality scores. Every record spans four lines:
@read_001 some optional comment
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
Line 3 starts with + and is usually a bare +, although some files repeat the
read name after it. Line 4 encodes Phred quality scores as ASCII characters
(Phred+33): subtract 33 from the character code to get the integer score
(e.g., 'I' = ASCII 73 − 33 = Q40, meaning a 1-in-10,000 base-call
error probability). ReadFX handles the conversion for you.
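The arithmetic above can be checked with a few lines of plain Nim. This is only a sketch using the standard library; phredScore and errorProbability are hypothetical helper names for illustration and are not part of ReadFX.

```nim
import std/math

# Hypothetical helper: decode one Phred+33 quality character.
proc phredScore(c: char): int =
  ord(c) - 33

# Hypothetical helper: turn a Phred score into an error probability,
# using P = 10^(-Q/10).
proc errorProbability(q: int): float =
  pow(10.0, -q.float / 10.0)

let q = phredScore('I')
echo "Q", q                          # Q40
echo "P(error) = ", errorProbability(q)
```

A score of Q20 corresponds to a 1-in-100 error probability and Q30 to 1-in-1,000, which is why thresholds such as 20 are common in trimming.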
Files are almost always gzip-compressed (.fastq.gz). ReadFX
decompresses transparently — you don't need to gunzip first.
Installation
Install the library via Nimble:
nimble install readfx
Or add it to your project's .nimble file:
requires "readfx >= 0.2.0"
Then in your source:
import readfx
That single import brings in all types, parsers, and utilities.
Reading your first file
The easiest way to iterate over records is readFQ. It yields one
FQRecord per sequence — think of it as a Nim for loop
over the file.
import readfx
for record in readFQ("sample.fastq.gz"):
  echo record.name, " (", record.sequence.len, " bp)"
readFQ works equally well on FASTA files and on uncompressed files. To read from
standard input, pass "-" as the path:
for record in readFQ("-"):
  echo record.name
Tip: add --opt:speed --gc:arc to your compile command.
This can double throughput with no code changes.
Understanding the FQRecord
Every record returned by readFQ is an FQRecord object
with four string fields:
| Field | Type | Description |
|---|---|---|
| name | string | Sequence identifier (everything up to the first space on the header line) |
| comment | string | Optional free-text after the name (may be empty) |
| sequence | string | Nucleotide or amino-acid sequence |
| quality | string | Phred quality string (empty for FASTA records) |
import readfx
for record in readFQ("sample.fastq.gz"):
  echo "Name: ", record.name
  echo "Comment: ", record.comment
  echo "Sequence: ", record.sequence
  echo "Length: ", record.sequence.len
  echo "Quality: ", record.quality # empty if FASTA
  echo "---"
Because the fields are plain Nim strings, you can use the entire Nim standard
library on them — strutils, sequtils, regex, etc.
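To make the name/comment split concrete, here is a rough sketch of how a header line could be divided at the first space using only strutils. splitHeader is a hypothetical helper written for this illustration; it is not the ReadFX parser, which may treat tabs or other whitespace differently.

```nim
import std/strutils

# Hypothetical helper: split a FASTA/FASTQ header line into name and comment.
proc splitHeader(header: string): tuple[name, comment: string] =
  # Drop the leading '>' or '@' marker if present.
  let body =
    if header.len > 0 and header[0] in {'>', '@'}: header.substr(1)
    else: header
  # Everything up to the first space is the name; the rest is the comment.
  let parts = body.split(' ', maxsplit = 1)
  result.name = parts[0]
  result.comment = if parts.len > 1: parts[1] else: ""

let h = splitHeader("@read_001 some optional comment")
echo h.name      # read_001
echo h.comment   # some optional comment
```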
Working with sequences
ReadFX exports a collection of sequence-manipulation procedures that operate
directly on FQRecord objects (or plain strings).
Reverse complement
let rc = revCompl("ATGCCC") # → "GGGCAT"
var rec: FQRecord = ...
revCompl(rec) # modify in place (also reverses quality)
let copy = revCompl(rec) # get a new reversed copy
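For readers new to the operation, the following standalone sketch shows what a reverse complement does to a plain string. reverseComplement and complementBase are hypothetical names for this example; the ReadFX implementation may handle ambiguity codes and lowercase bases differently.

```nim
# Hypothetical helper: complement of a single base (anything unknown becomes 'N').
proc complementBase(c: char): char =
  case c
  of 'A': 'T'
  of 'T': 'A'
  of 'C': 'G'
  of 'G': 'C'
  else: 'N'

# Hypothetical helper: complement every base and reverse the order.
proc reverseComplement(s: string): string =
  result = newString(s.len)
  for i, c in s:
    result[s.high - i] = complementBase(c)

echo reverseComplement("ATGCCC")  # GGGCAT
```

When a whole record is reverse-complemented, the quality string is reversed as well (but not complemented), so that each score stays aligned with its base.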
GC content
let gc = gcContent(record) # returns a float, 0.0–1.0
echo "GC: ", (gc * 100).int, "%"
Nucleotide composition
let comp = composition(record)
echo "A=", comp.A, " C=", comp.C, " G=", comp.G, " T=", comp.T
echo "N=", comp.N, " GC=", comp.GC
Quality trimming
Remove low-quality bases from the 3′ end. The record's sequence
and quality strings are both shortened in place.
var r = record # make a mutable copy
qualityTrim(r, 20) # trim bases with Phred score < 20
echo r.sequence.len # may be shorter now
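As a mental model, 3′ trimming can be pictured as walking back from the end of the read while the scores stay below the threshold. The sketch below assumes exactly that simple rule and Phred+33 encoding; trimTail is a hypothetical helper on plain strings, and the rule ReadFX's qualityTrim actually applies may differ.

```nim
# Hypothetical helper: drop trailing bases whose Phred score is below minQual.
# Assumes sequence and quality have the same length.
proc trimTail(sequence, quality: var string, minQual: int) =
  var keep = quality.len
  # Walk back from the 3' end while the last kept base is below the threshold.
  while keep > 0 and ord(quality[keep - 1]) - 33 < minQual:
    dec keep
  sequence.setLen(keep)
  quality.setLen(keep)

var s = "ACGTACGT"
var q = "IIIII5+#"   # last three scores: Q20, Q10, Q2
trimTail(s, q, 20)
echo s               # ACGTAC
echo q               # IIIII5
```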
Masking low-quality bases
maskLowQuality(r, 20) # replace Q<20 with 'N'
maskLowQuality(r, 15, maskChar='X') # or any other character
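Unlike trimming, masking keeps the read length and only overwrites individual bases. A minimal sketch of the idea on plain strings, assuming Phred+33 encoding; maskBelow is a hypothetical helper and not the ReadFX implementation.

```nim
# Hypothetical helper: overwrite every base whose Phred score is below minQual.
proc maskBelow(sequence: var string, quality: string, minQual: int,
               maskChar = 'N') =
  for i in 0 ..< min(sequence.len, quality.len):
    if ord(quality[i]) - 33 < minQual:
      sequence[i] = maskChar

var s = "ACGTACGT"
maskBelow(s, "II#IIII+", 20)   # positions 2 and 7 are below Q20
echo s                         # ACNTACGN
```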
Subsequence extraction
let first50 = subSequence(record, 0, 50) # first 50 bases
let fromPos = subSequence(record, 10) # from base 10 to end
let trimmed = trimStart(record, 5) # remove 5 bases from 5' end
let clipped = trimEnd(record, 3) # remove 3 bases from 3' end
Full worked example
Putting it all together: read a file, filter short reads, trim quality, report per-record statistics.
import readfx, strutils
const minLength = 50
const minQual = 20
for record in readFQ("sample.fastq.gz"):
  # Skip short sequences
  if record.sequence.len < minLength:
    continue
  # Work on a mutable copy
  var r = record
  qualityTrim(r, minQual)
  # Skip reads that became too short after trimming
  if r.sequence.len < minLength:
    continue
  let comp = composition(r)
  let avg = avgQuality(r)
  echo r.name,
    "\tlen=", r.sequence.len,
    "\tGC=", (comp.GC * 100).formatFloat(ffDecimal, 1), "%",
    "\tQ=", avg.formatFloat(ffDecimal, 1)
Paired-end reads
Illumina sequencing typically produces two files per sample: R1 (forward reads)
and R2 (reverse reads). The readFQPair iterator reads both files
in lockstep, yielding an FQPair containing both mates.
import readfx
for pair in readFQPair("sample_R1.fastq.gz", "sample_R2.fastq.gz"):
  echo "R1: ", pair.read1.name, " (", pair.read1.sequence.len, " bp)"
  echo "R2: ", pair.read2.name, " (", pair.read2.sequence.len, " bp)"
If the files have different numbers of records, an IOError is raised.
Enable name validation to catch mismatched or scrambled files:
for pair in readFQPair("R1.fastq.gz", "R2.fastq.gz", checkNames = true):
  # raises ValueError if read names don't match
  # strips /1 /2 suffixes automatically before comparing
  discard
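The kind of comparison that name validation implies can be sketched as follows. This is an illustration only: stripMateSuffix and matesMatch are hypothetical helpers, and the exact normalisation ReadFX performs when checkNames is enabled is not shown here.

```nim
import std/strutils

# Hypothetical helper: drop a trailing /1 or /2 mate suffix from a read name.
proc stripMateSuffix(name: string): string =
  if name.endsWith("/1") or name.endsWith("/2"):
    name.substr(0, name.len - 3)
  else:
    name

# Hypothetical helper: do two read names refer to the same fragment?
proc matesMatch(name1, name2: string): bool =
  stripMateSuffix(name1) == stripMateSuffix(name2)

echo matesMatch("read_001/1", "read_001/2")  # true
echo matesMatch("read_001/1", "read_002/2")  # false
```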
Choosing the right parser
ReadFX exposes three single-file parsers. Here's the mental model:
| Iterator | Returns | When to use |
|---|---|---|
| readFQ | FQRecord | Start here. Yields safe Nim strings; records can be stored in sequences, passed to functions, etc. |
| readFQPtr | FQRecordPtr | Processing tens of millions of reads where allocation overhead is measurable. Do not store pointers across iterations. |
| readFastx | fills FQRecord | Custom I/O workflows, e.g. interleaving reads from two different sources in a single loop. |
Warning: the memory behind FQRecordPtr fields is reused on every
iteration. Converting to a Nim string with $record.name is safe,
but storing the raw pointer is not. If in doubt, use readFQ.
Performance tips
- Compile with --opt:speed --gc:arc for best throughput. ARC's deterministic memory management eliminates GC pauses.
- Prefer readFQPtr when you only need a few fields (e.g., just the sequence) and don't need to store records.
- Avoid re-allocating inside tight loops; compute expensive operations outside of the hot path when possible.
- The library auto-detects gzip vs. plain text; no penalty for checking.