We will use a simple approach to inspect the mapping of reads to a reference genome. Creating manually reads to map, we will be able to better understand what a mapper can do for us.
When we need to feed input files or output files, to be sure we are feeding the right files, and finally to check that the output files makes sense.
A mapping file will need as inputs:
The output will be a SAM file (or directly a BAM file), which is a text file that contains the alignment of the reads to the reference genome.
How can we inspect a file?
less -S
to see if the format looks reasonable (or gzip -dc FILE.gz | less -S
if the file is compressed)seqfu
for sequences, samtools
for BAM files)mamba create -n mapping \
-c conda-forge -c bioconda \
bwa samtools seqfu visidata
bwa
is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.samtools
is a suite of programs for interacting with SAM files, including to convert SAM to BAM and viceversa.seqfu
is a tool to manipulate sequences, including reverse complement, translation, and more.vd
)As usual, to be able to use these tools we need to:
conda activate mapping
we used the reference from the learn_bash repository:
git clone https://github.com/telatin/learn_bash.git
We will use a small dataset to test the mapping. We can systematically create such datasets, but to make an easier start we can do this manually, creating a valid FASTA file
vir_genome.fna
fileand so and forth. When done let’s save the file as small_reads.fasta
.
>Read1
GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTC
>Read2
CTGAAACGGGATATCATCAAAGCCATGAACAAAGCAGCCGCGCTGGATGAACTGATACCGGGGTTGCTGAG
>Read3_largegap
CGAACAGTCAGGTTAACAGGCTGCGGCATTTTGTCGGCTCTCCGACGTTCCTGGGTGACAAGCGTATTGAAG
>Read2_insert
GAAACGGGATATCATCAAAGCCATGAACAAAGCatcgAGCCGCGCTGGATGAACTGATACCGGGGTTGCTGAG
...
All this reads are taken from the reference, in the forward strand. To also check the effect of query sequences in the opposite strand we can append at the bottom of the file the reverse complement of the sequences:
seqfu rc small_reads.fasta >> small_reads_rc.fasta
The final file is available here.
We can use seqfu
to check the sequences in the file:
seqfu stats small_reads.fasta
# Check how long each read is
seqfu cat --add-len small_reads.fasta | grep ">"
With bwa we need to create an index of the reference genome:
bwa index path/to/vir_genome.fna
And then we can map the reads. bwa output will be a SAM output, that we will redirect to a file.
bwa mem path/to/vir_genome.fna small_reads.fasta > small.sam
Let’s start with less -S
:
less -S small.sam
We notice that there are some header lines starting by @
.
These are metadata lines that describe the reference genome and the reads.
In particular you will see one line for each reference sequence (in our case we have only one),
that stores its length:
@SQ SN:NC_001416.1 LN:48502
Then we have a tabular output with these columns:
To view the main part of the SAM file as a table we can create a header, and then append the content of the sam file (excluding the header!) at the bottom:
echo -e "QNAME\tFLAG\tRNAME\tPOS\tMAPQ\tCIGAR\tRNEXT\tPNEXT\tTLEN\tSEQ\tQUAL" > sam_table.tsv
grep -v "^@" small.sam >> sam_table.tsv
vd sam_table.tsv
We should now see the SAM in a table format, where we can sort the columns, filter, and more.
There’s plenty of material on the SAM format:
To try with a bigger dataset:
We can quickly notice that:
Using samtools
we can convert the SAM file to a BAM file:
# Convert SAM to BAM: -b means create a "bam" file as output
samtools view -b fake.sam > fake.bam
We can sort the output file by position:
samtools sort -o fake.sorted.bam fake.bam
samtools index fake.sorted.bam
And then we can get some statistics from the alignment:
samtools flagstat fake.sorted.bam
Using the pipe we can directly create a sorted bam:
bwa mem path/to/vir_genome.fna small_reads.fasta | samtools sort --write-index -o piped.bam -