As you perform your Oxford Nanopore sequencing run, the sequence data are read as electrical signals ("squiggles"), which are converted to DNA bases in real time in a process known as basecalling.
Basecalling uses an algorithm to convert electrical signals into nucleotide sequences. Read more about basecalling here.
The software used by the sequencing computer, MinKNOW, has an inbuilt basecalling algorithm, which is good for analysing your data rapidly.
However, we have specific questions and want to use a more accurate basecaller, Dorado, to basecall our raw data, which we can then analyse.
This worksheet will guide you through the steps of using Dorado to basecall your sequence data. Please follow the links on this page for your further analysis steps.
BASECALLING - DORADO
1. Familiarise yourself with the working environment
First, familiarise yourself with your data output and the organisation of your files. Have a look on the server in your run output folder (e.g. AINP001), using ‘ls’ to list the contents of directories.
Your run output folder should look something like this:
The basecalled reads, which MinKNOW generated using a speedy (less accurate) basecaller, are within the fastq_pass and fastq_fail folders (standard QC in basecalling has sorted them into passed and failed reads based on their quality).
The non-basecalled, raw electrical signals are saved in pod5 files and can be found in the folder pod5.
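A quick way to check that the folders hold what you expect is to count the files in each one. This is only a minimal sketch: count_files is a made-up helper name, and the file extensions in the usage lines (pod5, fastq.gz) are assumptions about how your run was saved.

```shell
# count_files DIR EXT: print how many files ending in .EXT sit under DIR.
# (count_files is a hypothetical helper, not part of MinKNOW or Dorado.)
count_files() {
    find "$1" -type f -name "*.$2" | wc -l | tr -d ' '
}

# Example usage from inside your run output folder:
#   count_files pod5 pod5
#   count_files fastq_pass fastq.gz
```

If the pod5 count is zero, check that you are in the right run output folder before moving on.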
2. Basecalling using Dorado
Navigate to the folder containing data from MinKNOW. The folder will look similar to the image below, and its name will contain information on the date of the run and the flow cell used.
- Do not change the name of this folder; it will help us to trace results back to each Nanopore run.
Within this folder, create a folder for the new basecalling output:
mkdir dorado_sup_basecall
mkdir = make a new directory
Move into your new directory
cd dorado_sup_basecall
cd = change directory
Use dorado to start basecalling
Note: Basecalling is a slow process that takes a lot of memory. It may not run if other large processes are running on the server at the same time.
dorado basecaller \
--min-qscore 10 \
--kit-name SQK-NBD114-96 \
sup ../pod5 > Filename_output.bam
Note:
- The kit-name in your sample sheet will need to match the exact kit index that dorado is looking for, from the following list: EXP-NBD103 EXP-NBD104 EXP-NBD114 EXP-NBD114-24 EXP-NBD196 EXP-PBC001 EXP-PBC096 SQK-16S024 SQK-16S114-24 SQK-LWB001 SQK-MLK111-96-XL SQK-MLK114-96-XL SQK-NBD111-24 SQK-NBD111-96 SQK-NBD114-24 SQK-NBD114-96 SQK-PBK004 SQK-PCB109 SQK-PCB110 SQK-PCB111-24 SQK-PCB114-24 SQK-RAB201 SQK-RAB204 SQK-RBK001 SQK-RBK004 SQK-RBK110-96 SQK-RBK111-24 SQK-RBK111-96 SQK-RBK114-24 SQK-RBK114-96 SQK-RLB001 SQK-RPB004 SQK-RPB114-24 TWIST-16-UDI TWIST-96A-UDI VSK-PTC001 VSK-VMK001 VSK-VMK004 VSK-VPS001
- The above code uses the ‘sup’ basecalling algorithm, which is the most accurate but also the slowest.
This creates a single file in BAM format that contains all of your sequence reads. Each read has a header line with information about it, including the barcode found in that read.
We can use this file to split the data into one file per Nanopore barcode, separating the pooled isolates from one another.
This process is known as DEMULTIPLEXING.
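Before demultiplexing, it can be useful to see how many reads were assigned to each barcode. The sketch below tallies barcode tags from SAM-formatted text. It assumes that Dorado stores the barcode in a tag beginning with BC:Z: (check a few reads with samtools view to confirm this for your version) and that samtools is installed; count_barcodes is a hypothetical helper name.

```shell
# count_barcodes: read SAM-formatted text on standard input and print
# the number of reads per barcode, most frequent first.
# Assumption: the barcode sits in an optional tag starting with BC:Z:
count_barcodes() {
    awk -F'\t' '
    /^@/ { next }                       # skip SAM header lines if present
    {
        bc = "unclassified"             # label for reads without a BC tag
        for (i = 12; i <= NF; i++) {    # optional tags start at column 12
            if ($i ~ /^BC:Z:/) { bc = substr($i, 6) }
        }
        count[bc]++
    }
    END { for (b in count) print count[b], b }' | sort -k1,1nr
}

# Example usage (samtools view prints the reads as SAM text):
#   samtools view Filename_output.bam | count_barcodes
```

Barcodes you did not use should show few or no reads here.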
3) Demultiplexing reads using Dorado
Demultiplexing splits your sequence data (individual reads) into separate files so that each isolate (with its own unique Nanopore barcode) has all of its data in one file/folder.
dorado demux --output-dir ./classified_demux --no-classify ./
Within this folder (classified_demux), you will see individual bam files for each barcode.
Like below:
You will see barcodes that you did not use. Do not worry: these files are likely empty or contain rubbish and can be ignored.
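To spot the files that belong to unused barcodes, you can list each BAM file with its size. This is a minimal sketch: list_bam_sizes is a hypothetical helper name, and file size is only a rough stand-in for the number of reads (samtools view -c FILE.bam gives an exact count).

```shell
# list_bam_sizes DIR: print "size_in_bytes filename" for every .bam file
# in DIR, smallest first, so near-empty barcodes appear at the top.
list_bam_sizes() {
    for f in "$1"/*.bam; do
        [ -e "$f" ] || continue         # nothing to do if no .bam files match
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$(basename "$f")"
    done | sort -n
}

# Example usage from the folder where you ran dorado demux:
#   list_bam_sizes classified_demux
```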
4) Map your reads (bam files previously created) to a reference genome
I have copied reference genomes into your server account, within the folder ‘genomes’; see below:
Note: you may have one or many of these files in your genomes folder. All that you need for mapping is the GENOMENAME.fasta file.
You are currently in the ‘dorado_sup_basecall’ directory, which contains your barcode-sorted basecalled reads.
It is a good idea to create a separate directory for the mapped reads.
Move up a directory
cd ../
Create a new directory for mapped reads
mkdir mapping
Move into the mapping directory
cd mapping
- Use Minimap2 to align your individual reads to the reference genome
Minimap2 is a sequence aligner that is recommended for aligning long reads created by Oxford Nanopore sequencing technologies.
Here is the reference for Minimap2, and here is a tutorial page.
Minimap2 should already be installed in your server account. Please let me know if there are issues running it.
minimap2 reads FASTA or FASTQ files, not BAM files, so first convert the BAM file outputted by Dorado into FASTQ:
bedtools bamtofastq -i ~/PATH/TO/BARCODED/BAMs/classified_demux/XXX_barcode01.bam -fq barcode01.fastq
Then align the FASTQ file to the reference genome:
minimap2 -ax map-ont --secondary=no ~/PATH/TO/GENOME/REFERENCE.fa barcode01.fastq > barcode01_aligned.sam
You will have to do this individually for each barcode file. Remember to change the names of the input and output files accordingly.
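Rather than editing the command by hand for every barcode, you can generate one mapping command per demultiplexed file. The sketch below only prints the commands so that you can check them first. It is not the required way to do this: map_all_barcodes is a hypothetical helper name, it assumes your demultiplexed files have "barcode" followed by digits in their names, and it uses samtools fastq for the BAM-to-FASTQ conversion so that the reads can be piped straight into minimap2.

```shell
# map_all_barcodes REF DEMUX_DIR: print one mapping command per barcode BAM.
# Nothing is executed here; the commands are only printed for checking.
map_all_barcodes() {
    ref=$1
    demux_dir=$2
    for bam in "$demux_dir"/*barcode*.bam; do
        [ -e "$bam" ] || continue       # skip if no barcode files match
        # Pull "barcodeNN" out of the file name, e.g. XXX_barcode01.bam
        name=$(basename "$bam" .bam | sed 's/.*\(barcode[0-9]*\).*/\1/')
        echo "samtools fastq $bam | minimap2 -ax map-ont --secondary=no $ref - > ${name}_aligned.sam"
    done
}

# Example usage from inside the mapping directory:
#   map_all_barcodes ~/PATH/TO/GENOME/REFERENCE.fa ../dorado_sup_basecall/classified_demux
```

Once the printed commands look right, add "| sh" to the end of the usage line to run them. This assumes that none of your paths contain spaces.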
- Format your individual aligned files
Alignment files are outputted as SAM files, which are large. It is best to convert these to BAM files and then DELETE YOUR SAM FILES.
All of the following steps will utilise samtools, a useful package for working with SAM and BAM files after sequencing.
- Convert SAM file to BAM file
samtools view -Sb -o barcode01_aligned.bam barcode01_aligned.sam
-Sb indicates that the input is a SAM file and the output is a BAM file.
Change the name of each file according to the sample you are working on.
- Sort your BAM file
Sorting reorders the reads in your alignment based on their position when aligned to the reference genome
samtools sort -O bam -o barcode01_aligned.sorted.bam barcode01_aligned.bam
-O bam specifies the output as a BAM file; -o specifies the name of the output file.
- Index your sorted BAM file
Indexing creates a ‘contents page’ of the aligned file, which is needed by many downstream applications, including Tablet
The index file created will be called FILENAME.bam.bai
samtools index barcode01_aligned.sorted.bam
Check for generation of the indexed file using ‘ls’ to list all files in the current directory
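The three samtools steps above can be chained so that the large SAM file is deleted only if conversion, sorting and indexing all succeed. This is a minimal sketch using the same samtools commands as above; sam_to_sorted_bam is a hypothetical helper name, and it assumes your files follow the PREFIX.sam naming used in this worksheet.

```shell
# sam_to_sorted_bam PREFIX: turn PREFIX.sam into PREFIX.sorted.bam plus its
# index, then remove PREFIX.sam. The && chain stops at the first failure,
# so the SAM file is kept if any step goes wrong.
sam_to_sorted_bam() {
    prefix=$1
    samtools view -b -o "$prefix.bam" "$prefix.sam" &&
    samtools sort -O bam -o "$prefix.sorted.bam" "$prefix.bam" &&
    samtools index "$prefix.sorted.bam" &&
    rm "$prefix.sam"
}

# Example usage:
#   sam_to_sorted_bam barcode01_aligned
```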
6) Checking mapping statistics / QC
Check mapping statistics using samtools flagstat
samtools flagstat FILENAME.bam
This will output the number of reads and the percentage of reads mapping to the reference genome, which gives a basic measure of how much P. knowlesi data you have.
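If you want only the mapped percentage, for example to compare barcodes side by side, you can pull it out of the flagstat output. This sketch assumes the summary line has the form "950 + 0 mapped (95.00% : N/A)", which may differ between samtools versions; mapped_percent is a hypothetical helper name.

```shell
# mapped_percent: read "samtools flagstat" output on standard input and
# print the percentage of reads that mapped.
# Assumption: the line of interest has "mapped" as its fourth word,
# followed by the percentage in brackets.
mapped_percent() {
    awk '$4 == "mapped" { gsub(/[(%]/, "", $5); print $5 }'
}

# Example usage:
#   samtools flagstat FILENAME.bam | mapped_percent
```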
Genome coverage plots - to do after Easter!
7) Visualising your genome in Tablet
- Install Tablet
Once Tablet is installed, you will need to copy your BAM file (FILENAME.bam) and its index file (FILENAME.bam.bai) over to the PC where Tablet is installed.
First, copy over the BAM file:
scp USERNAME@10.18.0.25:FOLDER/FILE/PATH/TO/BAMFILE/FILENAME.bam ./
Then, copy over the index file:
scp USERNAME@10.18.0.25:FOLDER/FILE/PATH/TO/BAMFILE/FILENAME.bam.bai ./
Now you can open Tablet on your local PC and load the relevant BAM file and your reference genome. The reference genome is already downloaded onto your PC!
8) Calling variants and looking for SNPs / INDELs
- TO DO!!!