Using Nanopore sequence data to check specific genes
This page will guide you through the steps to check the sequence of specific genes in a parasite line.
The basic steps involve:
1. Basecalling your sequence data using Dorado
2. Demultiplexing your reads into FASTQ sequence files for each barcode
3. Mapping individual FASTQ files to your reference genome
4. Searching individual genes for coverage / SNPs
1. Basecalling using Dorado
Dorado should already be installed in your server account - please ask me to have a look if you have any issues running Dorado
i) Migrate to the folder containing data from MinKNOW
The folder will look similar to the image below, and its name will contain the date of the run and the flow cell used.
- Do not change the name of this folder; it helps us to trace results back to each Nanopore run.
ii) Within this folder, create a folder for the new basecalling output
mkdir dorado_sup_basecall
mkdir = make a new directory
iii) Migrate into your new directory
cd dorado_sup_basecall
cd = change directory
iv) Use dorado to start basecalling
Note: Basecalling is a slow process that uses a lot of memory. It may not run if other large processes are running on the server at the same time.
dorado basecaller \
--min-qscore 10 \
--kit-name SQK-NBD114-96 \
sup ../pod5 > Filename_output.bam
Note:
- Change --kit-name to match the kit you used, for example a native barcoding (NBD) or a rapid barcoding (RBK) kit.
- The above code uses the ‘sup’ basecalling algorithm, which is the most accurate but also the slowest.
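If you want the kit name and output name set in one place, the basecalling command can be wrapped in a small function. This is only a sketch: the variable names (KIT, MODEL, POD5_DIR, OUT), the function name basecall_run and the check for the pod5 folder are assumptions added for illustration, not part of Dorado.

```shell
# Sketch only: these variable names are illustrative; edit the values for your run.
KIT="SQK-NBD114-96"          # change to the kit you actually used
MODEL="sup"                  # most accurate, slowest model
POD5_DIR="../pod5"           # raw data folder, as in the command above
OUT="Filename_output.bam"    # combined basecalled BAM file

basecall_run() {
    # Stop early with a clear message if the raw data folder is missing.
    if [ ! -d "$POD5_DIR" ]; then
        echo "Cannot find $POD5_DIR - are you inside dorado_sup_basecall?" >&2
        return 1
    fi
    dorado basecaller --min-qscore 10 --kit-name "$KIT" "$MODEL" "$POD5_DIR" > "$OUT"
}
```

Run basecall_run from inside the dorado_sup_basecall folder once the values at the top are correct.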
This creates a single file in BAM format that contains all of your sequence reads. Each read has a header line with information, including the barcode detected in that read.
We can use this file to split the data into one file per Nanopore barcode, separating our isolates into their own pools.
This process is known as DEMULTIPLEXING
2) Demultiplexing reads
i) Use dorado demux to demultiplex
dorado demux --output-dir ./classified_demux --no-classify ./
Within this folder (classified_demux), you will see an individual BAM file for each barcode.
Like below:
You will see barcodes that you did not use. Do not worry: these files are likely empty or rubbish and can be ignored.
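A quick way to spot the barcodes that are not worth looking at is to list the files by size. The sketch below assumes the output folder is called classified_demux as above; list_bam_sizes is a made-up helper name.

```shell
# Sketch: list each demultiplexed BAM with its size in bytes, smallest first.
# list_bam_sizes is a hypothetical helper; pass a folder or use the default.
list_bam_sizes() {
    dir="${1:-./classified_demux}"
    for f in "$dir"/*.bam; do
        [ -e "$f" ] || continue              # skip if no BAM files are present
        size=$(wc -c < "$f" | tr -d ' ')     # file size in bytes
        printf '%s\t%s\n' "$size" "$f"
    done | sort -n
}
```

Files for barcodes you did not use should appear at the top of the list with very small sizes.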
ii) Format your demultiplexing output
First create a folder for the barcodes that you have used
mkdir barcodes_used
> Create a list within this folder of the names of the barcodes that you have used (barcodes_used.txt)
An example of this file is:
We will use vim, a text editor, to create this file
vim barcodes_used.txt
This will open a blank text document. Press ‘i’ to enter insert mode, then type your text.
List each barcode used, in the correct format (matching the end of your filenames), on its own line. Do not leave any whitespace.
To exit vim:
1. Press Esc
2. Type ‘:wq’
3. Press ENTER
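Stray spaces or blank lines in this file are easy to miss by eye, so it can help to check it after saving. The function below is a sketch with a made-up name (check_barcode_list); it prints any line that is empty or contains whitespace, together with its line number.

```shell
# Sketch: report lines in the barcode list that are empty or contain spaces/tabs.
# check_barcode_list is a hypothetical helper, not part of any tool used on this page.
check_barcode_list() {
    list="${1:-barcodes_used.txt}"
    if grep -n -E '^$|[[:space:]]' "$list"; then
        echo "Fix the lines shown above in $list" >&2
        return 1
    fi
    echo "$list looks fine"
}
```

For example, run check_barcode_list barcodes_used.txt from the folder that contains the list.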
> Move your files into the barcodes_used folder
cat ./barcodes_used/barcodes_used.txt | parallel -j 1 "mv ./PATH/TO/FILE/{}.bam ./barcodes_used"
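If GNU parallel is not available, or you would like a warning when a listed barcode file cannot be found, a plain loop does the same job. This is a sketch only: move_used_barcodes is a made-up name, and the three arguments stand in for the list file, the folder the BAM files are currently in (./PATH/TO/FILE above) and the destination folder.

```shell
# Sketch: move each listed BAM into the destination folder, warning about missing files.
# move_used_barcodes LIST SRC_DIR DEST_DIR  (hypothetical helper)
move_used_barcodes() {
    list="$1"; src="$2"; dest="$3"
    # The "|| [ -n ... ]" part also handles a last line with no trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue           # ignore blank lines
        if [ -f "$src/$name.bam" ]; then
            mv "$src/$name.bam" "$dest/"
        else
            echo "Missing: $src/$name.bam" >&2
        fi
    done < "$list"
}
```

For example: move_used_barcodes ./barcodes_used/barcodes_used.txt ./PATH/TO/FILE ./barcodes_used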
3) Map your reads (the BAM files previously created) to a reference genome
I have copied reference genomes into your server account, within the folder ‘genomes’, see below:
i) Create a mapping directory
You are currently in the ‘dorado_sup_basecall’ directory, which contains your barcode-sorted basecalled reads
It is a good idea to create a separate directory for the mapped reads
a) move up a directory
cd ../
b) create a new directory for mapped reads
mkdir mapping
c) migrate to the mapping directory
cd mapping
ii) Use Minimap2 to align your individual reads to the reference genome
Minimap2 is a sequence aligner that is recommended for aligning long reads created by Oxford Nanopore sequencing technologies.
Here is the reference for Minimap2, and here is a tutorial page.
Minimap2 should already be installed in your server account. Please let me know if there are issues running Minimap2.
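samtools fastq ~/PATH/TO/BARCODED/BAMs/classified_demux/XXX_barcode01.bam | minimap2 -ax map-ont ~/PATH/TO/GENOME/REFERENCE.fa - > barcode01_aligned.sam
Minimap2 reads FASTA/FASTQ files rather than BAM files, so samtools fastq converts the BAM file to FASTQ and passes it straight to Minimap2 (the ‘-’ tells Minimap2 to read from that stream).
You will have to do this individually for each barcode file. Remember to change the names of the input and output files accordingly.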
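To avoid retyping the command for every barcode, the mapping can be put in a loop. The sketch below is one way to do it, under some assumptions: map_all_barcodes and its two arguments are made-up names, and each BAM file is converted to FASTQ with samtools fastq on the way in, because Minimap2 takes FASTA/FASTQ input rather than BAM.

```shell
# Sketch: map every BAM in a folder to the reference, writing one SAM file per barcode.
# map_all_barcodes REFERENCE.fa BAM_DIR  (hypothetical helper)
map_all_barcodes() {
    ref="$1"; bam_dir="$2"
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue                # nothing to do if the folder is empty
        sample=$(basename "$bam" .bam)           # e.g. XXX_barcode01
        # Convert the unaligned BAM to FASTQ and pipe it into minimap2 ("-" = stdin).
        samtools fastq "$bam" | minimap2 -ax map-ont "$ref" - > "${sample}_aligned.sam"
    done
}
```

Run it from inside the mapping directory, for example: map_all_barcodes ~/PATH/TO/GENOME/REFERENCE.fa ~/PATH/TO/BARCODED/BAMs/classified_demux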
iii) Format your individual aligned files
Alignment files are output as SAM files, which are large. It is best to convert these to BAM files and then DELETE YOUR SAM FILES.
All of the following steps will utilise samtools, a useful package for working with SAM and BAM files after sequencing.
a) convert SAM file to BAM file
samtools view -Sb -o barcode01_aligned.bam barcode01_aligned.sam
-Sb indicates that the input is a SAM file and the output is a BAM file.
Change the name of each file according to the sample you are working on.
b) sort your BAM file
Sorting reorders the reads in your alignment based on their position when aligned to the reference genome
samtools sort -O bam -o barcode01_aligned.sorted.bam barcode01_aligned.bam
-O bam specifies the output as a BAM file.
-o specifies the name of the output file.
c) index your sorted BAM file
Indexing creates a ‘contents page’ of the aligned file, which is needed by many downstream applications, including Tablet
The index file created will be called FILENAME.bam.bai
samtools index barcode01_aligned.sorted.bam
Check for generation of the indexed file using ‘ls’ to list all files in the current directory
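Steps a) to c) can be chained so that one sample is converted, sorted and indexed in a single go, with the large intermediate files removed only if everything worked. This is a sketch: process_alignment is a made-up name and it assumes your files follow the barcode01_aligned naming used above.

```shell
# Sketch: convert, sort and index one alignment, then remove the large intermediates.
# process_alignment PREFIX  (hypothetical helper), e.g. PREFIX = barcode01_aligned
process_alignment() {
    prefix="$1"
    samtools view -Sb -o "$prefix.bam" "$prefix.sam" &&
    samtools sort -O bam -o "$prefix.sorted.bam" "$prefix.bam" &&
    samtools index "$prefix.sorted.bam" &&
    rm "$prefix.sam" "$prefix.bam"      # only reached if every step above succeeded
}
```

For example, process_alignment barcode01_aligned should leave barcode01_aligned.sorted.bam and its index in the current directory.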
4) Checking mapping statistics / QC
Check mapping statistics using samtools flagstat
samtools flagstat FILENAME.bam
This will output the number of reads and the percentage of reads mapping to the reference genome, which gives a basic measure of how much P. knowlesi data you have
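If you only want the mapped percentage, for example to compare several barcodes, it can be pulled out of the flagstat text with awk. The helper below is a sketch with a made-up name (mapped_percent); it relies on flagstat printing a line of the form "N + 0 mapped (P% : N/A)".

```shell
# Sketch: print the percentage of mapped reads from samtools flagstat output.
# mapped_percent is a hypothetical helper; it reads flagstat text on standard input.
mapped_percent() {
    # The 4th field is exactly "mapped" only on the overall mapped line,
    # not on the "primary mapped" line; the 5th field looks like "(95.00%".
    awk '$4 == "mapped" { gsub(/[(%]/, "", $5); print $5; exit }'
}
```

Use it as: samtools flagstat FILENAME.bam | mapped_percent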
Genome coverage plots - to do after Easter!
5) Visualising your genome in Tablet
- Install Tablet
Once Tablet is installed, you will need to copy your BAM file (FILENAME.bam) and index file (FILENAME.bam.bai) onto the PC where Tablet is installed
First copy over the bam file…
scp USERNAME@10.18.0.25:FOLDER/FILE/PATH/TO/BAMFILE/FILENAME.bam ./
Then, copy over the index file:
scp USERNAME@10.18.0.25:FOLDER/FILE/PATH/TO/BAMFILE/FILENAME.bam.bai ./
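Both files can also be fetched with a single scp call, since scp accepts more than one source. The function below is a sketch to run on your local PC; fetch_bam_and_index is a made-up name, and the argument is the same USERNAME@server:path string used in the commands above.

```shell
# Sketch: copy a BAM file and its .bai index from the server into the current folder.
# fetch_bam_and_index REMOTE_BAM  (hypothetical helper)
fetch_bam_and_index() {
    remote="$1"     # e.g. USERNAME@10.18.0.25:FOLDER/FILE/PATH/TO/BAMFILE/FILENAME.bam
    scp "$remote" "$remote.bai" ./
}
```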
Now you can open Tablet on your local PC and load the relevant BAM file and your reference genome. The reference genome has already been downloaded onto your PC!
Below - For Amy to edit after easter!!
6) Calling variants and looking for SNPs / INDELs
- check the best variant caller for nanopore