This documentation was last modified on: June 24, 2019 at 21:54:42
Base calling of Nanopore data is notoriously slow when performed on CPUs, so there has been a large push towards GPU-based software. This document details benchmarking of the Guppy base caller for Oxford Nanopore sequencing data across two different environments.
The benchmarking performed here has been done on a desktop workstation (running Siduction) and a headless rack server (running CentOS 7.6).
The Linux workstation is a Lenovo ThinkStation P920 (running Siduction):
CPU: (2x) 12 core Intel Xeon Gold 5118 (48 threads)
RAM: 256GB
GFX: Nvidia Titan RTX
SSD: 1TB
HDD: 20TB
The rack server runs CentOS 7.6 (full specs to be confirmed):
CPU: Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz (64 threads)
RAM: 512GB (?)
GFX: (2x) Nvidia Tesla V100
Here is a quick feature breakdown of the two Nvidia cards used (Tesla V100 and Titan RTX):
Feature | Tesla V100 | Titan RTX |
---|---|---|
Pipelines (Cuda cores) | 5120 | 4608 |
Core clock speed | 1246 MHz | 1350 MHz |
Boost Clock | 1380 MHz | 1770 MHz |
Memory | 32GB | 24GB |
Transistor count | 21,100 million | 18,600 million |
Manufacturing process technology | 12 nm | 12 nm |
Power consumption (TDP) | 250 Watt | 280 Watt |
Nvidia drivers | 410.48 | 418.56 |
Cuda version | 10.0 | 10.1 |
Before adopting GPU-enabled callers such as Guppy, base calling was performed using CPUs, which would usually take more than a day per run.
Note: I can try to source actual numbers here, as that would be useful.
Specific configuration files are provided with Guppy that can be used to set optimal parameters for specific conditions.
I chose to use flip-flop fast calling for both environments:
dna_r9.4.1_450bps_fast.cfg
dna_r9.4.1_450bps_fast_prom.cfg
UPDATE [2019-06-21]: currently benchmarking using the high accuracy calling model (dna_r9.4.1_450bps_hac.cfg). The document will be updated with results as they become available.
UPDATE [2019-06-17]: I have recently learnt that configuration files are tailored to specific Nanopore machines (i.e. MinION, PromethION) and not to the graphics card set up. All guppy runs should therefore be performed using the 'standard' config file (dna_r9.4.1_450bps_fast.cfg). The benchmarking will be amended in this light. I will keep the current information, as run times shouldn't differ much between the configs; it's mainly the accuracy of the base calling that will be influenced.
Note: config file contents are included at the end of this document.
From what I can glean from the documentation, the second config file (dna_r9.4.1_450bps_fast_prom.cfg) is modified for V100 cards, so I thought it would make sense to run this config on those cards.
I will test the ‘base’ config file across both V100 cards as well to see if there is in fact any difference in calling speed resulting from different configurations.
TODO: look into the individual parameters and see if there is anything that we can tweak to eke out more performance. From what I've read, Nanopore only officially support about three GPU set ups with Guppy (V100, GeForce 1080/1080Ti, and the Jetson platform); however, they state that this is merely because of the potential for many different Nvidia/CUDA driver set ups across Linux distros. There are many cases of people using a wide range of cards successfully, though they don't seem to publish/talk about their config/parameter settings!
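As an aside on the tweaking: Guppy's config values can be overridden on the command line. A minimal sketch (these flags are the same ones used later in this document; the values shown are simply the PromethION-config defaults, not tuned recommendations):
# sketch: overriding config parameters at the CLI (illustrative values only)
guppy_basecaller \
-r -i raw_data/ -s out/ \
-c dna_r9.4.1_450bps_fast.cfg \
-x "cuda:0" --compress_fastq \
--num_callers 4 \
--gpu_runners_per_device 8 \
--chunks_per_runner 256 \
--chunk_size 1000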
This testing was performed on a recent MinION run (~24 hrs) that produced ~20 Gb of sequence (~150 GB of raw fast5 data). This would typically take longer than a day to base call using prior CPU-based methods.
run | CPUs | Titan RTX | single V100 | double V100 |
---|---|---|---|---|
[fast] test (small data) | ~50 secs | 0.6 secs | 1.5 secs | 1.7 secs |
[fast] full data | >1 day | 44.67 mins* | 44.88 mins | 45.20 mins |
[fast] split data+ (only V100s) | - | - | - | 23 mins |
[hac] full data | - | ~3hr 24mins | ~2hrs 22mins | - |
[hac] split data+ (only V100s) | - | - | - | ~1hr 18mins |
(note: fast = fast base calling, hac = high accuracy base calling, - = did not test)
* The first run using fast calling took 63.98 mins, but for some reason subsequent runs were significantly quicker. I've triple-checked the code and it is no different, other than that the first run had slightly less data (it didn't include the folder with the small test data set)!
+ We took a ‘brute force’ approach and split the data into two even sets, sending one to each V100.
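For reference, the minute values in the table are derived from the 'Caller time' (reported in milliseconds) in the Guppy logs below; e.g. for the third Titan RTX run:
# 2680343 ms -> minutes
awk 'BEGIN { printf "%.2f mins\n", 2680343 / 1000 / 60 }' # prints: 44.67 mins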
The great news is that we’re now base calling significantly faster than previously!
[fast mode] We have reduced base calling from >1 day down to ~45 mins on a single card. It turns out that guppy doesn't scale across multiple GPUs, but by splitting the data and sending smaller chunks off to each GPU we finished the run in 23 mins - so that's halved again!
[hac mode] We have now completed testing in high accuracy mode. As expected it takes longer to base call, however it is still very fast (see the table above for full details). We are currently seeing a run time of ~2hr 20mins using a single V100, which we are able to cut to around 1hr 20mins by splitting the data between two V100s. Interestingly, the Titan RTX is significantly slower in high accuracy mode (~3hrs 20mins), whereas it was slightly ahead of a single V100 in fast mode. This leads us to believe that high accuracy mode likely leverages more CUDA cores for the base calling - it would be great to have this confirmed.
So, all in all, this is a great increase in efficiency and will allow us to get to alignment and results much faster than before, with much less CPU overhead.
I believe that there is still a lot of tweaking that can be done to increase performance. I find it odd that the two V100s aren't scaling for a single job when I have read of different institutes scaling Guppy across >8 V100s; it turns out these places must just be doing what we've done above, 'manually' splitting the data and sharing it across the cards to get the speed increases.
The things that have me a little perplexed:
- guppy doesn't appear to be optimised for any card in particular.
- Is guppy hardcoded to work with X CUDA cores??

Below are log outputs for each 'run', with more detail for those who are interested.
# Titan RTX first run
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -i 20190606_0042_yersinia/fast5/ -r -s yersinia_basecalled/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: 20190606_0042_yersinia/fast5/
save path: yersinia_basecalled/
chunk size: 1000
chunks per runner: 20
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 20
Found 1214463 fast5 files to process.
Init time: 8112 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 3838880 ms, Samples called: 117589088085, samples/s: 3.06311e+07
Finishing up any open output files.
Basecalling completed successfully.
On this next run I forgot to add the flag to compress the output, and it resulted in doubling the time taken to complete base calling!
# Titan RTX second run (no compression)
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -r -i raw_data/ -s yersinia_basecalled_20190614/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0"
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: raw_data/
save path: yersinia_basecalled_20190614/
chunk size: 1000
chunks per runner: 20
records per file: 4000
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 20
Found 1215050 fast5 files to process.
Init time: 6899 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 5982105 ms, Samples called: 117598792614, samples/s: 1.96584e+07
Finishing up any open output files.
Basecalling completed successfully.
# Titan RTX third run (compression)
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -r -i raw_data/ -s yersinia_basecalled_20190614/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: raw_data/
save path: yersinia_basecalled_20190614/
chunk size: 1000
chunks per runner: 20
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 20
Found 1215050 fast5 files to process.
Init time: 7655 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2680343 ms, Samples called: 117598792614, samples/s: 4.38745e+07
Finishing up any open output files.
Basecalling completed successfully.
I'm really not sure how this keeps increasing in performance?? The following run took 41.65 mins…
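# Titan RTX fourth run (compression)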
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -r -i raw_data/ -s yersinia_basecalled_20190614_run2/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: raw_data/
save path: yersinia_basecalled_20190614_run2/
chunk size: 1000
chunks per runner: 20
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 20
Found 1215050 fast5 files to process.
Init time: 6335 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2496930 ms, Samples called: 117598792614, samples/s: 4.70974e+07
Finishing up any open output files.
Basecalling completed successfully.
This time around we’re giving the high accuracy base calling configuration a whirl:
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -r -i raw_data/ -s yersinia_basecalled_20190621_hac/ -c dna_r9.4.1_450bps_hac.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: raw_data/
save path: yersinia_basecalled_20190621_hac/
chunk size: 1000
chunks per runner: 1000
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 2
Found 1215050 fast5 files to process.
Init time: 12755 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 12227895 ms, Samples called: 117512178322, samples/s: 9.61017e+06
Finishing up any open output files.
Basecalling completed successfully.
So total calling time was ~3hr 24mins.
Looking to optimise based on information from this gist: https://gist.github.com/disulfidebond/00ff5a6f84a0a81057c6e5817c540569
These optimisations were made for a 2080 Ti, so obviously there is more headroom on a Titan RTX. I will try this run and compare it against the default config; that way we can begin to determine the best parameters to tweak for performance.
# default high accuracy mode
~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller \
-r -i raw_data/ \
-s yersinia_basecalled_20190621_hac/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:0" \
--compress_fastq
# param tweaking in high accuracy mode
~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller \
-r -i raw_data/ \
-s yersinia_basecalled_20190621_hac2/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:0" \
--compress_fastq \
--num_callers 14 --gpu_runners_per_device 8 \
--chunks_per_runner 768 --chunk_size 500
Run with parameter tweaks:
# param tweaking in high accuracy mode
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller \
> -r -i raw_data/ \
> -s yersinia_basecalled_20190621_hac2/ \
> -c dna_r9.4.1_450bps_hac.cfg \
> -x "cuda:0" \
> --compress_fastq \
> --num_callers 14 --gpu_runners_per_device 8 \
> --chunks_per_runner 768 --chunk_size 500
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: raw_data/
save path: yersinia_basecalled_20190621_hac2/
chunk size: 500
chunks per runner: 768
records per file: 4000
fastq compression: ON
num basecallers: 14
gpu device: cuda:0
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 11052 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 11162858 ms, Samples called: 117512178322, samples/s: 1.05271e+07
Finishing up any open output files.
There is a slight speed-up using the modified parameters: the data is now called in ~3hrs 6mins.
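The nvidia-smi snapshots below were captured while runs were in progress; to poll the GPU continuously during a run, something like this does the trick:
watch -n 5 nvidia-smi # refresh the GPU stats every 5 seconds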
miles@leviathan:/data/una/20190606_yersinia$ nvidia-smi
Fri Jun 14 11:52:42 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX On | 00000000:18:00.0 On | N/A |
| 90% 87C P2 190W / 280W | 988MiB / 24185MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2196 G cinnamon 179MiB |
| 0 17313 C .../guppy/ont-guppy/bin/./guppy_basecaller 533MiB |
| 0 29246 G /usr/lib/firefox/firefox 3MiB |
| 0 29272 G /usr/lib/firefox/firefox 3MiB |
| 0 31456 G ...-token=F43A3D74BB834EC5E6F306FFD3FF6D0F 45MiB |
| 0 48087 G /usr/lib/xorg/Xorg 209MiB |
+-----------------------------------------------------------------------------+
miles@leviathan:/data/una/20190606_yersinia$ nvidia-smi
Fri Jun 21 14:46:09 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX On | 00000000:18:00.0 On | N/A |
| 93% 87C P2 201W / 280W | 4681MiB / 24185MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2196 G cinnamon 186MiB |
| 0 29094 G /usr/lib/firefox/firefox 3MiB |
| 0 29246 G /usr/lib/firefox/firefox 3MiB |
| 0 29272 G /usr/lib/firefox/firefox 3MiB |
| 0 34972 G ...-token=70703895EF88CE4F3C785AA29591F246 57MiB |
| 0 46713 C .../guppy/ont-guppy/bin/./guppy_basecaller 4219MiB |
| 0 48087 G /usr/lib/xorg/Xorg 193MiB |
+-----------------------------------------------------------------------------+
The snapshot below was taken during the parameter-tweaked run (--num_callers 14 --gpu_runners_per_device 8 --chunks_per_runner 768 --chunk_size 500):
miles@leviathan:/data/una/20190606_yersinia$ nvidia-smi
Fri Jun 21 19:52:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX On | 00000000:18:00.0 On | N/A |
| 99% 88C P2 215W / 280W | 6849MiB / 24185MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2196 G cinnamon 186MiB |
| 0 6911 C .../guppy/ont-guppy/bin/./guppy_basecaller 6387MiB |
| 0 29094 G /usr/lib/firefox/firefox 3MiB |
| 0 29246 G /usr/lib/firefox/firefox 3MiB |
| 0 29272 G /usr/lib/firefox/firefox 3MiB |
| 0 34972 G ...-token=70703895EF88CE4F3C785AA29591F246 57MiB |
| 0 48087 G /usr/lib/xorg/Xorg 193MiB |
+-----------------------------------------------------------------------------+
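# Tesla V100 first run (single card, GPU0)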
screen -r 190473.pts-0.kscprod-data1
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled_GPU0 -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
input path: /scratch/20190606_yersinia/
save path: /scratch/yersinia_basecalled_GPU0
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 13980 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2706948 ms, Samples called: 117598792614, samples/s: 4.34433e+07
Finishing up any open output files.
Basecalling completed successfully.
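# Tesla V100 first run (single card, GPU1)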
screen -r 40715.pts-7.kscprod-data1
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled_GPU1 -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
input path: /scratch/20190606_yersinia/
save path: /scratch/yersinia_basecalled_GPU1
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:1
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 11944 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2693057 ms, Samples called: 117598792614, samples/s: 4.36674e+07
Finishing up any open output files.
Basecalling completed successfully.
This run was performed in /DSC/minion; the drive is NFS-mounted storage, so I/O might be a limitation.
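A rough way to sanity-check the write throughput of the mount (a generic sketch, not part of the benchmarking itself; the test file path is arbitrary):
# write 1 GiB with direct I/O to estimate sustained write speed
dd if=/dev/zero of=/DSC/minion/dd_test.bin bs=1M count=1024 oflag=direct
rm /DSC/minion/dd_test.bin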
# Tesla V100 (x2) first run (on /DSC/)
orac$ ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s yersinia_called/ -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:0 cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
input path: /scratch/20190606_yersinia/
save path: yersinia_called/
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0 cuda:1
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 14839 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2712127 ms, Samples called: 117598792614, samples/s: 4.33604e+07
Finishing up any open output files.
Basecalling completed successfully.
This run was performed in /scratch, which is an array of SSD storage (data is striped over 8 x 800GB SSDs), so I/O should be less of an issue.
# Tesla V100 (x2) first run (on /scratch/)
orac$ ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled/ -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:0 cuda:1" --compress_fastq
bash: ont-guppy/bin/./guppy_basecaller: No such file or directory
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled/ -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:0 cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
input path: /scratch/20190606_yersinia/
save path: /scratch/yersinia_basecalled/
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0 cuda:1
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 14839 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2736231 ms, Samples called: 117598792614, samples/s: 4.29784e+07
Finishing up any open output files.
Basecalling completed successfully.
This run was performed across both cards using the standard, non-PromethION-optimised configuration file (dna_r9.4.1_450bps_fast.cfg).
It appears as if there is no difference in time between the two config files:
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled_nonprom/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0 cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: /scratch/20190606_yersinia/
save path: /scratch/yersinia_basecalled_nonprom/
chunk size: 1000
chunks per runner: 20
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0 cuda:1
kernel path:
runners per device: 20
Found 1215050 fast5 files to process.
Init time: 14798 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2718941 ms, Samples called: 117598792614, samples/s: 4.32517e+07
Finishing up any open output files.
Basecalling completed successfully.
It appears that, despite the usage info (power, RAM, GPU cycles) showing activity on both cards, guppy doesn't scale across multiple cards for the same job.
Here is a link in the ‘wild’: https://bioinformatics.stackexchange.com/questions/8622/using-guppy-basecaller-on-node-with-2-gpus
So if this is the case, others out there already implementing multi-GPU set ups must be splitting the run across the cards in a more 'manual' fashion. We'll give it a whirl…
Bit of a ‘brute force’ approach…
# split the data into 2
ls -dv /scratch/20190606_yersinia/20190606_0042_yersinia/fast5/* | head -n 153 | xargs -i cp -r "{}" /scratch/20190606_yersinia/split1/
ls -dv /scratch/20190606_yersinia/20190606_0042_yersinia/fast5/* | tail -n 152 | xargs -i cp -r "{}" /scratch/20190606_yersinia/split2/
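The head/tail counts above are specific to this run (305 batch folders); the same split can be computed generically (a sketch using the paths above):
# sketch: split the fast5 batch folders into two halves without hard-coded counts
src=/scratch/20190606_yersinia/20190606_0042_yersinia/fast5
total=$(ls -d "$src"/* | wc -l)
half=$(( (total + 1) / 2 ))
mkdir -p /scratch/20190606_yersinia/split1 /scratch/20190606_yersinia/split2
ls -dv "$src"/* | head -n "$half" | xargs -i cp -r "{}" /scratch/20190606_yersinia/split1/
ls -dv "$src"/* | tail -n "$(( total - half ))" | xargs -i cp -r "{}" /scratch/20190606_yersinia/split2/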
Now run across the 2 GPUs at the same time:
# GPU0
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/split1 -s /scratch/20190606_yersinia/yersinia_basecalled_miles/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq
# GPU1
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/split2 -s /scratch/20190606_yersinia/yersinia_basecalled_miles/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:1" --compress_fastq
# run both jobs in parallel from bash
/DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/split1 -s /scratch/20190606_yersinia/yersinia_basecalled_miles/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq & /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/split2 -s /scratch/20190606_yersinia/yersinia_basecalled_miles/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:1" --compress_fastq &
^ There might be a 'nicer' way to submit both jobs at once, one to each GPU…
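For instance, a small bash loop can launch one job per GPU and wait for both to finish (a sketch; the output directory names here are hypothetical):
# sketch: one guppy job per GPU, separate output dirs, wait for both
for i in 0 1; do
/DSC/minion/ont-guppy/bin/guppy_basecaller \
-r -i /scratch/20190606_yersinia/split$((i + 1)) \
-s /scratch/20190606_yersinia/yersinia_basecalled_gpu${i}/ \
-c dna_r9.4.1_450bps_fast.cfg \
-x "cuda:${i}" --compress_fastq &
done
wait # blocks until both background jobs complete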
Results:
# GPU0
Caller time: 1305472 ms, Samples called: 53663662909, samples/s: 4.11067e+07
Finishing up any open output files.
Basecalling completed successfully.
# GPU1
Caller time: 1439230 ms, Samples called: 64310185363, samples/s: 4.46837e+07
Finishing up any open output files.
Basecalling completed successfully.
As expected, when split evenly across input files the time is halved (~23 mins).
Now the same split approach in high accuracy mode, run across the 2 GPUs at the same time:
# running in screen
screen -r 98630.pts-7.kscprod-data1
# GPU0
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split1 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:0" --compress_fastq
# GPU1
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split2 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:1" --compress_fastq
# run both jobs in parallel from bash
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split1 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/ \
-c dna_r9.4.1_450bps_hac.cfg -x "cuda:0" \
--compress_fastq & \
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split2 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:1" --compress_fastq &
NOTE: it seems you need to specify different output directories, as the files created overwrite each other… need to investigate this! [TODO]
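In the meantime, the per-GPU outputs can be merged after calling; gzip members concatenate cleanly, so something like this should work (a sketch; the combined file name is hypothetical):
# concatenate the compressed fastq output from both GPUs
cat /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/*.fastq.gz \
/scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/*.fastq.gz \
> /scratch/20190606_yersinia/yersinia_hac_combined.fastq.gz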
Results:
# GPU1
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split2 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: /scratch/20190606_yersinia/split2
save path: /scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/
chunk size: 1000
chunks per runner: 1000
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:1
kernel path:
runners per device: 2
Found 606463 fast5 files to process.
Init time: 8170 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 4695179 ms, Samples called: 64269239017, samples/s: 1.36883e+07
Finishing up any open output files.
Basecalling completed successfully.
# GPU0
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split1 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: /scratch/20190606_yersinia/split1
save path: /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/
chunk size: 1000
chunks per runner: 1000
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 2
Found 612000 fast5 files to process.
Init time: 10233 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 3893411 ms, Samples called: 53617652165, samples/s: 1.37714e+07
Finishing up any open output files.
Basecalling completed successfully.
Time taken for high accuracy calling when split across two Tesla V100s is ~1hr 18mins 15secs.
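# nvidia-smi with a single guppy job on GPU0 (presumably the [hac] full-data single-V100 run logged further below)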
orac$ nvidia-smi
Mon Jun 24 13:05:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:2F:00.0 Off | 0 |
| N/A 71C P0 218W / 250W | 4480MiB / 32480MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:86:00.0 Off | 0 |
| N/A 32C P0 25W / 250W | 11MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 276690 C ...minion/ont-guppy/bin/./guppy_basecaller 4469MiB |
+-----------------------------------------------------------------------------+
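# [hac] full data set on a single V100 (cuda:0)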
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: /scratch/20190606_yersinia/20190606_0042_yersinia
save path: /scratch/20190606_yersinia/yersinia_basecalled_miles_hac_gpu0/
chunk size: 1000
chunks per runner: 1000
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 2
Found 1214463 fast5 files to process.
Init time: 24971 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 8550005 ms, Samples called: 117502465139, samples/s: 1.3743e+07
Finishing up any open output files.
Basecalling completed successfully.
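# nvidia-smi during the dual-GPU single-job run (one guppy process listed on both cards)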
orac$ nvidia-smi
Fri Jun 14 11:51:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:2F:00.0 Off | 0 |
| N/A 59C P0 54W / 250W | 2074MiB / 32480MiB | 77% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:86:00.0 Off | 0 |
| N/A 58C P0 64W / 250W | 2074MiB / 32480MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 39405 C ...minion/ont-guppy/bin/./guppy_basecaller 2063MiB |
| 1 39405 C ...minion/ont-guppy/bin/./guppy_basecaller 2063MiB |
+-----------------------------------------------------------------------------+
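# nvidia-smi during the split-data run (a separate guppy process on each card)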
nvidia-smi
Mon Jun 24 10:52:11 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:2F:00.0 Off | 0 |
| N/A 70C P0 203W / 250W | 4480MiB / 32480MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:86:00.0 Off | 0 |
| N/A 71C P0 227W / 250W | 4480MiB / 32480MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 231682 C ...minion/ont-guppy/bin/./guppy_basecaller 4469MiB |
| 1 231993 C ...minion/ont-guppy/bin/./guppy_basecaller 4469MiB |
+-----------------------------------------------------------------------------+
Different physical GPUs but the same architecture: does it make a difference when base calling?
Total output looks the same:
orac$ du yersinia_basecalled_GPU0 -ha | tail -n 1
9.7G yersinia_basecalled_GPU0
orac$ du yersinia_basecalled_GPU1 -ha | tail -n 1
9.7G yersinia_basecalled_GPU1
It appears that files differ even though they have the same name:
orac$ diff -qr yersinia_basecalled_GPU0/ yersinia_basecalled_GPU1/ | grep 'differ' | wc -l
311
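Those diffs may simply reflect reads landing in different files/order rather than different base calls; comparing sorted read IDs is a fairer first check that both runs processed the same reads (a sketch):
# checksum the sorted read IDs from each run's fastq output
zcat yersinia_basecalled_GPU0/*.fastq.gz | awk 'NR % 4 == 1 { print $1 }' | sort | md5sum
zcat yersinia_basecalled_GPU1/*.fastq.gz | awk 'NR % 4 == 1 { print $1 }' | sort | md5sum
Matching checksums would confirm the same set of reads was called, even if per-read records are ordered differently on disk.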
dna_r9.4.1_450bps_fast.cfg
# Basic configuration file for ONT Guppy basecaller software.
# Data trimming.
trim_strategy = dna
trim_threshold = 2.5
trim_min_events = 3
# Basecalling.
model_file = template_r9.4.1_450bps_fast.jsn
chunk_size = 1000
gpu_runners_per_device = 20
chunks_per_runner = 20
chunks_per_caller = 10000
overlap = 50
qscore_offset = -0.098
qscore_scale = 0.935
builtin_scripts = 1
# Calibration strand detection
calib_reference = lambda_3.6kb.fasta
calib_min_sequence_length = 3000
calib_max_sequence_length = 3800
calib_min_coverage = 0.6
# Output.
records_per_fastq = 4000
min_qscore = 7.0
# Telemetry
ping_url = https://ping.oxfordnanoportal.com/basecall
ping_segment_duration = 60
dna_r9.4.1_450bps_fast_prom.cfg
# Basic configuration file for ONT Guppy basecaller software.
# Data trimming.
trim_strategy = dna
trim_threshold = 2.5
trim_min_events = 3
# Basecalling.
model_file = template_r9.4.1_450bps_fast_prom.jsn
chunk_size = 1000
gpu_runners_per_device = 8
chunks_per_runner = 256
chunks_per_caller = 10000
overlap = 50
qscore_offset = 0.127
qscore_scale = 0.958
builtin_scripts = 1
# Calibration strand detection
calib_reference = lambda_3.6kb.fasta
calib_min_sequence_length = 3000
calib_max_sequence_length = 3800
calib_min_coverage = 0.6
# Output.
records_per_fastq = 4000
min_qscore = 7.0
# Telemetry
ping_url = https://ping.oxfordnanoportal.com/basecall
ping_segment_duration = 60
dna_r9.4.1_450bps_hac.cfg
This configuration implements high accuracy base calling.
# Basic configuration file for ONT Guppy basecaller software.
# Compatibility
compatible_flowcells = FLO-FLG001,FLO-MIN106
compatible_kits = SQK-CAS109,SQK-DCS108,SQK-DCS109,SQK-LRK001,SQK-LSK108,SQK-LSK109,SQK-LSK109-XL,SQK-LWP001,SQK-PCS108,SQK-PCS109,SQK-PSK004,SQK-RAD002,SQK-RAD003,SQK-RAD004,SQK-RAS201,SQK-RLI001,VSK-VBK001,VSK-VSK001,VSK-VSK002
compatible_kits_with_barcoding = SQK-16S024,SQK-PCB109,SQK-RBK001,SQK-RBK004,SQK-RLB001,SQK-LWB001,SQK-PBK004,SQK-RAB201,SQK-RAB204,SQK-RPB004,VSK-VMK001,VSK-VMK002
# Data trimming.
trim_strategy = dna
trim_threshold = 2.5
trim_min_events = 3
# Basecalling.
model_file = template_r9.4.1_450bps_hac.jsn
chunk_size = 1000
gpu_runners_per_device = 2
chunks_per_runner = 1000
chunks_per_caller = 10000
overlap = 50
qscore_offset = 0.25
qscore_scale = 0.91
builtin_scripts = 1
# Calibration strand detection
calib_reference = lambda_3.6kb.fasta
calib_min_sequence_length = 3000
calib_max_sequence_length = 3800
calib_min_coverage = 0.6
# Output.
records_per_fastq = 4000
min_qscore = 7.0
# Telemetry
ping_url = https://ping.oxfordnanoportal.com/basecall
ping_segment_duration = 60
# R helper to convert Guppy 'Caller time' (reported in ms) into hrs/mins/secs
revtrunc <- function(x) { sign(x) * (x - floor(x)) } # fractional part of x
callerTime <- 4695179 # caller time in ms (here: the GPU1 hac split run)
timeHours <- (callerTime / 1000) / 60 / 60
timeMins <- revtrunc(timeHours) * 60
timeSecs <- round(revtrunc(timeMins) * 60)
paste0(floor(timeHours), "hrs ", floor(timeMins), "mins ", timeSecs, "secs")
# [1] "1hrs 18mins 15secs"
A work by Miles Benton
miles.benton@esr.cri.nz