This documentation was last modified on: June 24, 2019 at 21:54:42
Base calling of Nanopore data is notoriously slow when performed on CPUs, so there has been a large push towards GPU-based software. This document details benchmarking of the Guppy base caller for Oxford Nanopore sequencing data across two different environments.
The benchmarking performed here has been done on a desktop workstation (running Siduction) and a headless rack server (running CentOS 7.6).
The Linux workstation is a Lenovo ThinkStation P920 (running Siduction):
CPU: (2x) 12 core Intel Xeon Gold 5118 (48 threads)
RAM: 256GB
GFX: Nvidia Titan RTX
SSD: 1TB
HDD: 20TB
The rack server runs CentOS 7.6 (full specs to be confirmed):
CPU: Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz (64 threads)
RAM: 512GB (?)
GFX: (2x) Nvidia Tesla V100
Here is a quick feature breakdown of the two Nvidia cards used (Tesla V100 and Titan RTX):
Feature | Tesla V100 | Titan RTX |
---|---|---|
Pipelines (Cuda cores) | 5120 | 4608 |
Core clock speed | 1246 MHz | 1350 MHz |
Boost Clock | 1380 MHz | 1770 MHz |
Memory | 32GB | 24GB |
Transistor count | 21,100 million | 18,600 million |
Manufacturing process technology | 12 nm | 12 nm |
Power consumption (TDP) | 250 Watt | 280 Watt |
Nvidia drivers | 410.48 | 418.56 |
Cuda version | 10.0 | 10.1 |
Before adopting GPU-enabled callers such as Guppy, base calling was performed using CPUs, which would usually take more than a day per run.
Note: I can try to source actual numbers here, as that would be useful.
Specific configuration files are provided with Guppy that can be used to set optimal parameters for specific conditions.
I chose to use flip-flop fast calling for both environments:
dna_r9.4.1_450bps_fast.cfg
dna_r9.4.1_450bps_fast_prom.cfg
UPDATE [2019-06-21]: currently benchmarking using the high accuracy calling model (dna_r9.4.1_450bps_hac.cfg). The document will be updated with results as they become available.
UPDATE [2019-06-17]: I have recently learnt that configuration files are tailored to specific Nanopore machines (i.e. MinION, PromethION) and not to the graphics card set up. All guppy runs should therefore be performed using the 'standard' config file (dna_r9.4.1_450bps_fast.cfg). The benchmarking will be amended in this light. I will keep the current information, as run times shouldn't differ much between the configs; it's mainly the accuracy of the base calling that will be influenced.
Note: config file contents are included at the end of this document.
From what I can glean from the documentation, the second config file (dna_r9.4.1_450bps_fast_prom.cfg) is modified for V100 cards, so I thought it would make sense to run this config on those cards.
I will test the ‘base’ config file across both V100 cards as well to see if there is in fact any difference in calling speed resulting from different configurations.
TODO: look into the individual parameters and see if there is anything that we can tweak to eke out more performance. From what I've read, Nanopore only officially support about three GPU set ups with Guppy (V100, GeForce 1080/1080Ti, and the Jetson platform); however, they state that this is merely because of the potential for many different Nvidia/CUDA driver set ups across Linux distros. There are many cases of people using a wide range of cards successfully, though they don't seem to publish/talk about their config/parameter settings!
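As an aside on the tweaking: Guppy's config values can be overridden on the command line. A minimal sketch (these flags are the same ones used later in this document; the values shown are simply the PromethION-config defaults, not tuned recommendations):
# sketch: overriding config parameters at the CLI (illustrative values only)
guppy_basecaller \
-r -i raw_data/ -s out/ \
-c dna_r9.4.1_450bps_fast.cfg \
-x "cuda:0" --compress_fastq \
--num_callers 4 \
--gpu_runners_per_device 8 \
--chunks_per_runner 256 \
--chunk_size 1000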
This testing was performed on a recent MinION run (~24 hrs) that produced ~20 Gb of sequence (~150 GB of raw fast5 data). This would typically take longer than a day to base call using prior CPU-based methods.
run | CPUs | Titan RTX | single V100 | double V100 |
---|---|---|---|---|
[fast] test (small data) | ~50 secs | 0.6 secs | 1.5 secs | 1.7 secs |
[fast] full data | >1 day | 44.67 mins* | 44.88 mins | 45.20 mins |
[fast] split data+ (only V100s) | - | - | - | 23 mins |
[hac] full data | - | ~3hr 24mins | ~2hrs 22mins | - |
[hac] split data+ (only V100s) | - | - | - | ~1hr 18mins |
(note: fast = fast base calling, hac = high accuracy base calling, - = did not test)
* The first run using fast calling took 63.98 mins, but for some reason subsequent runs were significantly quicker. I've triple-checked the code and it is no different, other than that the first run had slightly less data (it didn't include the folder with the small test data set)!
+ We took a ‘brute force’ approach and split the data into two even sets, sending one to each V100.
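For reference, the minute values in the table are derived from the 'Caller time' (reported in milliseconds) in the Guppy logs below; e.g. for the third Titan RTX run:
# 2680343 ms -> minutes
awk 'BEGIN { printf "%.2f mins\n", 2680343 / 1000 / 60 }' # prints: 44.67 mins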
The great news is that we’re now base calling significantly faster than previously!
[fast mode] We have reduced base calling from >1 day down to ~45 mins on a single card. It turns out that guppy doesn't scale across multiple GPUs, but by splitting the data and sending smaller chunks off to each GPU we finished the run in 23 mins - so that's halved again!
[hac mode] We have now completed testing in high accuracy mode. As expected it takes longer to base call, however it is still very fast (see the table above for full details). We are currently seeing a run time of ~2hr 20mins using a single V100, which we are able to cut to around 1hr 20mins by splitting the data between two V100s. Interestingly, the Titan RTX is significantly slower in high accuracy mode (~3hrs 20mins), whereas it was slightly ahead of a single V100 in fast mode. This leads us to believe that high accuracy mode likely leverages more CUDA cores for the base calling - it would be great to have this confirmed.
So, all in all, this is a great increase in efficiency and will allow us to get to alignment and results much faster than before, with much less CPU overhead.
I believe that there is still a lot of tweaking that can be done to increase performance. I find it odd that the two V100s aren't scaling for a single job when I have read of different institutes scaling Guppy across >8 V100s; it turns out these places must just be doing what we've done above, 'manually' splitting the data and sharing it across the cards to get the speed increases.
The things that have me a little perplexed:
- guppy doesn't appear to be optimised for any card in particular.
- Is guppy hardcoded to work with X CUDA cores??

Below are log outputs for each 'run', with more detail for those who are interested.
# Titan RTX first run
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -i 20190606_0042_yersinia/fast5/ -r -s yersinia_basecalled/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: 20190606_0042_yersinia/fast5/
save path: yersinia_basecalled/
chunk size: 1000
chunks per runner: 20
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 20
Found 1214463 fast5 files to process.
Init time: 8112 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 3838880 ms, Samples called: 117589088085, samples/s: 3.06311e+07
Finishing up any open output files.
Basecalling completed successfully.
On this next run I forgot to add the flag to compress the output, and it resulted in doubling the time taken to complete base calling!
# Titan RTX second run (no compression)
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -r -i raw_data/ -s yersinia_basecalled_20190614/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0"
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: raw_data/
save path: yersinia_basecalled_20190614/
chunk size: 1000
chunks per runner: 20
records per file: 4000
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 20
Found 1215050 fast5 files to process.
Init time: 6899 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 5982105 ms, Samples called: 117598792614, samples/s: 1.96584e+07
Finishing up any open output files.
Basecalling completed successfully.
# Titan RTX third run (compression)
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -r -i raw_data/ -s yersinia_basecalled_20190614/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: raw_data/
save path: yersinia_basecalled_20190614/
chunk size: 1000
chunks per runner: 20
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 20
Found 1215050 fast5 files to process.
Init time: 7655 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2680343 ms, Samples called: 117598792614, samples/s: 4.38745e+07
Finishing up any open output files.
Basecalling completed successfully.
I'm really not sure how this keeps increasing in performance?? The following run took 41.65 mins…
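# Titan RTX fourth run (compression)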
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -r -i raw_data/ -s yersinia_basecalled_20190614_run2/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: raw_data/
save path: yersinia_basecalled_20190614_run2/
chunk size: 1000
chunks per runner: 20
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 20
Found 1215050 fast5 files to process.
Init time: 6335 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2496930 ms, Samples called: 117598792614, samples/s: 4.70974e+07
Finishing up any open output files.
Basecalling completed successfully.
This time around we’re giving the high accuracy base calling configuration a whirl:
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller -r -i raw_data/ -s yersinia_basecalled_20190621_hac/ -c dna_r9.4.1_450bps_hac.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: raw_data/
save path: yersinia_basecalled_20190621_hac/
chunk size: 1000
chunks per runner: 1000
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 2
Found 1215050 fast5 files to process.
Init time: 12755 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 12227895 ms, Samples called: 117512178322, samples/s: 9.61017e+06
Finishing up any open output files.
Basecalling completed successfully.
So total calling time was ~3hr 24mins.
Looking to optimise based on information from this gist: https://gist.github.com/disulfidebond/00ff5a6f84a0a81057c6e5817c540569
These optimisations were made for a 2080 Ti, so obviously there is more headroom on a Titan RTX. I will try this run and compare it against the default config; that way we can begin to determine the best parameters to tweak for performance.
# default high accuracy mode
~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller \
-r -i raw_data/ \
-s yersinia_basecalled_20190621_hac/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:0" \
--compress_fastq
# param tweaking in high accuracy mode
~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller \
-r -i raw_data/ \
-s yersinia_basecalled_20190621_hac2/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:0" \
--compress_fastq \
--num_callers 14 --gpu_runners_per_device 8 \
--chunks_per_runner 768 --chunk_size 500
Run with parameter tweaks:
# param tweaking in high accuracy mode
miles@leviathan:/data/una/20190606_yersinia$ ~/Downloads/software/guppy/ont-guppy/bin/./guppy_basecaller \
> -r -i raw_data/ \
> -s yersinia_basecalled_20190621_hac2/ \
> -c dna_r9.4.1_450bps_hac.cfg \
> -x "cuda:0" \
> --compress_fastq \
> --num_callers 14 --gpu_runners_per_device 8 \
> --chunks_per_runner 768 --chunk_size 500
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /home/miles/Downloads/software/guppy/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /home/miles/Downloads/software/guppy/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: raw_data/
save path: yersinia_basecalled_20190621_hac2/
chunk size: 500
chunks per runner: 768
records per file: 4000
fastq compression: ON
num basecallers: 14
gpu device: cuda:0
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 11052 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 11162858 ms, Samples called: 117512178322, samples/s: 1.05271e+07
Finishing up any open output files.
There is a slight speed-up using the modified parameters: the data is now called in ~3hrs 6mins.
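The nvidia-smi snapshots below were captured while runs were in progress; to poll the GPU continuously during a run, something like this does the trick:
watch -n 5 nvidia-smi # refresh the GPU stats every 5 seconds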
miles@leviathan:/data/una/20190606_yersinia$ nvidia-smi
Fri Jun 14 11:52:42 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX On | 00000000:18:00.0 On | N/A |
| 90% 87C P2 190W / 280W | 988MiB / 24185MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2196 G cinnamon 179MiB |
| 0 17313 C .../guppy/ont-guppy/bin/./guppy_basecaller 533MiB |
| 0 29246 G /usr/lib/firefox/firefox 3MiB |
| 0 29272 G /usr/lib/firefox/firefox 3MiB |
| 0 31456 G ...-token=F43A3D74BB834EC5E6F306FFD3FF6D0F 45MiB |
| 0 48087 G /usr/lib/xorg/Xorg 209MiB |
+-----------------------------------------------------------------------------+
miles@leviathan:/data/una/20190606_yersinia$ nvidia-smi
Fri Jun 21 14:46:09 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX On | 00000000:18:00.0 On | N/A |
| 93% 87C P2 201W / 280W | 4681MiB / 24185MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2196 G cinnamon 186MiB |
| 0 29094 G /usr/lib/firefox/firefox 3MiB |
| 0 29246 G /usr/lib/firefox/firefox 3MiB |
| 0 29272 G /usr/lib/firefox/firefox 3MiB |
| 0 34972 G ...-token=70703895EF88CE4F3C785AA29591F246 57MiB |
| 0 46713 C .../guppy/ont-guppy/bin/./guppy_basecaller 4219MiB |
| 0 48087 G /usr/lib/xorg/Xorg 193MiB |
+-----------------------------------------------------------------------------+
The snapshot below was taken during the parameter-tweaked run (--num_callers 14 --gpu_runners_per_device 8 --chunks_per_runner 768 --chunk_size 500):
miles@leviathan:/data/una/20190606_yersinia$ nvidia-smi
Fri Jun 21 19:52:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN RTX On | 00000000:18:00.0 On | N/A |
| 99% 88C P2 215W / 280W | 6849MiB / 24185MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2196 G cinnamon 186MiB |
| 0 6911 C .../guppy/ont-guppy/bin/./guppy_basecaller 6387MiB |
| 0 29094 G /usr/lib/firefox/firefox 3MiB |
| 0 29246 G /usr/lib/firefox/firefox 3MiB |
| 0 29272 G /usr/lib/firefox/firefox 3MiB |
| 0 34972 G ...-token=70703895EF88CE4F3C785AA29591F246 57MiB |
| 0 48087 G /usr/lib/xorg/Xorg 193MiB |
+-----------------------------------------------------------------------------+
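# Tesla V100 first run (single card, GPU0)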
screen -r 190473.pts-0.kscprod-data1
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled_GPU0 -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
input path: /scratch/20190606_yersinia/
save path: /scratch/yersinia_basecalled_GPU0
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 13980 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2706948 ms, Samples called: 117598792614, samples/s: 4.34433e+07
Finishing up any open output files.
Basecalling completed successfully.
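# Tesla V100 first run (single card, GPU1)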
screen -r 40715.pts-7.kscprod-data1
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled_GPU1 -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
input path: /scratch/20190606_yersinia/
save path: /scratch/yersinia_basecalled_GPU1
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:1
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 11944 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2693057 ms, Samples called: 117598792614, samples/s: 4.36674e+07
Finishing up any open output files.
Basecalling completed successfully.
This run was performed in /DSC/minion; the drive is NFS-mounted storage, so I/O might be a limitation.
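A rough way to sanity-check the write throughput of the mount (a generic sketch, not part of the benchmarking itself; the test file path is arbitrary):
# write 1 GiB with direct I/O to estimate sustained write speed
dd if=/dev/zero of=/DSC/minion/dd_test.bin bs=1M count=1024 oflag=direct
rm /DSC/minion/dd_test.bin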
# Tesla V100 (x2) first run (on /DSC/)
orac$ ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s yersinia_called/ -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:0 cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
input path: /scratch/20190606_yersinia/
save path: yersinia_called/
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0 cuda:1
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 14839 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2712127 ms, Samples called: 117598792614, samples/s: 4.33604e+07
Finishing up any open output files.
Basecalling completed successfully.
This run was performed in /scratch, which is an array of SSD storage (data is striped over 8 x 800GB SSDs), so I/O should be less of an issue.
# Tesla V100 (x2) first run (on /scratch/)
orac$ ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled/ -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:0 cuda:1" --compress_fastq
bash: ont-guppy/bin/./guppy_basecaller: No such file or directory
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled/ -c dna_r9.4.1_450bps_fast_prom.cfg -x "cuda:0 cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
input path: /scratch/20190606_yersinia/
save path: /scratch/yersinia_basecalled/
chunk size: 1000
chunks per runner: 256
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0 cuda:1
kernel path:
runners per device: 8
Found 1215050 fast5 files to process.
Init time: 14839 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2736231 ms, Samples called: 117598792614, samples/s: 4.29784e+07
Finishing up any open output files.
Basecalling completed successfully.
This run was performed across both cards using the standard, non-PromethION-optimised configuration file (dna_r9.4.1_450bps_fast.cfg).
It appears as if there is no difference in time between the two config files:
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/ -s /scratch/yersinia_basecalled_nonprom/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0 cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path: /scratch/20190606_yersinia/
save path: /scratch/yersinia_basecalled_nonprom/
chunk size: 1000
chunks per runner: 20
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0 cuda:1
kernel path:
runners per device: 20
Found 1215050 fast5 files to process.
Init time: 14798 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 2718941 ms, Samples called: 117598792614, samples/s: 4.32517e+07
Finishing up any open output files.
Basecalling completed successfully.
It appears that, despite the usage info (power, RAM, GPU cycles) showing activity on both cards, guppy doesn't scale across multiple cards for the same job.
Here is a link in the ‘wild’: https://bioinformatics.stackexchange.com/questions/8622/using-guppy-basecaller-on-node-with-2-gpus
So if this is the case, others out there already implementing multi-GPU set ups must be splitting the run across the cards in a more 'manual' fashion. We'll give it a whirl…
Bit of a ‘brute force’ approach…
# split the data into 2
ls -dv /scratch/20190606_yersinia/20190606_0042_yersinia/fast5/* | head -n 153 | xargs -i cp -r "{}" /scratch/20190606_yersinia/split1/
ls -dv /scratch/20190606_yersinia/20190606_0042_yersinia/fast5/* | tail -n 152 | xargs -i cp -r "{}" /scratch/20190606_yersinia/split2/
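The head/tail counts above are specific to this run (305 batch folders); the same split can be computed generically (a sketch using the paths above):
# sketch: split the fast5 batch folders into two halves without hard-coded counts
src=/scratch/20190606_yersinia/20190606_0042_yersinia/fast5
total=$(ls -d "$src"/* | wc -l)
half=$(( (total + 1) / 2 ))
mkdir -p /scratch/20190606_yersinia/split1 /scratch/20190606_yersinia/split2
ls -dv "$src"/* | head -n "$half" | xargs -i cp -r "{}" /scratch/20190606_yersinia/split1/
ls -dv "$src"/* | tail -n "$(( total - half ))" | xargs -i cp -r "{}" /scratch/20190606_yersinia/split2/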
Now run across the 2 GPUs at the same time:
# GPU0
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/split1 -s /scratch/20190606_yersinia/yersinia_basecalled_miles/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq
# GPU1
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/split2 -s /scratch/20190606_yersinia/yersinia_basecalled_miles/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:1" --compress_fastq
# run both jobs in parallel from bash
/DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/split1 -s /scratch/20190606_yersinia/yersinia_basecalled_miles/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:0" --compress_fastq & /DSC/minion/ont-guppy/bin/./guppy_basecaller -r -i /scratch/20190606_yersinia/split2 -s /scratch/20190606_yersinia/yersinia_basecalled_miles/ -c dna_r9.4.1_450bps_fast.cfg -x "cuda:1" --compress_fastq &
^ There might be a 'nicer' way to submit both jobs at once, one to each GPU…
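For instance, a small bash loop can launch one job per GPU and wait for both to finish (a sketch; the output directory names here are hypothetical):
# sketch: one guppy job per GPU, separate output dirs, wait for both
for i in 0 1; do
/DSC/minion/ont-guppy/bin/guppy_basecaller \
-r -i /scratch/20190606_yersinia/split$((i + 1)) \
-s /scratch/20190606_yersinia/yersinia_basecalled_gpu${i}/ \
-c dna_r9.4.1_450bps_fast.cfg \
-x "cuda:${i}" --compress_fastq &
done
wait # blocks until both background jobs complete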
Results:
# GPU0
Caller time: 1305472 ms, Samples called: 53663662909, samples/s: 4.11067e+07
Finishing up any open output files.
Basecalling completed successfully.
# GPU1
Caller time: 1439230 ms, Samples called: 64310185363, samples/s: 4.46837e+07
Finishing up any open output files.
Basecalling completed successfully.
As expected, when split evenly across input files the time is halved (~23 mins).
Now the same split approach in high accuracy mode, run across the 2 GPUs at the same time:
# running in screen
screen -r 98630.pts-7.kscprod-data1
# GPU0
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split1 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:0" --compress_fastq
# GPU1
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split2 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:1" --compress_fastq
# run both jobs in parallel from bash
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split1 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/ \
-c dna_r9.4.1_450bps_hac.cfg -x "cuda:0" \
--compress_fastq & \
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split2 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:1" --compress_fastq &
NOTE: it seems you need to specify different output directories, as the files created overwrite each other… need to investigate this! [TODO]
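In the meantime, the per-GPU outputs can be merged after calling; gzip members concatenate cleanly, so something like this should work (a sketch; the combined file name is hypothetical):
# concatenate the compressed fastq output from both GPUs
cat /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/*.fastq.gz \
/scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/*.fastq.gz \
> /scratch/20190606_yersinia/yersinia_hac_combined.fastq.gz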
Results:
# GPU1
/DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split2 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:1" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: /scratch/20190606_yersinia/split2
save path: /scratch/20190606_yersinia/yersinia_basecalled_miles_hac2/
chunk size: 1000
chunks per runner: 1000
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:1
kernel path:
runners per device: 2
Found 606463 fast5 files to process.
Init time: 8170 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 4695179 ms, Samples called: 64269239017, samples/s: 1.36883e+07
Finishing up any open output files.
Basecalling completed successfully.
# GPU0
orac$ /DSC/minion/ont-guppy/bin/./guppy_basecaller \
-r -i /scratch/20190606_yersinia/split1 \
-s /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/ \
-c dna_r9.4.1_450bps_hac.cfg \
-x "cuda:0" --compress_fastq
ONT Guppy basecalling software version 3.1.5+781ed57
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: /scratch/20190606_yersinia/split1
save path: /scratch/20190606_yersinia/yersinia_basecalled_miles_hac/
chunk size: 1000
chunks per runner: 1000
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 2
Found 612000 fast5 files to process.
Init time: 10233 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 3893411 ms, Samples called: 53617652165, samples/s: 1.37714e+07
Finishing up any open output files.
Basecalling completed successfully.
Time taken for high accuracy calling when split across two Tesla V100s is ~1hr 18mins 15secs.
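# nvidia-smi with a single guppy job on GPU0 (presumably the [hac] full-data single-V100 run logged further below)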
orac$ nvidia-smi
Mon Jun 24 13:05:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:2F:00.0 Off | 0 |
| N/A 71C P0 218W / 250W | 4480MiB / 32480MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:86:00.0 Off | 0 |
| N/A 32C P0 25W / 250W | 11MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 276690 C ...minion/ont-guppy/bin/./guppy_basecaller 4469MiB |
+-----------------------------------------------------------------------------+
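# [hac] full data set on a single V100 (cuda:0)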
config file: /DSC/minion/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file: /DSC/minion/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path: /scratch/20190606_yersinia/20190606_0042_yersinia
save path: /scratch/20190606_yersinia/yersinia_basecalled_miles_hac_gpu0/
chunk size: 1000
chunks per runner: 1000
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: cuda:0
kernel path:
runners per device: 2
Found 1214463 fast5 files to process.
Init time: 24971 ms
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 8550005 ms, Samples called: 117502465139, samples/s: 1.3743e+07
Finishing up any open output files.
Basecalling completed successfully.
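# nvidia-smi during the dual-GPU single-job run (one guppy process listed on both cards)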
orac$ nvidia-smi
Fri Jun 14 11:51:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:2F:00.0 Off | 0 |
| N/A 59C P0 54W / 250W | 2074MiB / 32480MiB | 77% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:86:00.0 Off | 0 |
| N/A 58C P0 64W / 250W | 2074MiB / 32480MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 39405 C ...minion/ont-guppy/bin/./guppy_basecaller 2063MiB |
| 1 39405 C ...minion/ont-guppy/bin/./guppy_basecaller 2063MiB |
+-----------------------------------------------------------------------------+
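# nvidia-smi during the split-data run (a separate guppy process on each card)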
nvidia-smi
Mon Jun 24 10:52:11 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:2F:00.0 Off | 0 |
| N/A 70C P0 203W / 250W | 4480MiB / 32480MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:86:00.0 Off | 0 |
| N/A 71C P0 227W / 250W | 4480MiB / 32480MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 231682 C ...minion/ont-guppy/bin/./guppy_basecaller 4469MiB |
| 1 231993 C ...minion/ont-guppy/bin/./guppy_basecaller 4469MiB |
+-----------------------------------------------------------------------------+
Different physical GPUs but the same architecture: does it make a difference when base calling?
Total output looks the same:
orac$ du yersinia_basecalled_GPU0 -ha | tail -n 1
9.7G yersinia_basecalled_GPU0
orac$ du yersinia_basecalled_GPU1 -ha | tail -n 1
9.7G yersinia_basecalled_GPU1
It appears that files differ even though they have the same name:
orac$ diff -qr yersinia_basecalled_GPU0/ yersinia_basecalled_GPU1/ | grep 'differ' | wc -l
311
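Those diffs may simply reflect reads landing in different files/order rather than different base calls; comparing sorted read IDs is a fairer first check that both runs processed the same reads (a sketch):
# checksum the sorted read IDs from each run's fastq output
zcat yersinia_basecalled_GPU0/*.fastq.gz | awk 'NR % 4 == 1 { print $1 }' | sort | md5sum
zcat yersinia_basecalled_GPU1/*.fastq.gz | awk 'NR % 4 == 1 { print $1 }' | sort | md5sum
Matching checksums would confirm the same set of reads was called, even if per-read records are ordered differently on disk.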
dna_r9.4.1_450bps_fast.cfg
# Basic configuration file for ONT Guppy basecaller software.
# Data trimming.
trim_strategy = dna
trim_threshold = 2.5
trim_min_events = 3
# Basecalling.
model_file = template_r9.4.1_450bps_fast.jsn
chunk_size = 1000
gpu_runners_per_device = 20
chunks_per_runner = 20
chunks_per_caller = 10000
overlap = 50
qscore_offset = -0.098
qscore_scale = 0.935
builtin_scripts = 1
# Calibration strand detection
calib_reference = lambda_3.6kb.fasta
calib_min_sequence_length = 3000
calib_max_sequence_length = 3800
calib_min_coverage = 0.6
# Output.
records_per_fastq = 4000
min_qscore = 7.0
# Telemetry
ping_url = https://ping.oxfordnanoportal.com/basecall
ping_segment_duration = 60
dna_r9.4.1_450bps_fast_prom.cfg
# Basic configuration file for ONT Guppy basecaller software.
# Data trimming.
trim_strategy = dna
trim_threshold = 2.5
trim_min_events = 3
# Basecalling.
model_file = template_r9.4.1_450bps_fast_prom.jsn
chunk_size = 1000
gpu_runners_per_device = 8
chunks_per_runner = 256
chunks_per_caller = 10000
overlap = 50
qscore_offset = 0.127
qscore_scale = 0.958
builtin_scripts = 1
# Calibration strand detection
calib_reference = lambda_3.6kb.fasta
calib_min_sequence_length = 3000
calib_max_sequence_length = 3800
calib_min_coverage = 0.6
# Output.
records_per_fastq = 4000
min_qscore = 7.0
# Telemetry
ping_url = https://ping.oxfordnanoportal.com/basecall
ping_segment_duration = 60
dna_r9.4.1_450bps_hac.cfg
This configuration implements high accuracy base calling.
# Basic configuration file for ONT Guppy basecaller software.
# Compatibility
compatible_flowcells = FLO-FLG001,FLO-MIN106
compatible_kits = SQK-CAS109,SQK-DCS108,SQK-DCS109,SQK-LRK001,SQK-LSK108,SQK-LSK109,SQK-LSK109-XL,SQK-LWP001,SQK-PCS108,SQK-PCS109,SQK-PSK004,SQK-RAD002,SQK-RAD003,SQK-RAD004,SQK-RAS201,SQK-RLI001,VSK-VBK001,VSK-VSK001,VSK-VSK002
compatible_kits_with_barcoding = SQK-16S024,SQK-PCB109,SQK-RBK001,SQK-RBK004,SQK-RLB001,SQK-LWB001,SQK-PBK004,SQK-RAB201,SQK-RAB204,SQK-RPB004,VSK-VMK001,VSK-VMK002
# Data trimming.
trim_strategy = dna
trim_threshold = 2.5
trim_min_events = 3
# Basecalling.
model_file = template_r9.4.1_450bps_hac.jsn
chunk_size = 1000
gpu_runners_per_device = 2
chunks_per_runner = 1000
chunks_per_caller = 10000
overlap = 50
qscore_offset = 0.25
qscore_scale = 0.91
builtin_scripts = 1
# Calibration strand detection
calib_reference = lambda_3.6kb.fasta
calib_min_sequence_length = 3000
calib_max_sequence_length = 3800
calib_min_coverage = 0.6
# Output.
records_per_fastq = 4000
min_qscore = 7.0
# Telemetry
ping_url = https://ping.oxfordnanoportal.com/basecall
ping_segment_duration = 60
# R helper to convert Guppy 'Caller time' (reported in ms) into hrs/mins/secs
revtrunc <- function(x) { sign(x) * (x - floor(x)) } # fractional part of x
callerTime <- 4695179 # caller time in ms (here: the GPU1 hac split run)
timeHours <- (callerTime / 1000) / 60 / 60
timeMins <- revtrunc(timeHours) * 60
timeSecs <- round(revtrunc(timeMins) * 60)
paste0(floor(timeHours), "hrs ", floor(timeMins), "mins ", timeSecs, "secs")
# [1] "1hrs 18mins 15secs"
A work by Miles Benton
miles.benton@esr.cri.nz