Slow Inference Speeds on Cluster GPUs #833

jmdelahanty · 2022-07-08T21:15:46Z

Hello everyone!

In our lab's use of SLEAP on the SNL cluster, inference speeds seem to sit at around 30 FPS on a GPU.

The methods paper (which I think had things trained on A100 GPUs?) reports speeds of several hundred FPS! Crazy cool speed!

While we aren't using A100s (yet), I would think that 30 FPS is somewhat slow. Here are the specs of an example machine people are running on:

(base) jdelahanty@tesla:/snlkt/data/facial_expression/specialk$ nvidia-smi -l 1
Fri Jul  8 13:41:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   33C    P0    34W / 250W |   8353MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   34C    P0    34W / 250W |   8353MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:08:00.0 Off |                    0 |
| N/A   30C    P0    32W / 250W |   8353MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:09:00.0 Off |                    0 |
| N/A   31C    P0    33W / 250W |   8353MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K40c          Off  | 00000000:88:00.0 Off |                    0 |
| 23%   26C    P8    20W / 235W |      3MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     37797      C   ...da3/envs/sleap/bin/python     8351MiB |
|    1   N/A  N/A     38257      C   ...da3/envs/sleap/bin/python     8351MiB |
|    2   N/A  N/A      1241      C   ...da3/envs/sleap/bin/python     8351MiB |
|    3   N/A  N/A      6198      C   ...da3/envs/sleap/bin/python     8351MiB |

You can even see that there are some people running SLEAP on it!

Here's some info about the machine itself:

tesla is managed by SNL Support <[email protected]> and runs Debian Stretch v9.X.
CPU Count: 40 : "GenuineIntel Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (2 chips x 10 cores : 20 hyperthread cores)"

This computer only has 20 cores (40 with hyperthreading), so maybe that could be a bottleneck? It also looks like the inference batch size could be almost doubled, since lots of GPU memory is still available.

Perhaps the most important thing to notice is that the GPU-Util value is 0%! I have a feeling that even though people are trying to specify GPU usage, the GPU isn't really doing much work! I'm basing my understanding of what the GPU-Util value means on this Medium post and these docs.

Here's an attempt to query things and its output:

(base) jdelahanty@tesla:/snlkt/data/facial_expression/specialk$ nvidia-smi --query-gpu=timestamp,pci.bus_id,utilization.gpu,driver_version,name --format=csv
timestamp, pci.bus_id, utilization.gpu [%], driver_version, name
2022/07/08 14:14:25.791, 00000000:04:00.0, 0 %, 460.106.00, Tesla P100-PCIE-16GB
2022/07/08 14:14:25.796, 00000000:05:00.0, 0 %, 460.106.00, Tesla P100-PCIE-16GB
2022/07/08 14:14:25.801, 00000000:08:00.0, 0 %, 460.106.00, Tesla P100-PCIE-16GB
2022/07/08 14:14:25.805, 00000000:09:00.0, 0 %, 460.106.00, Tesla P100-PCIE-16GB
2022/07/08 14:14:25.807, 00000000:88:00.0, 0 %, 460.106.00, Tesla K40c

Edit:

After staring at nvidia-smi polling for a few minutes, sometimes the GPU-Util gets to 15% for a second, then drops down to zero again.
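Instead of watching the polling output by eye, the CSV form of the query can be parsed and logged. Here's a minimal sketch, assuming the column names match the query output above; `parse_gpu_utilization` is a hypothetical helper name, not part of any NVIDIA or SLEAP tooling.

```python
import csv
import io

def parse_gpu_utilization(csv_text):
    """Return {bus_id: utilization percent} from nvidia-smi CSV output."""
    # skipinitialspace drops the space that follows each comma in the output.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return {
        row["pci.bus_id"]: int(row["utilization.gpu [%]"].rstrip(" %"))
        for row in reader
    }

# Two rows copied from the query output above.
sample = """timestamp, pci.bus_id, utilization.gpu [%], driver_version, name
2022/07/08 14:14:25.791, 00000000:04:00.0, 0 %, 460.106.00, Tesla P100-PCIE-16GB
2022/07/08 14:14:25.807, 00000000:88:00.0, 0 %, 460.106.00, Tesla K40c
"""

utilization = parse_gpu_utilization(sample)
print(utilization)
```

On the cluster, the same function could be fed the live output of the query command shown above (for example via `subprocess.run(..., capture_output=True, text=True).stdout`) inside a loop, so short bursts like the 15% spikes get recorded instead of missed.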

Edit2:

It could be that they were running an older version of SLEAP that had latency between the CPU and GPU from a bug that Arlo mentioned he fixed. It looks like on most recent versions of SLEAP the GPU is getting utilized, but only gets up to something like 50%...

I could be misinterpreting things, but it looks to me that things are being loaded into the GPU but the card isn't actually doing anything for some reason!

The rate they're seeing printed to the terminal hovers around 5. I'm not sure what that rate means. Is it literally the inference frame rate?

I was asking @sheridana about this briefly today and he suggested a couple things to check:

Try doing inference when the data is stored on the local machine instead of being read over the network
Try just running the video with all default settings and see if it's something particular about the settings we're using
See if it depends on how much CPU is available; it's possible that there's a lot of latency that builds up over time when transferring data from CPU to GPU
Look into whether or not the GPU is actually even being used (which it looks like it might not be!)
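For the first check, a quick way to compare storage locations is to time a plain sequential read of the same video file from each one. This is just a sketch: `read_throughput_mb_s` is a hypothetical helper, the demo runs on a temporary file, and the real network and local paths are whatever applies on your machine.

```python
import os
import tempfile
from time import perf_counter

def read_throughput_mb_s(path, chunk_size=8 * 1024 * 1024):
    """Read a file sequentially and return the throughput in MB/s."""
    n_bytes = 0
    t0 = perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            n_bytes += len(chunk)
    elapsed = perf_counter() - t0
    # Guard against a zero elapsed time on very small files.
    return n_bytes / 1e6 / max(elapsed, 1e-9)

# Demo on a small temporary file; in practice, call the function once on the
# network copy of a video and once on a local copy, then compare.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
demo_path = tmp.name
demo_rate = read_throughput_mb_s(demo_path)
os.remove(demo_path)
print(f"{demo_rate:.1f} MB/s")
```

Keep in mind that the operating system caches recently read files, so a second read of the same file can look much faster than the first. This measures raw file reads only, not video decoding.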

Any advice?

Answered by jmdelahanty

Sep 29, 2022

To summarize the discussion here, two things really helped us achieve speedups from 5-10 FPS to over 400 FPS:

Correct drivers: SLEAP and other GPU-dependent software doesn't so much crash as struggle when non-optimal drivers are being referenced in your Linux terminal.
Separating out CPU and GPU workloads as @talmo suggested. This helps enormously with people's workflows. On A40s it appears that doing both CPU and GPU tasks in the same sleap-track call is sufficiently fast, but each step is about 2x as fast if they are separated out.
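The separation can be sketched as two `sleap-track` calls: one that only runs inference on the GPU, and one that only runs tracking on the saved predictions. The flag names below (`-m`, `-o`, `--tracking.tracker`) are written from memory of the SLEAP CLI and should be treated as assumptions, so check them against `sleap-track --help` for your version. The file names, the model path, and `build_split_commands` itself are hypothetical.

```python
import shlex

def build_split_commands(video, model_dir, predictions="predictions.slp",
                         tracked="tracked.slp", tracker="simple"):
    """Build one inference-only command and one tracking-only command."""
    infer_cmd = [
        "sleap-track", video,
        "-m", model_dir,
        "--tracking.tracker", "none",  # assumed flag: skip tracking in this pass
        "-o", predictions,
    ]
    track_cmd = [
        "sleap-track", predictions,  # assumed: re-track an existing predictions file
        "--tracking.tracker", tracker,
        "-o", tracked,
    ]
    return infer_cmd, track_cmd

infer_cmd, track_cmd = build_split_commands("video.mp4", "models/my_model")
print(shlex.join(infer_cmd))
print(shlex.join(track_cmd))
```

The point of the split is that the second command reads only the predictions written by the first, so the GPU pass never waits on the CPU-bound tracker.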

talmo · 2022-07-13T16:40:29Z

Maintainer

The most likely bottleneck is reading the data, especially if it's stored over the network.

The CPU stuff is very unlikely -- most of the heavy lifting in SLEAP is done on the GPU, so 20 cores is way overkill.

Other things to try:

Don't enable the tracker to see if the performance is limited by the model inference alone.
Increase the batch size. P100s are pretty big in terms of memory, so try 16 or 32.
Try training a smaller model. Model sizes of ~1-4 million parameters are considered very lightweight and should get you triple digit FPS.
If possible, downscale your images by 0.5x or 0.75x, as long as you won't lose resolution you need.
Depending on your use case, implementing a custom inference pipeline can optimize things if you want to handle the data loading yourself in a more specialized way.
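To put numbers on the downscaling suggestion: the pixel count scales with the square of the scale factor, so the amount of image data per frame drops faster than the factor itself suggests. A small worked example follows; the 1280x1024 frame size is hypothetical and not taken from this thread, and how closely inference time follows pixel count depends on the model.

```python
def pixels_after_scaling(width, height, scale):
    """Return (new_width, new_height, fraction of original pixels kept)."""
    new_w = int(round(width * scale))
    new_h = int(round(height * scale))
    return new_w, new_h, (new_w * new_h) / (width * height)

# 1280x1024 is a hypothetical frame size, not one reported in this thread.
for scale in (0.5, 0.75):
    w, h, frac = pixels_after_scaling(1280, 1024, scale)
    print(f"scale {scale}: {w}x{h}, {frac:.2%} of the original pixels")
```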

24 replies

jmdelahanty Sep 29, 2022
Author

Turns out this did not solve the issue in the linked discussion. It's happening sporadically and only to one specific project in the lab. Super odd! However, separating out the GPU and CPU workloads has dramatically improved performance. On our cluster's A40s, we're achieving inference speeds of 400+ FPS when only inference is done, and the tracking portion takes just a couple of minutes to complete on CPU. Things that once took a bunch of time (on the order of something like 4 hours!) now only take like 10 minutes! Amazing!

boadecea25 Jul 25, 2024

You can try doing each separately.

Time how long the video loading takes when doing it batch by batch:

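from time import perf_counter

import sleap

video = sleap.load_video("video.mp4")
batch_size = 16
# Cap at the video length; use len(video) for a full pass, but this might take a while.
n_frames = min(4096, len(video))
for i0 in range(0, n_frames, batch_size):
    t0 = perf_counter()
    i1 = min(i0 + batch_size, n_frames)
    imgs = video[i0:i1]  # read one batch of frames
    elapsed = perf_counter() - t0
    fps = len(imgs) / elapsed  # frames loaded per second for this batch
    print(fps)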

And the tracking using the approach in this notebook.

Hello @talmo, I did this, and even with preloaded frames and tracking excluded, SLEAP's inference speed stays the same. What could be the issue in this case?

jmdelahanty Jul 25, 2024
Author

I think this snippet just measures the FPS of reading frames from a video. The FPS it reports tells you what video loading is like, but it doesn't change what the inference step is doing. I'm pretty sure it can only tell you whether the bottleneck is in loading video frames or somewhere else.

boadecea25 Jul 25, 2024

So I ran benchmarking code for all three steps (loading videos, SLEAP inference on a preloaded video, and tracking). Surprisingly, the FPS for loading video and for tracking was around 250, while SLEAP inference alone was 11. Is this an innate problem with SLEAP, with my GPU, or something else?
From this, what I understand is that the bottleneck is the inference part by itself, which is unexpected.
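As a rough sanity check on those numbers, the per-stage rates can be combined into an overall rate. This sketch assumes the three stages run strictly one after another with no overlap, and it reads "around 250" as applying to loading and tracking separately; both are assumptions, and `serial_pipeline_fps` is a hypothetical helper.

```python
def serial_pipeline_fps(stage_fps):
    """Overall FPS when every frame passes through each stage one after another."""
    seconds_per_frame = sum(1.0 / fps for fps in stage_fps.values())
    return 1.0 / seconds_per_frame

# Rates reported above; treating the stages as strictly serial is an assumption.
stage_fps = {"loading": 250.0, "inference": 11.0, "tracking": 250.0}
overall = serial_pipeline_fps(stage_fps)
slowest = min(stage_fps, key=stage_fps.get)
share = (1.0 / stage_fps[slowest]) * overall  # fraction of per-frame time in that stage
print(f"overall ~{overall:.1f} FPS, slowest stage: {slowest} ({share:.0%} of time per frame)")
```

Under those assumptions, almost all of the time per frame is spent in inference, which matches the conclusion that speeding up loading or tracking would barely change the total.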

jmdelahanty Sep 13, 2024
Author

The only thing I can think of that could cause something like that is driver compatibility. Unfortunately, I'm not so experienced with debugging driver issues, and I'm not even sure where to start in determining the best ones! Maybe @talmo has some advice?
