
Significant drop in the model's performance metric (Top K Accuracy) when we go from 1 GPU to 2 or 4 GPUs. #733

Open
hugoferrero opened this issue Aug 8, 2024 · 0 comments

Comments

@hugoferrero

Hi, as the title says, the performance of the model drops when I use a cluster of GPUs.
The (custom) training job runs on the Vertex AI training service.
This is the image I am using: us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-9:latest
These are the machine types: a2-ultragpu-1g, a2-ultragpu-2g, and a2-ultragpu-4g, with 1, 2, and 4 GPUs respectively.
I'm following this tutorial:
https://www.tensorflow.org/recommenders/examples/diststrat_retrieval
This is my implementation of the strategy:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))
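
For context, this is roughly how the strategy is wired into the training code (a minimal sketch; build_and_compile_model is a placeholder for the MovielensModel construction and compilation from the tutorial, and cached_train for the batched training dataset shown further down):

import tensorflow as tf

# Reduce gradients onto the CPU so every GPU applies the same aggregated update.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))
print("Number of replicas:", strategy.num_replicas_in_sync)  # 1, 2 or 4 depending on the machine

# Model creation and compilation happen inside the strategy scope, as in the tutorial.
with strategy.scope():
    model = build_and_compile_model()  # placeholder for the tutorial's MovielensModel + Adagrad

model.fit(cached_train, epochs=3)  # cached_train: the batched/cached training dataset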

I increased the batch size for 2 and 4 GPUs:
batch_size_2GPUs = batch_size_1GPU * 2
batch_size_4GPUs = batch_size_1GPU * 4
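
Concretely, the global batch size is derived from the single-GPU batch size and the replica count before batching the dataset (again a sketch; 8192 is the value used in the tutorial and train is the MovieLens training split from the tutorial, so my actual values may differ):

batch_size_1GPU = 8192  # tutorial value; my real per-GPU batch size may differ
global_batch_size = batch_size_1GPU * strategy.num_replicas_in_sync  # x2 on 2 GPUs, x4 on 4 GPUs

# Batch and cache the training split with the scaled global batch size, as in the tutorial.
cached_train = train.shuffle(100_000).batch(global_batch_size).cache()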

Is there anything else I need to do, at the code level, to get the same performance metrics in each case?
Thanks in advance.
