
Significant drop in the model's performance metric (Top K Accuracy) when we go from 1 GPU to 2 or 4 GPUs. #733

Open
hugoferrero opened this issue Aug 8, 2024 · 0 comments

Comments

@hugoferrero

Hi, as the title says, the performance of the model drops when I use a cluster of GPUs.
The (custom) training job runs on the Vertex AI training service.
This is the image I am using: us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-9:latest
These are the machine types: a2-ultragpu-1g, a2-ultragpu-2g, and a2-ultragpu-4g, with 1, 2, and 4 GPUs respectively.
I'm following this tutorial:
https://www.tensorflow.org/recommenders/examples/diststrat_retrieval
This is my implementation of the strategy:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))
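
For context, this is roughly how the strategy is wired into the training code (a minimal sketch; build_and_compile_model is a placeholder for the MovielensModel construction and compilation from the tutorial, and cached_train for the batched training dataset shown further down):

import tensorflow as tf

# Reduce gradients onto the CPU so every GPU applies the same aggregated update.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="cpu:0"))
print("Number of replicas:", strategy.num_replicas_in_sync)  # 1, 2 or 4 depending on the machine

# Model creation and compilation happen inside the strategy scope, as in the tutorial.
with strategy.scope():
    model = build_and_compile_model()  # placeholder for the tutorial's MovielensModel + Adagrad

model.fit(cached_train, epochs=3)  # cached_train: the batched/cached training dataset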

I increased the batch size for 2 and 4 GPUs:
batch_size_2GPUs = batch_size_1GPU * 2
batch_size_4GPUs = batch_size_1GPU * 4
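
Concretely, the global batch size is derived from the single-GPU batch size and the replica count before batching the dataset (again a sketch; 8192 is the value used in the tutorial and train is the MovieLens training split from the tutorial, so my actual values may differ):

batch_size_1GPU = 8192  # tutorial value; my real per-GPU batch size may differ
global_batch_size = batch_size_1GPU * strategy.num_replicas_in_sync  # x2 on 2 GPUs, x4 on 4 GPUs

# Batch and cache the training split with the scaled global batch size, as in the tutorial.
cached_train = train.shuffle(100_000).batch(global_batch_size).cache()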

Is there anything else I need to do, at the code level, to get the same performance metrics in each case?
Thanks in advance.
