Eval code not syncing between GPUs #83

SnakeOnex · 2025-02-03T19:09:47Z

Hello,

looking at train_utils.py:581, in the eval_reconstruction section, there doesn't seem to be any sync of eval values between the GPUs. So each GPU computes rFID etc. on its part of the eval set, but only the value from the main GPU is reported. I'm not sure if I am reading it right, or if you are already aware of this and want it this way, but I wanted to bring it to your attention just in case.

cornettoyu · 2025-02-03T22:17:50Z

Thanks for the note.

The current code should run evaluation on the full val set on each device (GPU), so the online evaluation still reports the correct numbers, but the computation is redundant (each device evaluates the whole dataset). We will check on our side and update with a distributed version to speed up the evaluation process soon. Feel free to let me know if you have any questions :)
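
For anyone wondering what a distributed version would need to sync: a FID-style metric depends on the mean and covariance of features over the whole eval set, so averaging per-GPU metric values would generally not match the full-set number. One common approach is to reduce per-rank sufficient statistics (count, feature sum, sum of outer products) and compute the statistics once from the reduced values. Below is a minimal pure-Python sketch of that idea, with ranks simulated as list slices. All function names are hypothetical and none of this is taken from train_utils.py; in a real run, `reduce_sum` would presumably be replaced by an all-reduce with a SUM op (e.g. `torch.distributed.all_reduce`).

```python
# Sketch only: simulates per-rank sufficient statistics for a FID-style metric.
# All names are hypothetical; nothing here comes from train_utils.py.

def partial_stats(features):
    """Return (count, sum vector, sum of outer products) for one rank's shard."""
    dim = len(features[0])
    total = [0.0] * dim
    outer = [[0.0] * dim for _ in range(dim)]
    for f in features:
        for i in range(dim):
            total[i] += f[i]
            for j in range(dim):
                outer[i][j] += f[i] * f[j]
    return len(features), total, outer


def reduce_sum(stats_per_rank):
    """Element-wise sum across ranks; stands in for an all-reduce with SUM."""
    dim = len(stats_per_rank[0][1])
    n = sum(s[0] for s in stats_per_rank)
    total = [sum(s[1][i] for s in stats_per_rank) for i in range(dim)]
    outer = [
        [sum(s[2][i][j] for s in stats_per_rank) for j in range(dim)]
        for i in range(dim)
    ]
    return n, total, outer


def mean_and_cov(n, total, outer):
    """Recover the mean and unbiased covariance from the reduced statistics."""
    dim = len(total)
    mean = [t / n for t in total]
    cov = [
        [(outer[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    return mean, cov


# Toy "features" for a 6-sample eval set, split unevenly across 2 simulated ranks.
full_set = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0], [6.0, 4.0]]
shards = [full_set[:2], full_set[2:]]

# Reference: statistics computed on the full set by a single process.
mean_full, cov_full = mean_and_cov(*partial_stats(full_set))

# Synced: each rank computes partial sums, which are reduced before use.
mean_sync, cov_sync = mean_and_cov(*reduce_sum([partial_stats(s) for s in shards]))

# Naive: average of per-shard means, which ignores the uneven shard sizes.
shard_means = [mean_and_cov(*partial_stats(s))[0] for s in shards]
mean_naive = [sum(m[i] for m in shard_means) / len(shard_means) for i in range(2)]
```

Here `mean_sync` and `cov_sync` match the single-process values, while `mean_naive` does not, because the shards have different sizes. The same reasoning applies to the covariance, which is why reducing the raw sums is usually preferred over averaging finished metrics.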
