CUDA_ERROR_OUT_OF_MEMORY: out of memory Deepmd-kit V3.0.1 #4604
Hi @Manyi-Yang, we encountered this issue previously, and we thought it was due to some update of TensorFlow. Therefore, we added the info about setting the batch size to the printed message: #3822, #4283. Could you please try to set
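The setting referred to is presumably the DP_INFER_BATCH_SIZE environment variable mentioned in the log below. A minimal sketch of applying it before training; the value 1024 is only an illustration, and should be tuned to the available GPU memory:

```shell
# Cap the inference batch size (nframes * natoms) so the
# neighbor-statistics step does not attempt ever-larger allocations.
# 1024 is an illustrative value, not a recommendation.
export DP_INFER_BATCH_SIZE=1024

# Then run training as usual, e.g.:
# dp train input.json

echo "DP_INFER_BATCH_SIZE=$DP_INFER_BATCH_SIZE"
```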
Hello YiFan, thank you very much for your reply.
Bug summary
When I run a dp train ("se_e2_a") job with DeePMD-kit v3.0.1, I hit the following memory problem.
With the same input file and computational resources, but with DeePMD-kit v2.2.9 instead of v3.0.1, it works well.
[2025-02-18 07:20:30,704] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2025-02-18 07:22:41,365] DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
2025-02-18 07:04:22.712642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 92498 MB memory: -> device: 0, name: NVIDIA GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2025-02-18 07:04:22.717361: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2025-02-18 07:04:22.843798: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaab004f1070
[2025-02-18 07:04:24,676] DEEPMD INFO Adjust batch size from 1024 to 2048
[2025-02-18 07:04:24,729] DEEPMD INFO Adjust batch size from 2048 to 4096
[2025-02-18 07:04:24,786] DEEPMD INFO Adjust batch size from 4096 to 8192
[2025-02-18 07:04:24,995] DEEPMD INFO Adjust batch size from 8192 to 16384
[2025-02-18 07:04:25,642] DEEPMD INFO Adjust batch size from 16384 to 32768
[2025-02-18 07:04:26,180] DEEPMD INFO Adjust batch size from 32768 to 65536
[2025-02-18 07:04:27,249] DEEPMD INFO Adjust batch size from 65536 to 131072
2025-02-18 07:04:27.405366: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 26.33GiB (28270264320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2025-02-18 07:04:27.598927: E external/local_xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2025-02-18 07:04:27.599079: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1
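The "Adjust batch size" lines above show an automatic search that keeps doubling the batch size until an allocation fails; here the final doubling from 131072 requested ~26 GiB and triggered the OOM. A minimal sketch of that search logic (hypothetical, not DeePMD-kit's actual implementation):

```python
def find_max_batch(fits_in_memory, start=1024):
    """Double the batch size while the next size still fits in memory.

    `fits_in_memory` is a caller-supplied probe; in the real code this
    would be an actual GPU allocation attempt.
    """
    batch = start
    while fits_in_memory(batch * 2):
        batch *= 2
    return batch

# Toy capacity check: pretend any allocation beyond 131072 frames*atoms
# fails, as in the log above.
print(find_max_batch(lambda b: b <= 131072))  # -> 131072
```

Setting DP_INFER_BATCH_SIZE caps this search at a fixed value instead of letting it probe upward until the GPU runs out of memory.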
DeePMD-kit Version
DeePMD-kit v3.0.1
Backend and its version
TensorFlow
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
"descriptor": {
    "type": "se_e2_a",
    "_sel": [90, 80, 120, 12],
    "rcut_smth": 3.0,
    "rcut": 4.0,
    "neuron": [32, 64, 128],
    "resnet_dt": false,
    "axis_neuron": 16,
    "seed": 7232,
    "_comment": " that's all"
},
"fitting_net": {
    "neuron": [240, 240, 240, 240
Steps to Reproduce
dp train --mpi-log=master input.json >> train_gpu.log 2>&1
Further Information, Files, and Links
No response