CUDA_ERROR_OUT_OF_MEMORY: out of memory Deepmd-kit V3.0.1 #4604

Open
Manyi-Yang opened this issue Feb 18, 2025 · 2 comments
@Manyi-Yang

Bug summary

When I run a dp train job (descriptor "se_e2_a") with DeePMD-kit v3.0.1, I hit the following memory problem. With the same input file and computational resources, it works well after switching the DeePMD-kit version from v3.0.1 to v2.2.9.

[2025-02-18 07:20:30,704] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2025-02-18 07:22:41,365] DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
2025-02-18 07:04:22.712642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 92498 MB memory: -> device: 0, name: NVIDIA GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2025-02-18 07:04:22.717361: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2025-02-18 07:04:22.843798: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaab004f1070
[2025-02-18 07:04:24,676] DEEPMD INFO Adjust batch size from 1024 to 2048
[2025-02-18 07:04:24,729] DEEPMD INFO Adjust batch size from 2048 to 4096
[2025-02-18 07:04:24,786] DEEPMD INFO Adjust batch size from 4096 to 8192
[2025-02-18 07:04:24,995] DEEPMD INFO Adjust batch size from 8192 to 16384
[2025-02-18 07:04:25,642] DEEPMD INFO Adjust batch size from 16384 to 32768
[2025-02-18 07:04:26,180] DEEPMD INFO Adjust batch size from 32768 to 65536
[2025-02-18 07:04:27,249] DEEPMD INFO Adjust batch size from 65536 to 131072
2025-02-18 07:04:27.405366: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 26.33GiB (28270264320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2025-02-18 07:04:27.598927: E external/local_xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2025-02-18 07:04:27.599079: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1

DeePMD-kit Version

DeePMD-kit v3.0.1

Backend and its version

TensorFlow

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

"descriptor": { "type": "se_e2_a", "_sel": [90, 80, 120, 12], "rcut_smth": 3.0, "rcut": 4.0, "neuron": [ 32, 64, 128 ], "resnet_dt": false, "axis_neuron": 16, "seed": 7232, "_comment": " that's all" }, "fitting_net": { "neuron": [ 240, 240, 240, 240

Steps to Reproduce

dp train --mpi-log=master input.json >> train_gpu.log 2>&1
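For context, the crash happens during the neighbor-statistics step that runs before training, which the log above notes can be skipped. A minimal reproduction/workaround sketch, assuming a bash-like shell and the same input.json (skipping the step only bypasses the crash, it does not address the underlying OOM):

# reproduce the crash (neighbor statistics run by default)
dp train --mpi-log=master input.json >> train_gpu.log 2>&1

# bypass the neighbor-statistics step, as suggested by the log message
dp train --skip-neighbor-stat --mpi-log=master input.json >> train_gpu.log 2>&1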

Further Information, Files, and Links

No response

@Yi-FanLi
Collaborator

Yi-FanLi commented Feb 19, 2025

Hi @Manyi-Yang, we encountered this issue previously and suspected it was caused by a TensorFlow update. That is why we added the note about setting the batch size to the printed message: #3822, #4283. Could you please try setting export DP_INFER_BATCH_SIZE=65536, or an even smaller value, and see whether the error still appears?
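For clarity, a minimal sketch of the suggested workaround, assuming a bash-like shell and the run command from the issue:

# cap the inference batch size (nframes * natoms) used by the neighbor-statistics step
export DP_INFER_BATCH_SIZE=65536
dp train --mpi-log=master input.json >> train_gpu.log 2>&1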

@Manyi-Yang
Author

Hello YiFan,

Thank you very much for your reply.
I have tested it with export DP_INFER_BATCH_SIZE=65536, but it does not seem to work here. I've also tried smaller values such as 500, but that didn't work either.

> [2025-02-20 02:44:13,838] DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
> [2025-02-20 02:46:39,386] DEEPMD INFO    If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
> 2025-02-20 02:46:39.660054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 92505 MB memory:  -> device: 0, name: NVIDIA GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
> 2025-02-20 02:46:39.664742: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
> 2025-02-20 02:46:39.819202: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaad9c11df0
> [2025-02-20 02:46:42,876] DEEPMD INFO    Adjust batch size from 1024 to 2048
> [2025-02-20 02:46:42,928] DEEPMD INFO    Adjust batch size from 2048 to 4096
> [2025-02-20 02:46:42,986] DEEPMD INFO    Adjust batch size from 4096 to 8192
> [2025-02-20 02:46:43,196] DEEPMD INFO    Adjust batch size from 8192 to 16384
> [2025-02-20 02:46:43,865] DEEPMD INFO    Adjust batch size from 16384 to 32768
> [2025-02-20 02:46:44,422] DEEPMD INFO    Adjust batch size from 32768 to 65536
> [2025-02-20 02:46:45,507] DEEPMD INFO    Adjust batch size from 65536 to 131072
> 2025-02-20 02:46:45.874190: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 26.33GiB (28277473280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
> 2025-02-20 02:46:45.875366: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 23.70GiB (25449725952 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.877794: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 21.33GiB (22904752128 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879037: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 19.20GiB (20614277120 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879084: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 17.28GiB (18552848384 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879103: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 15.55GiB (16697563136 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879121: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 14.00GiB (15027806208 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880518: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 12.60GiB (13525024768 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880546: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 11.34GiB (12172521472 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880573: W external/local_tsl/tsl/framework/bfc_allocator.cc:366] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
> 2025-02-20 02:46:45.880600: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x4003e4000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880631: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400400000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880666: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400420000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880883: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400460000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.882674: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400580000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.884786: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400720000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.887071: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.887606: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:55.887768: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:55.916206: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:55.916241: W external/local_tsl/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 10.37GiB (rounded to 11132284160)requested by op Sum
> If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
> Current allocation summary follows.
> Current allocation summary follows.
> 2025-02-20 02:46:55.916259: I external/local_tsl/tsl/framework/bfc_allocator.cc:1039] BFCAllocator dump for GPU_0_bfc
> 2025-02-20 02:46:55.916278: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (256):        Total Chunks: 10, Chunks in use: 10. 2.5KiB allocated for chunks. 2.5KiB in use in bin. 52B client-requested in use in bin.
> 2025-02-20 02:46:55.916289: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (512):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916300: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (1024):       Total Chunks: 2, Chunks in use: 2. 2.5KiB allocated for chunks. 2.5KiB in use in bin. 2.0KiB client-requested in use in bin.
> 2025-02-20 02:46:55.916309: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (2048):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916316: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (4096):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916323: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (8192):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916329: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (16384):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916335: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (32768):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916341: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (65536):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916355: F external/local_tsl/tsl/framework/bfc_allocator.cc:1043] Check failed: b->free_chunks.size() == bin_info.total_chunks_in_bin - bin_info.total_chunks_in_use (1 vs. 2)
