CUDA_ERROR_OUT_OF_MEMORY: out of memory Deepmd-kit V3.0.1 #4604

Open
Manyi-Yang opened this issue Feb 18, 2025 · 2 comments
@Manyi-Yang

Bug summary

When I run a dp train job (descriptor "se_e2_a") with DeePMD-kit v3.0.1, I hit the following memory problem. With the same input file and computational resources, it works well after switching the DeePMD-kit version from v3.0.1 to v2.2.9.

[2025-02-18 07:20:30,704] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2025-02-18 07:22:41,365] DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
2025-02-18 07:04:22.712642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 92498 MB memory: -> device: 0, name: NVIDIA GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
2025-02-18 07:04:22.717361: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2025-02-18 07:04:22.843798: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaab004f1070
[2025-02-18 07:04:24,676] DEEPMD INFO Adjust batch size from 1024 to 2048
[2025-02-18 07:04:24,729] DEEPMD INFO Adjust batch size from 2048 to 4096
[2025-02-18 07:04:24,786] DEEPMD INFO Adjust batch size from 4096 to 8192
[2025-02-18 07:04:24,995] DEEPMD INFO Adjust batch size from 8192 to 16384
[2025-02-18 07:04:25,642] DEEPMD INFO Adjust batch size from 16384 to 32768
[2025-02-18 07:04:26,180] DEEPMD INFO Adjust batch size from 32768 to 65536
[2025-02-18 07:04:27,249] DEEPMD INFO Adjust batch size from 65536 to 131072
2025-02-18 07:04:27.405366: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 26.33GiB (28270264320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2025-02-18 07:04:27.598927: E external/local_xla/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2025-02-18 07:04:27.599079: F tensorflow/core/common_runtime/device/device_event_mgr.cc:223] Unexpected Event status: 1

DeePMD-kit Version

DeePMD-kit v3.0.1

Backend and its version

TensorFlow

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

"descriptor": { "type": "se_e2_a", "_sel": [90, 80, 120, 12], "rcut_smth": 3.0, "rcut": 4.0, "neuron": [ 32, 64, 128 ], "resnet_dt": false, "axis_neuron": 16, "seed": 7232, "_comment": " that's all" }, "fitting_net": { "neuron": [ 240, 240, 240, 240

Steps to Reproduce

dp train --mpi-log=master input.json >> train_gpu.log 2>&1
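For context, the crash happens during the neighbor-statistics step that runs before training, which the log above notes can be skipped. A minimal reproduction/workaround sketch, assuming a bash-like shell and the same input.json (skipping the step only bypasses the crash, it does not address the underlying OOM):

# reproduce the crash (neighbor statistics run by default)
dp train --mpi-log=master input.json >> train_gpu.log 2>&1

# bypass the neighbor-statistics step, as suggested by the log message
dp train --skip-neighbor-stat --mpi-log=master input.json >> train_gpu.log 2>&1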

Further Information, Files, and Links

No response

@Yi-FanLi
Collaborator

Yi-FanLi commented Feb 19, 2025

Hi @Manyi-Yang, we encountered this issue previously and suspected it was caused by a TensorFlow update. That is why we added the note about setting the batch size to the printed message: #3822, #4283. Could you please try setting export DP_INFER_BATCH_SIZE=65536, or an even smaller value, and see whether the error still appears?
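For clarity, a minimal sketch of the suggested workaround, assuming a bash-like shell and the run command from the issue:

# cap the inference batch size (nframes * natoms) used by the neighbor-statistics step
export DP_INFER_BATCH_SIZE=65536
dp train --mpi-log=master input.json >> train_gpu.log 2>&1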

@Manyi-Yang
Author

Hello YiFan,

Thank you very much for your reply.
I have tested it with export DP_INFER_BATCH_SIZE=65536, but it does not seem to work here. I've also tried smaller values such as 500, but that didn't work either.

> [2025-02-20 02:44:13,838] DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
> [2025-02-20 02:46:39,386] DEEPMD INFO    If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
> 2025-02-20 02:46:39.660054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 92505 MB memory:  -> device: 0, name: NVIDIA GH200 120GB, pci bus id: 0009:01:00.0, compute capability: 9.0
> 2025-02-20 02:46:39.664742: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
> 2025-02-20 02:46:39.819202: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0xaaaad9c11df0
> [2025-02-20 02:46:42,876] DEEPMD INFO    Adjust batch size from 1024 to 2048
> [2025-02-20 02:46:42,928] DEEPMD INFO    Adjust batch size from 2048 to 4096
> [2025-02-20 02:46:42,986] DEEPMD INFO    Adjust batch size from 4096 to 8192
> [2025-02-20 02:46:43,196] DEEPMD INFO    Adjust batch size from 8192 to 16384
> [2025-02-20 02:46:43,865] DEEPMD INFO    Adjust batch size from 16384 to 32768
> [2025-02-20 02:46:44,422] DEEPMD INFO    Adjust batch size from 32768 to 65536
> [2025-02-20 02:46:45,507] DEEPMD INFO    Adjust batch size from 65536 to 131072
> 2025-02-20 02:46:45.874190: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 26.33GiB (28277473280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
> 2025-02-20 02:46:45.875366: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 23.70GiB (25449725952 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.877794: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 21.33GiB (22904752128 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879037: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 19.20GiB (20614277120 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879084: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 17.28GiB (18552848384 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879103: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 15.55GiB (16697563136 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.879121: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 14.00GiB (15027806208 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880518: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 12.60GiB (13525024768 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880546: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 11.34GiB (12172521472 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880573: W external/local_tsl/tsl/framework/bfc_allocator.cc:366] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
> 2025-02-20 02:46:45.880600: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x4003e4000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880631: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400400000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880666: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400420000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.880883: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400460000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.882674: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400580000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.884786: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1118] failed to free device memory at 0x400720000000; result: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.887071: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:45.887606: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:55.887768: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:55.916206: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1101] failed to allocate 40.33GiB (43309858816 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
> 2025-02-20 02:46:55.916241: W external/local_tsl/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 10.37GiB (rounded to 11132284160)requested by op Sum
> If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
> Current allocation summary follows.
> Current allocation summary follows.
> 2025-02-20 02:46:55.916259: I external/local_tsl/tsl/framework/bfc_allocator.cc:1039] BFCAllocator dump for GPU_0_bfc
> 2025-02-20 02:46:55.916278: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (256):        Total Chunks: 10, Chunks in use: 10. 2.5KiB allocated for chunks. 2.5KiB in use in bin. 52B client-requested in use in bin.
> 2025-02-20 02:46:55.916289: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (512):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916300: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (1024):       Total Chunks: 2, Chunks in use: 2. 2.5KiB allocated for chunks. 2.5KiB in use in bin. 2.0KiB client-requested in use in bin.
> 2025-02-20 02:46:55.916309: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (2048):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916316: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (4096):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916323: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (8192):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916329: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (16384):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916335: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (32768):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916341: I external/local_tsl/tsl/framework/bfc_allocator.cc:1046] Bin (65536):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
> 2025-02-20 02:46:55.916355: F external/local_tsl/tsl/framework/bfc_allocator.cc:1043] Check failed: b->free_chunks.size() == bin_info.total_chunks_in_bin - bin_info.total_chunks_in_use (1 vs. 2)
