About Torch Extension ERROR. #1786
Unanswered
chenjunyu66 asked this question in Q&A
Replies: 0 comments
When I rewrote Megatron_LM with DeepSpeed, I encountered the following error while building the torch extension:
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error: internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption