DeepSpeed Installation Fails During Docker Build (NVML Initialization Issue) #6945

Open
asdfry opened this issue Jan 13, 2025 · 1 comment

asdfry commented Jan 13, 2025

Hello,
I encountered an issue while building a Docker image for deep learning model training, specifically when attempting to install DeepSpeed.

Issue
When I build the Docker image, the DeepSpeed installation fails, and the build log shows a warning that NVML cannot be initialized.
However, if I create a container from the same image and install DeepSpeed inside the container, the installation works without any issues.

Environment
Base Image: nvcr.io/nvidia/pytorch:23.01-py3
DeepSpeed Version: 0.16.2

Build Log
docker_build.log

Additional Context
The problem does not occur with the newer base image nvcr.io/nvidia/pytorch:24.05-py3.

Thank you.

loadams self-assigned this Jan 13, 2025

loadams (Contributor) commented Jan 13, 2025

Hi @asdfry - the errors appear to come from gcc. Perhaps the gcc versions are different and that is causing the issue?

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Also, some of the warnings cluttering the output come from py-cpuinfo not being installed. Could you install it and share the log again?
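
One way to compare the failing build step with the working container is to run the same small diagnostic in both places and diff the output. The following is a minimal sketch, not part of DeepSpeed: the helper names (`tool_version`, `cuda_visible`, `collect`) are hypothetical, and it assumes that the compiler banner, GPU visibility, and the presence of py-cpuinfo (which is imported as `cpuinfo`) are the relevant differences.

```python
import importlib.util
import shutil
import subprocess


def tool_version(cmd):
    """Return the first line printed by `<cmd> --version`, or None if unavailable."""
    path = shutil.which(cmd)
    if path is None:
        return None
    try:
        out = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    text = (out.stdout or out.stderr).strip()
    return text.splitlines()[0] if text else None


def cuda_visible():
    """Return True/False if torch can see a GPU, or None if torch cannot be imported."""
    try:
        import torch
    except Exception:
        return None
    return bool(torch.cuda.is_available())


def collect():
    """Gather the facts most likely to differ between build time and run time."""
    return {
        "gcc": tool_version("gcc"),
        "x86_64-linux-gnu-gcc": tool_version("x86_64-linux-gnu-gcc"),
        "cuda_visible": cuda_visible(),
        # py-cpuinfo installs a top-level module named `cpuinfo`.
        "py_cpuinfo_installed": importlib.util.find_spec("cpuinfo") is not None,
    }


if __name__ == "__main__":
    for key, value in collect().items():
        print(f"{key}: {value}")
```

Running the script once from a `RUN` step in the Dockerfile and once inside a container started from the same image would show whether the compiler or GPU visibility differs between the two. `docker build` typically runs without GPU access, so `cuda_visible` may be `False` there even when it is `True` in a container.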
