[BUG] "NATIVE" does not work correctly in systems with Multiple GPUs with different architectures #605

Open
jperez999 opened this issue May 10, 2024 · 1 comment
Labels: ? - Needs Triage, bug

@jperez999

Describe the bug
I have a system with GPUs of multiple architectures [sm60, sm75, sm89]. When I try to build raft, the build fails with an error stating that only sm_70 and above is supported.

Steps/Code to reproduce bug
On a system with both supported and unsupported GPUs, try to build raft with ./build.sh libraft --compile-lib.

Expected behavior
I expect the build to complete for only the supported architectures available on my machine, and that I won't hit this build error.
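
To make the expected behavior concrete, here is a minimal sketch of the filtering step. It is not how rapids-cmake implements NATIVE. The function names are hypothetical, the minimum of 70 is taken from the "sm_70 and up" error text, and whether build.sh forwards the resulting flag to CMake is not covered here. `CMAKE_CUDA_ARCHITECTURES` itself is the standard CMake variable, which takes a semicolon-separated list.

```python
MIN_SUPPORTED_ARCH = 70  # assumption: taken from the "sm_70 and up" error text


def supported_archs(detected, minimum=MIN_SUPPORTED_ARCH):
    """Return the sorted, de-duplicated architectures at or above `minimum`."""
    return sorted({arch for arch in detected if arch >= minimum})


def cmake_arch_flag(detected, minimum=MIN_SUPPORTED_ARCH):
    """Build a -DCMAKE_CUDA_ARCHITECTURES=... flag from the supported subset."""
    archs = supported_archs(detected, minimum)
    if not archs:
        raise ValueError("no detected GPU meets the minimum architecture")
    return "-DCMAKE_CUDA_ARCHITECTURES=" + ";".join(str(a) for a in archs)


if __name__ == "__main__":
    # Architectures that appear in the failing compile line of this report.
    print(cmake_arch_flag([61, 75, 89]))
```

With the three architectures from the compile line, the sketch keeps 75 and 89 and drops 61.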

Environment details (please complete the following information):
Baremetal

  • CMake version - 3.29.2

Here is a copy of the error I receive:

FAILED: CMakeFiles/raft_objs.dir/src/distance/fused_distance_nn.cu.o 
/home/jperez/.conda/envs/rapids_raft/bin/nvcc -forward-unknown-to-host-compiler -DCUTLASS_NAMESPACE=raft_cutlass -DFMT_HEADER_ONLY=1 -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DNVTX_ENABLED -DRAFT_COMPILED -DRAFT_EXPLICIT_INSTANTIATE_ONLY -DRAFT_SYSTEM_LITTLE_ENDIAN=1 -DSPDLOG_FMT_EXTERNAL -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_DISABLE_ABI_NAMESPACE -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DTHRUST_IGNORE_ABI_NAMESPACE_ERROR -I/raid/workspace/raft/cpp/include -I/raid/workspace/raft/cpp/build/_deps/hnswlib-src -I/raid/workspace/raft/cpp/build/_deps/cccl-src/thrust/thrust/cmake/../.. -I/raid/workspace/raft/cpp/build/_deps/cccl-src/libcudacxx/lib/cmake/libcudacxx/../../../include -I/raid/workspace/raft/cpp/build/_deps/cccl-src/cub/cub/cmake/../.. -I/raid/workspace/raft/cpp/build/_deps/nvidiacutlass-src/include -I/raid/workspace/raft/cpp/build/_deps/nvidiacutlass-build/include -isystem /home/jperez/.conda/envs/rapids_raft/include -isystem /home/jperez/.conda/envs/rapids_raft/targets/x86_64-linux/include -O3 -DNDEBUG -std=c++17 "--generate-code=arch=compute_61,code=[sm_61]" "--generate-code=arch=compute_75,code=[sm_75]" "--generate-code=arch=compute_89,code=[sm_89]" -Xcompiler=-fPIC -Xcompiler=-Wno-deprecated-declarations -DRAFT_HIDE_DEPRECATION_WARNINGS -Xcompiler=-Wall,-Werror,-Wno-error=deprecated-declarations -Werror=all-warnings --expt-extended-lambda --expt-relaxed-constexpr -DCUDA_API_PER_THREAD_DEFAULT_STREAM -Xfatbin=-compress-all -Xcompiler=-fopenmp -Xcompiler -pthread -MD -MT CMakeFiles/raft_objs.dir/src/distance/fused_distance_nn.cu.o -MF CMakeFiles/raft_objs.dir/src/distance/fused_distance_nn.cu.o.d -x cu -c /raid/workspace/raft/cpp/src/distance/fused_distance_nn.cu -o CMakeFiles/raft_objs.dir/src/distance/fused_distance_nn.cu.o
In file included from /raid/workspace/raft/cpp/build/_deps/cccl-src/libcudacxx/lib/cmake/libcudacxx/../../../include/cuda/semaphore:14,
                 from /raid/workspace/raft/cpp/include/raft/distance/detail/fused_distance_nn/epilogue_elementwise.cuh:59,
                 from /raid/workspace/raft/cpp/include/raft/distance/detail/fused_distance_nn/cutlass_base.cuh:29,
                 from /raid/workspace/raft/cpp/include/raft/distance/detail/fused_distance_nn.cuh:22,
                 from /raid/workspace/raft/cpp/include/raft/distance/fused_distance_nn-inl.cuh:23,
                 from /raid/workspace/raft/cpp/src/distance/fused_distance_nn.cu:18:
/raid/workspace/raft/cpp/build/_deps/cccl-src/libcudacxx/lib/cmake/libcudacxx/../../../include/cuda/std/semaphore:12:4: error: #error "CUDA synchronization primitives are only supported for sm_70 and up."
   12 | #  error "CUDA synchronization primitives are only supported for sm_70 and up."
      |    ^~~~~
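
For anyone reading a log like this, a small sketch for pulling the target architectures out of an nvcc command line and flagging those below a minimum may help. The helper names are hypothetical, the minimum of 70 comes from the error text above, and the pattern only matches the `--generate-code=arch=compute_NN,code=[sm_NN]` form seen in this log, not other spellings nvcc accepts.

```python
import re

# Matches flags of the form --generate-code=arch=compute_61,code=[sm_61]
GENCODE_RE = re.compile(r"--generate-code=arch=compute_(\d+),code=\[sm_(\d+)\]")


def gencode_archs(command_line):
    """Return the sm_NN numbers requested on an nvcc command line, in order."""
    return [int(sm) for _compute, sm in GENCODE_RE.findall(command_line)]


def below_minimum(command_line, minimum=70):
    """Return the requested architectures that fall below `minimum`."""
    return [arch for arch in gencode_archs(command_line) if arch < minimum]


if __name__ == "__main__":
    # Excerpt of the flags from the failing compile line.
    flags = (
        '"--generate-code=arch=compute_61,code=[sm_61]" '
        '"--generate-code=arch=compute_75,code=[sm_75]" '
        '"--generate-code=arch=compute_89,code=[sm_89]"'
    )
    print(gencode_archs(flags))
    print(below_minimum(flags))
```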
@robertmaynard
Contributor

Can you please post the contents of raft/cpp/build/CMakeFiles/CMakeConfigureLog.yaml and raft/cpp/build/eval_gpu_archs.stderr.log? If possible, also run the eval_gpu_archs executable in raft/cpp/build/ and verify that it outputs the architectures on your machine.

The compile line you have included has zero CUDA archs specified, so nvcc is defaulting to building for sm52. This would occur if rapids-cmake failed to compile the detection program, or if the CUDA driver or runtime wasn't being loaded properly (LD_LIBRARY_PATH not set). Another possibility is that you are building inside a container; if so, verify that you are enabling GPU passthrough.
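
As an independent cross-check of what the driver reports, a rough sketch along these lines could be compared against the eval_gpu_archs output. It is not part of rapids-cmake and the function names are hypothetical. It assumes nvidia-smi is on PATH and that the driver is recent enough to support the `compute_cap` query field.

```python
import subprocess


def parse_compute_caps(text):
    """Convert lines such as '7.5' into integer architectures such as 75."""
    archs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        major, minor = line.split(".")
        archs.append(int(major) * 10 + int(minor))
    return archs


def query_compute_caps():
    """Ask nvidia-smi for each GPU's compute capability (needs a working driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compute_caps(result.stdout)

# Usage on a machine with GPUs: print(query_compute_caps())
```

If `query_compute_caps()` raises or returns an empty list, that would point toward the driver or container passthrough problems mentioned above rather than the architecture list itself.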

@robertmaynard self-assigned this May 21, 2024