Thrust 1.9.8 (CUDA Toolkit 11.0 Early Access)
Thrust 1.9.8, which is included in the CUDA Toolkit 11.0 release, removes Thrust's internal derivative of CUB, upstreams all relevant changes to CUB, and adds CUB as a Git submodule. It is now necessary to run `git clone --recursive` when checking out Thrust, and to update the CUB submodule when pulling in new Thrust changes. Additionally, CUB is now included as a first-class citizen in the CUDA Toolkit. Thrust 1.9.8 also fixes bugs that prevented most Thrust algorithms from working with more than `2^31-1` elements. Now, `thrust::reduce`, `thrust::*_scan`, and related algorithms (that is, most of Thrust) work with large element counts.
Breaking Changes
- Thrust will now use the version of CUB in your include path instead of its own internal copy. If you are using your own version of CUB, it may be older and incompatible with Thrust. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.
Other Enhancements
- Refactor Thrust and CUB to support 64-bit indices in most algorithms. In most cases, Thrust now selects between kernels that use 32-bit indices and 64-bit indices at runtime depending on the size of the input. This means large element counts work, but small element counts do not have to pay for the register usage of 64-bit indices if they are not needed. Now, `thrust::reduce`, `thrust::*_scan`, and related algorithms (that is, most of Thrust) work with more than `2^31-1` elements. Notably, `thrust::sort` is still limited to less than `2^31-1` elements.
- CUB is now a submodule and the internal copy of CUB has been removed.
- #1051: Stop specifying the `__launch_bounds__` minimum blocks parameter because it messes up register allocation and increases register pressure, and we don't actually know at compile time how many blocks we will use (aside from single-tile kernels).
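The runtime selection between 32-bit and 64-bit index kernels described in the first item can be pictured with a small host-only C++ sketch. This is not Thrust's or CUB's actual dispatch code: `sum_impl`, `sum_dispatch`, and `needs_64bit_indices` are hypothetical names, and an ordinary loop stands in for a CUDA kernel. The only point is that the index type is a template parameter and a thin wrapper picks the instantiation from the element count.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a kernel: the loop counter uses IndexT, so the
// 32-bit instantiation never pays for 64-bit index arithmetic.
template <typename IndexT>
long long sum_impl(const int* data, IndexT n)
{
  long long total = 0;
  for (IndexT i = 0; i < n; ++i)
  {
    total += data[i];
  }
  return total;
}

// True when the element count cannot be represented by a signed 32-bit index,
// that is, when it exceeds 2^31-1.
inline bool needs_64bit_indices(std::uint64_t n)
{
  return n > static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Chooses the instantiation at runtime from the input size.
inline long long sum_dispatch(const int* data, std::size_t n)
{
  if (needs_64bit_indices(n))
  {
    return sum_impl<std::int64_t>(data, static_cast<std::int64_t>(n));
  }
  return sum_impl<std::int32_t>(data, static_cast<std::int32_t>(n));
}
```

Both instantiations are compiled, but each call runs only one of them, so small inputs keep the cheaper 32-bit index path.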
Bug Fixes
- #1020: After making a CUDA API call, always clear the global CUDA error state by calling `cudaGetLastError`.
- #1021: Avoid calling destroy in the destructor of a Thrust vector if the vector is empty.
- #1046: Actually throw `thrust::bad_alloc` when `thrust::system::cuda::malloc` fails instead of just constructing a temporary and doing nothing with it.
- Add missing copy constructor or copy assignment operator to all classes that GCC 9's `-Wdeprecated-copy` complains about.
- Add missing move operations to `thrust::system::cuda::vector`.
- #1015: Check that the backend is CUDA before using CUDA-specifics in `thrust::detail::temporary_allocator`. Thanks to Hugh Winkler for this contribution.
- #1055: More correctly detect the presence of aligned/sized `new`/`delete`.
- #1043: Fix ill-formed specialization of `thrust::system::is_error_code_enum` for `thrust::event_errc`. Thanks to Toru Niina for this contribution.
- #1027: Add tests for `thrust::tuple_for_each` and `thrust::tuple_subset`. Thanks to Ben Jude for this contribution.
- #1027: Use correct macro in `thrust::tuple_for_each`. Thanks to Ben Jude for this contribution.
- #1026: Use correct MSVC version formatting in CMake. Thanks to Ben Jude for this contribution.
- Work around an NVCC issue with type aliases with template template arguments containing a parameter pack.
- Remove unused functions from the CUDA backend which call slow CUDA attribute query APIs.
- Replace `CUB_RUNTIME_FUNCTION` with `THRUST_RUNTIME_FUNCTION`.
- Correct typo in `thrust::transform` documentation. Thanks to Eden Yefet for this contribution.
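The kind of bug fixed in #1046 is easy to picture in plain C++. An expression statement such as `std::bad_alloc();` only constructs a temporary exception object and destroys it. Without `throw`, control continues as if nothing had failed. The sketch below is an illustration of that pattern, not the actual Thrust source: `limited_malloc`, `alloc_buggy`, and `alloc_fixed` are hypothetical names, and `std::bad_alloc` stands in for `thrust::bad_alloc`.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical allocator used only for illustration: it refuses requests
// above a small limit so that the failure path is easy to trigger.
inline void* limited_malloc(std::size_t bytes)
{
  const std::size_t limit = 1024;
  return bytes > limit ? nullptr : std::malloc(bytes);
}

// Buggy pattern: the exception object is constructed and discarded, so the
// caller receives a null pointer instead of an exception.
inline void* alloc_buggy(std::size_t bytes)
{
  void* p = limited_malloc(bytes);
  if (p == nullptr)
  {
    std::bad_alloc(); // temporary is created and destroyed; nothing is thrown
  }
  return p;
}

// Fixed pattern: the failure is reported by actually throwing.
inline void* alloc_fixed(std::size_t bytes)
{
  void* p = limited_malloc(bytes);
  if (p == nullptr)
  {
    throw std::bad_alloc();
  }
  return p;
}
```

With the buggy variant, a failed allocation surfaces later as a null-pointer dereference far from the real cause. With the fixed variant, callers can catch the exception at the point of failure.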
Known Issues
- `thrust::sort` remains limited to `2^31-1` elements for now.