This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Thrust 1.9.8 (CUDA Toolkit 11.0 Early Access)

@brycelelbach released this 16 May 08:27

Thrust 1.9.8, which is included in the CUDA Toolkit 11.0 release, removes Thrust's internal derivative of CUB, upstreams all relevant changes to CUB, and adds CUB as a Git submodule. It is now necessary to run git clone --recursive when checking out Thrust, and to update the CUB submodule when pulling in new Thrust changes. Additionally, CUB is now included as a first-class citizen in the CUDA Toolkit. Thrust 1.9.8 also fixes bugs that prevented most Thrust algorithms from working with more than 2^31-1 elements. Now, thrust::reduce, thrust::*_scan, and related algorithms (aka most of Thrust) work with large element counts.

Breaking Changes

  • Thrust will now use the version of CUB in your include path instead of its own internal copy. If you are using your own version of CUB, it may be older and incompatible with Thrust. It is recommended to simply delete your own version of CUB and use the version of CUB that comes with Thrust.

Other Enhancements

  • Refactor Thrust and CUB to support 64-bit indices in most algorithms. In most cases, Thrust now selects between kernels that use 32-bit indices and 64-bit indices at runtime depending on the size of the input. This means large element counts work, but small element counts do not have to pay for the register usage of 64-bit indices if they are not needed. Now, thrust::reduce, thrust::*_scan, and related algorithms (aka most of Thrust) work with more than 2^31-1 elements. Notably, thrust::sort is still limited to 2^31-1 elements.
  • CUB is now a submodule and the internal copy of CUB has been removed.
  • #1051: Stop specifying the __launch_bounds__ minimum blocks parameter because it messes up register allocation and increases register pressure, and we don't actually know at compile time how many blocks we will use (aside from single tile kernels).
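
As a rough illustration of the runtime index-width selection described in the first item above, here is a minimal host-only C++ sketch. It is not Thrust's or CUB's actual dispatch code: `needs_64bit_indices`, `sum_with_index`, and `dispatch_sum` are hypothetical names, and an ordinary loop stands in for a CUDA kernel. The only idea taken from the notes is that the same templated code is instantiated for both index widths and the narrower one is chosen whenever the input size fits.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when the element count does not fit in a
// signed 32-bit index, i.e. when it exceeds 2^31-1.
bool needs_64bit_indices(std::uint64_t n)
{
    return n > static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Hypothetical stand-in for a kernel body, templated on the index type.
// A real kernel would run on the GPU; this loop only shows the templating.
template <typename IndexT>
long long sum_with_index(const int* data, IndexT n)
{
    long long total = 0;
    for (IndexT i = 0; i < n; ++i)
    {
        total += data[i];
    }
    return total;
}

// Chooses the instantiation at runtime and reports the chosen width
// through index_bits_used (32 or 64).
long long dispatch_sum(const int* data, std::uint64_t n, int* index_bits_used)
{
    assert(data != nullptr || n == 0);
    if (needs_64bit_indices(n))
    {
        *index_bits_used = 64;
        return sum_with_index<std::int64_t>(data, static_cast<std::int64_t>(n));
    }
    *index_bits_used = 32;
    return sum_with_index<std::int32_t>(data, static_cast<std::int32_t>(n));
}
```

Calling `dispatch_sum` on a small array takes the 32-bit path, so small inputs avoid the cost of wider indices, while counts above 2^31-1 would fall through to the 64-bit instantiation.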

Bug Fixes

  • #1020: After making a CUDA API call, always clear the global CUDA error state by calling cudaGetLastError.
  • #1021: Avoid calling destroy in the destructor of a Thrust vector if the vector is empty.
  • #1046: Actually throw thrust::bad_alloc when thrust::system::cuda::malloc fails instead of just constructing a temporary and doing nothing with it.
  • Add missing copy constructor or copy assignment operator to all classes that GCC 9's -Wdeprecated-copy complains about.
  • Add missing move operations to thrust::system::cuda::vector.
  • #1015: Check that the backend is CUDA before using CUDA-specifics in thrust::detail::temporary_allocator. Thanks to Hugh Winkler for this contribution.
  • #1055: More correctly detect the presence of aligned/sized new/delete.
  • #1043: Fix ill-formed specialization of thrust::system::is_error_code_enum for thrust::event_errc. Thanks to Toru Niina for this contribution.
  • #1027: Add tests for thrust::tuple_for_each and thrust::tuple_subset. Thanks to Ben Jude for this contribution.
  • #1027: Use correct macro in thrust::tuple_for_each. Thanks to Ben Jude for this contribution.
  • #1026: Use correct MSVC version formatting in CMake. Thanks to Ben Jude for this contribution.
  • Work around an NVCC issue with type aliases with template template arguments containing a parameter pack.
  • Remove unused functions from the CUDA backend which call slow CUDA attribute query APIs.
  • Replace CUB_RUNTIME_FUNCTION with THRUST_RUNTIME_FUNCTION.
  • Correct typo in thrust::transform documentation. Thanks to Eden Yefet for this contribution.
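
The class of bug described in #1046 (constructing an exception as a temporary instead of throwing it) is easy to miss, so a small host-only C++ sketch may help. This is not the Thrust source: `fake_malloc`, `allocate_buggy`, and `allocate_fixed` are hypothetical names, `std::bad_alloc` stands in for `thrust::bad_alloc`, and `std::malloc` stands in for the device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical allocation hook that can be forced to fail, standing in
// for a device allocation that returns a null pointer on failure.
void* fake_malloc(std::size_t bytes, bool force_failure)
{
    return force_failure ? nullptr : std::malloc(bytes);
}

// Buggy pattern: the exception object is created and immediately
// destroyed, so the caller silently receives a null pointer.
void* allocate_buggy(std::size_t bytes, bool force_failure)
{
    void* p = fake_malloc(bytes, force_failure);
    if (p == nullptr)
    {
        std::bad_alloc(); // missing `throw`: this statement has no effect
    }
    return p;
}

// Fixed pattern: the failure is reported by actually throwing.
void* allocate_fixed(std::size_t bytes, bool force_failure)
{
    void* p = fake_malloc(bytes, force_failure);
    if (p == nullptr)
    {
        throw std::bad_alloc();
    }
    return p;
}
```

With a forced failure, `allocate_buggy` returns a null pointer without signalling anything, whereas `allocate_fixed` raises an exception the caller can catch.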

Known Issues

  • thrust::sort remains limited to 2^31-1 elements for now.