Thrust 1.7.0 (CUDA Toolkit 5.5)
Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute, as well as details such as the allocation of temporary storage. Key/value versions of `thrust::merge` and the set operation algorithms have been added, as well as stencil versions of the partitioning algorithms. `thrust::tabulate` has been introduced to tabulate the values of functions taking integers. For 32-bit types, new CUDA merge and set operations provide 2-15x faster performance, while a new CUDA comparison sort provides 1.3-4x faster performance. Finally, a new TBB `reduce_by_key` implementation provides 80% faster performance.
Breaking API Changes
- Dispatch
  - Custom user backend systems' tag types must now inherit from the corresponding system's `execution_policy` template (e.g. `thrust::cuda::execution_policy`) instead of the tag `struct` (e.g. `thrust::cuda::tag`). Otherwise, algorithm specializations will silently go unfound during dispatch. See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples.
  - `thrust::advance` and `thrust::distance` are no longer dispatched based on iterator system type and thus may no longer be customized.
- Iterators
  - `iterator_facade` and `iterator_adaptor`'s `Pointer` template parameters have been eliminated.
  - `iterator_adaptor` has been moved into the `thrust` namespace (previously `thrust::experimental::iterator_adaptor`).
  - `iterator_facade` has been moved into the `thrust` namespace (previously `thrust::experimental::iterator_facade`).
  - `iterator_core_access` has been moved into the `thrust` namespace (previously `thrust::experimental::iterator_core_access`).
  - All iterators' nested `pointer` typedef (the type of the result of `operator->`) is now `void` instead of a pointer type to indicate that such expressions are currently impossible.
  - Floating point `counting_iterator`s' nested `difference_type` typedef is now a signed integral type instead of a floating point type.
- Other
  - `normal_distribution` has been moved into the `thrust::random` namespace (previously `thrust::random::experimental::normal_distribution`).
  - Placeholder expressions may no longer include the comma operator.
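The new dispatch requirement can be sketched in the spirit of examples/minimal_custom_backend.cu; the names below (`my_system`, the fallback behavior) are illustrative, and the exact customization signatures should be checked against the shipped examples:

```cuda
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>

// The custom backend's tag now derives from the corresponding system's
// execution_policy template rather than from its tag struct.
struct my_system : thrust::cuda::execution_policy<my_system> {};

// Algorithm specializations are found via argument-dependent lookup
// on the execution policy type passed as the first parameter.
template<typename Iterator, typename Function>
Iterator for_each(my_system, Iterator first, Iterator last, Function f)
{
    // Custom behavior would go here; this sketch simply falls back
    // to the stock CUDA system.
    return thrust::for_each(thrust::cuda::par, first, last, f);
}

// usage:
//   my_system sys;
//   thrust::for_each(sys, vec.begin(), vec.end(), f);
```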
New Features
- Execution Policies
  - Users may directly control the dispatch of algorithm invocations with optional execution policy arguments. For example, instead of wrapping raw pointers allocated by `cudaMalloc` with `thrust::device_ptr`, the `thrust::device` execution policy may be passed as an argument to an algorithm invocation to enable CUDA execution.
  - The following execution policies are supported in this version:
    - `thrust::host`
    - `thrust::device`
    - `thrust::cpp::par`
    - `thrust::cuda::par`
    - `thrust::omp::par`
    - `thrust::tbb::par`
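A minimal sketch of the new interface, assuming a CUDA-capable build (the buffer size and contents are illustrative):

```cuda
#include <thrust/execution_policy.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 1024;
    int* raw_ptr = 0;
    cudaMalloc(&raw_ptr, n * sizeof(int));

    // Previously, raw_ptr had to be wrapped with thrust::device_ptr<int>
    // so that dispatch would select the CUDA backend. With the
    // thrust::device execution policy, raw pointers may be passed directly:
    thrust::sort(thrust::device, raw_ptr, raw_ptr + n);

    cudaFree(raw_ptr);
    return 0;
}
```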
- Algorithms
  - `free`
  - `get_temporary_buffer`
  - `malloc`
  - `merge_by_key`
  - `partition` with stencil
  - `partition_copy` with stencil
  - `return_temporary_buffer`
  - `set_difference_by_key`
  - `set_intersection_by_key`
  - `set_symmetric_difference_by_key`
  - `set_union_by_key`
  - `stable_partition` with stencil
  - `stable_partition_copy` with stencil
  - `tabulate`
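Two of the additions, sketched on the host with illustrative data:

```cuda
#include <thrust/tabulate.h>
#include <thrust/merge.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>

int main()
{
    // tabulate fills a range with f(0), f(1), f(2), ...
    thrust::host_vector<int> v(5);
    thrust::tabulate(v.begin(), v.end(), thrust::negate<int>());
    // v is now {0, -1, -2, -3, -4}

    // merge_by_key merges two sorted key ranges and applies the same
    // permutation to the corresponding value ranges.
    int keys_a[] = {1, 3, 5};  int vals_a[] = {10, 30, 50};
    int keys_b[] = {2, 4, 6};  int vals_b[] = {20, 40, 60};
    int keys_out[6], vals_out[6];
    thrust::merge_by_key(keys_a, keys_a + 3, keys_b, keys_b + 3,
                         vals_a, vals_b, keys_out, vals_out);
    // keys_out: {1, 2, 3, 4, 5, 6}; vals_out: {10, 20, 30, 40, 50, 60}
    return 0;
}
```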
New Examples
- uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in `thrust::device_vector`.
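The idea behind that example can be sketched with an allocator whose `construct` is a no-op (details here are a simplification of the shipped example):

```cuda
#include <thrust/device_vector.h>
#include <thrust/device_malloc_allocator.h>

// An allocator which derives from the default device allocator but
// makes element construction a no-op, so that constructing a
// device_vector does not launch a kernel just to zero-fill the
// newly allocated storage.
template<typename T>
struct uninitialized_allocator : thrust::device_malloc_allocator<T>
{
    // leave memory uninitialized on construction
    __host__ __device__
    void construct(T*) {}
};

// usage:
//   thrust::device_vector<float, uninitialized_allocator<float> > vec(1 << 20);
```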
Other Enhancements
- Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their `execution_policy` parameter.
- Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as `thrust::device`.
- `THRUST_DEVICE_SYSTEM_CPP` has been added as a compile-time target for the device backend.
- CUDA `merge` performance is 2-15x faster.
- CUDA comparison sort performance is 1.3-4x faster.
- CUDA set operation performance is 1.5-15x faster.
- TBB `reduce_by_key` performance is 80% faster.
- Several algorithms have been parallelized with TBB.
- Support for user allocators in vectors has been improved.
- The sparse_vector example is now implemented with `merge_by_key` instead of `sort_by_key`.
- Warnings have been eliminated in various contexts.
- Warnings about `__host__`- or `__device__`-only functions called from `__host__ __device__` functions have been eliminated in various contexts.
- Documentation about algorithm requirements has been improved.
- Simplified the minimal_custom_backend example.
- Simplified the cuda/custom_temporary_allocation example.
- Simplified the cuda/fallback_allocator example.
Bug Fixes
- #248 fix broken `counting_iterator<float>` behavior with OpenMP
- #231, #209 fix set operation failures with CUDA
- #187 fix incorrect occupancy calculation with CUDA
- #153 fix broken multi-GPU behavior with CUDA
- #142 eliminate warning produced by `thrust::random::taus88` and MSVC 2010
- #208 correctly initialize elements in temporary storage when necessary
- #16 fix compilation error when sorting bool with CUDA
- #10 fix ambiguous overloads of `reinterpret_tag`
Known Issues
- `g++` versions 4.3 and lower may fail to dispatch `thrust::get_temporary_buffer` correctly, causing infinite recursion in examples such as cuda/custom_temporary_allocation.
Acknowledgments
- Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
- Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
- Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.