CUDA.jl v5.6.0
CUDA.jl v5.6 is a relatively minor release, with the most important change being behind the scenes: GPUArrays.jl v11 has switched to KernelAbstractions.jl (#2524).
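For readers unfamiliar with that foundation, the sketch below is purely illustrative (it is not part of the release, and the kernel name scale! and all sizes are made up): it shows a KernelAbstractions.jl kernel launched through CUDA.jl's CUDABackend, the programming model GPUArrays.jl v11 now builds on.

```julia
using CUDA, KernelAbstractions

# Illustrative KernelAbstractions.jl kernel: scale x by a into y.
@kernel function scale!(y, @Const(x), a)
    i = @index(Global)
    @inbounds y[i] = a * x[i]
end

x = CUDA.rand(Float32, 1024)
y = similar(x)

# Instantiate the kernel for the CUDA backend and launch it over the array.
scale!(CUDABackend())(y, x, 2f0; ndrange = length(x))
KernelAbstractions.synchronize(CUDABackend())
```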
Features
- Update to CUDA 12.6.2 (#2512)
- CUSOLVER: support for Xgeev! (#2513), XsyevBatched (#2577), and gesv! and gels! (#2406)
- CUBLAS: added multiplication of transpose / adjoint matrices by diagonal matrices (#2518, #2538); see the sketch after this list
- Improve handle cache performance in the presence of many short-lived tasks (#2583)
- CUFFT: Pre-allocate the buffer required for complex-to-real FFTs only once (#2578)
- Improved batched pointer conversion for very large batches (#2608)
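As a rough illustration of the CUBLAS addition above (not taken from the release notes; sizes and values are arbitrary), multiplying a transpose or adjoint matrix by a Diagonal now stays on the GPU:

```julia
using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 4, 4)
d = CUDA.rand(Float32, 4)

# Adjoint/transpose times Diagonal now uses a dedicated CUBLAS path
# (#2518, #2538) instead of falling back to scalar indexing.
B = A' * Diagonal(d)            # scales the columns of A'
C = Diagonal(d) * transpose(A)  # scales the rows of transpose(A)
```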
Bug fixes
- Fix findall with an empty CuArray (#2554)
- CUBLAS: Fix use of level 1 methods with strided arrays (#2528)
- CUSOLVER: Fix Xgesvdr! (#2556)
- Preserve the array buffer type with more linear algebra operations (#2534)
- Work around LinearAlgebra.jl breakage in Julia 1.11.2 concerning generic triangular l/rmul! (#2585)
- Fix ambiguity of LinearAlgebra.dot (#2569)
- Native RNG: Fixes when working with very large arrays (#2561)
- Avoid a deadlock due to union splitting in the mapreduce kernel (#2595)
- Fix pinning of resized CPU memory by automatically re-pinning (#2599); see the sketch after this list
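To illustrate the re-pinning fix, here is a hedged sketch of the pattern from #2594/#2599 (buffer sizes are arbitrary; it assumes the CUDA.pin host-registration API):

```julia
using CUDA

x = zeros(Float32, 1_000)
CUDA.pin(x)            # register the host buffer for fast async copies

resize!(x, 2_000)      # resizing may reallocate the underlying memory

# Previously this copy could error (#2594); CUDA.jl now detects the change
# and re-pins the resized buffer automatically (#2599).
d = CuArray{Float32}(undef, 2_000)
copyto!(d, x)
```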
Merged pull requests:
- [CUSOLVER] Interface gesv! and gels! (#2406) (@amontoison)
- Update wrappers for CUDA v12.6.2 (#2512) (@amontoison)
- [CUSOLVER] Interface Xgeev! (#2513) (@amontoison)
- Added multiplication of transpose / adjoint matrices by diagonal matrices (#2518) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 1, (keep existing compat) (#2521) (@github-actions[bot])
- Adapt to GPUArrays.jl transition to KernelAbstractions.jl. (#2524) (@maleadt)
- Switch CI to 1.11. (#2525) (@maleadt)
- CUTENSOR: Reduce amount of broadcasts compiled during tests. (#2527) (@maleadt)
- CUBLAS: Don't use BLAS1 wrappers for strided arrays, only vectors. (#2528) (@maleadt)
- Clarify the synchronize(ctx)/device_synchronize() docstrings (#2532) (@JamesWrigley)
- Issue #2533: Preserving the buffer type in linear algebra (#2534) (@kmp5VT)
- Clarify description of how LocalPreferences.toml is generated in the docs (#2535) (@glwagner)
- Adapt to JuliaGPU/GPUArrays.jl#567. (#2537) (@maleadt)
- Removed allocations for transpose/adjoint - diagonal multiplications (#2538) (@RedRussianBear)
- Consistent use of Nsight Compute (#2541) (@huiyuxie)
- Fix formatting in profiling docs page (#2543) (@efaulhaber)
- Fix typo in EnzymeCoreExt.jl (#2550) (@wsmoses)
- Enhance warning under a profiler (#2552) (@huiyuxie)
- Fix findall with an empty CuArray of Bool (#2554) (@amontoison)
- [CUSOLVER] Fix Xgesvdr! (#2556) (@amontoison)
- Test restore Enzyme.jl (#2557) (@wsmoses)
- Native RNG fixes for very large arrays (#2561) (@maleadt)
- [Enzyme] Mark launch_configuration as inactive (#2563) (@wsmoses)
- Update EnzymeCoreExt.jl (#2565) (@simenhu)
- Fix ambiguity of LinearAlgebra.dot (#2569) (@amontoison)
- [CUSOLVER] Add more tests for the dense SVD (#2574) (@amontoison)
- [CUSOLVER] Interface XsyevBatched (#2577) (@amontoison)
- [CUFFT] Preallocate a buffer for complex-to-real FFT (#2578) (@amontoison)
- Run the GC when failing to find a handle, but lots are active. (#2583) (@maleadt)
- Work around LinearAlgebra.jl breakage in 1.11.2. (#2585) (@maleadt)
- mapreduce: avoid deadlock by forcing the accumulator type. (#2596) (@maleadt)
- Switch to GitHub Actions-based benchmarks. (#2597) (@maleadt)
- Re-pin variable sized memory (#2599) (@jipolanco)
- Enzyme: add make_zero of cuarrays (#2600) (@wsmoses)
- Update cache.jl (#2604) (@jarbus)
- Enzyme: mark device_sync as non-differentiable [only downstream] (#2605) (@wsmoses)
- Move strided batch pointer conversion to GPU (#2608) (@THargreaves)
- Split linalg tests into multiple files (#2609) (@kshyatt)
Closed issues:
- Inference failure with sort(::CuMatrix) after loading MLDatasets (#2258)
- Kron Support for CuSparseMatrixCSC (#2370)
- Broadcasting a function returning an anonymous function with a constructor over CUDA arrays fails to compile, "not isbits" (#2514)
- CuArray view has different variable type outside x inside the cuda kernel (#2516)
- Can't build cuDNN on centos7.8 (#2517)
- Precompile errors (#2519)
- Precompile errors (#2520)
- Error returned from CUDA function in CUDA-aware MPI multi-GPU test (#2522)
- Broadcasting over random static array errors on Julia 1.11 (#2523)
- gemm_strided_batched only using strided CUDA kernel when first matrix is transposed (#2529)
- CUDA runtime libraries are loaded from a system path due to LD_LIBRARY_PATH being set (#2530)
- [Bug] UnifiedMemory buffer changes during LinearAlgebra operations (#2533)
- Improve system library warning when running under profiler (#2540)
- Local CUDA settings not propagated to Pkg.test (#2545)
- Out of Memory when working with Distributed for Small Matricies (#2548)
- findall is not working with an empty vector of bool (#2553)
- CUDA code does not return when running under VSC Debugging mode (#2558)
- dot is quite slow in multinest Arrays (#2559)
- UndefVarError: backend not defined in GPUArrays (#2564)
- view() returns CuArray instead of view for 1-D CuArrays (#2566)
- dot ambiguity (#2568)
- InvalidIRError thrown only if critical function is not previously compiled (#2573)
- circular dependency during precompilation (#2579)
- Sparse MatVec Is Nondeterministic? (#2582)
- CUDA triggers long Circular dependency list (#2586)
- Release v5.5.3 for GPUArray v11? (#2587)
- 'dot' gives different answers when viewing rather than slicing multidimensional arrays (#2589)
- Scalar indexing when performing kron on two CuVectors (#2591)
- Faster strided-batched to batched wrapper (#2592)
- Error when copying data to pinned and resized CPU array (#2594)
- mapreducedim! size-dependent fail when narrowing float element types (#2595)
- Missing Enzyme.make_zero in Enzyme extension leads to incorrect behaviour (#2598)
- 'ArgumentError: array must be non-empty' when attempting to pop idle handles from HandleCache (#2603)
- Do a release as current one doesn't support GPUArrays v11 (#2606)