Releases · JuliaGPU/CUDA.jl
v5.3.4
CUDA v5.3.4
Merged pull requests:
- Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
- Handle cache improvements (#2352) (@maleadt)
- Fix cuTensorNet compat (#2354) (@maleadt)
- Optimize array allocation. (#2355) (@maleadt)
- Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
- Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
- Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik)
- Make generic_trimatmul more specific (#2359) (@tgymnich)
- Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
- Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
- Enzyme: Forward mode sync (#2369) (@wsmoses)
- Enzyme: support fill (#2371) (@wsmoses)
- unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
- Remove external_gvars. (#2373) (@maleadt)
- Tegra support with artifacts (#2374) (@maleadt)
- Backport Enzyme extension (#2375) (@wsmoses)
- Add note about --check-bounds=yes (#2378) (@Zinoex)
- Test Enzyme in a separate CI job. (#2379) (@maleadt)
- Fix tests for Tegra. (#2381) (@maleadt)
- Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)
Closed issues:
- Native Softmax (#175)
- CUSOLVER: support eigendecomposition (#173)
- backslash with gpu matrices crashes julia (#161)
- at-benchmark captures GPU arrays (#156)
- Support kernels returning Union{} (#62)
- mul! falls back to generic implementation (#148)
- \ on qr factorization objects gives a method error (#138)
- Compiler failure if dependent module only contains a `japi1` function (#49)
- copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
- Calling Flux.gpu on a view dumps core (#125)
- Creating `CuArray{Tracker.TrackedReal{Float64},1}` a few times causes segfaults (#121)
- Guard against exceeding maximum kernel parameter size (#32)
- Detect common API misuse in error handlers (#31)
- `rand` and friends default to `Float64` (#108)
- \ does not work for least squares (#104)
- ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
- CuIterator assumes batches to consist of multiple arrays (#86)
- Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
- Document (un)supported language features for kernel programming (#13)
- Missing dispatch for indexing of reshaped arrays (#556)
- Track array ownership to avoid illegal memory accesses (#763)
- NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
- Support for sm_80 `cp.async`: asynchronous on-device copies (#850)
- Profiling Julia with Nsight Systems on Windows results in blank window (#862)
- sort! and partialsort! are considerably slower than CPU versions (#937)
- mul! does not dispatch on Adjoint (#1363)
- Cross-device copy of wrapped arrays fails (#1377)
- Memory allocation becomes very slow when reserved bytes is large (#1540)
- Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
- Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
- device_reset! does not seem to work anymore (#1579)
- device-side rand() are not random between successive kernel launches (#1633)
- Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
- `cusparseSetStream_v2` not defined (#1820)
- Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
- KernelAbstractions.jl-related issues (#1838)
- lock failing in multithreaded plan_fft() (#1921)
- CUSolver finalizer tries to take ReentrantLock (#1923)
- Testsuite could be more careful about parallel testing (#2192)
- Opportunistic GC collection (#2303)
- Unable to use local CUDA runtime toolkit (#2367)
- Enzyme prevents testing on 1.11 (#2376)
v5.3.3
CUDA v5.3.3
Merged pull requests:
- Rework context handling (#2346) (@maleadt)
- fix kernel launch logic (#2353) (@xaellison)
v5.3.2
CUDA v5.3.2
Merged pull requests:
- Add EnzymeCore extension for parent_job (#2281) (@vchuravy)
- Consider running GC when allocating and synchronizing (#2304) (@maleadt)
- Refactor memory wrappers (#2335) (@maleadt)
- Auto-detect external profilers. (#2339) (@maleadt)
- Fix performance of indexing unified memory. (#2340) (@maleadt)
- Improve exception output (#2342) (@maleadt)
- Test multigpu on CI (#2348) (@maleadt)
- cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt)
- cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)
Closed issues:
- CuArrays don't seem to display correctly in VS code (#875)
- Task scheduling can result in delays when synchronizing (#1525)
- Docs: add example on task-based parallelism with explicit synchronization (#1566)
- Exception output from many threads is not helpful (#1780)
- Autodetect external profiler (#2176)
- LazyInitialized is not GC-safe (#2216)
- Track CuArray stream usage (#2236)
- Improve cross-device usage (#2323)
- CUBLASLt wrapper for `cublasLtMatmulDescSetAttribute` can have device buffers as input (#2337)
- Improve error message when assigning real-valued array with complex numbers (#2341)
- `@device_code_sass` broken (#2343)
- Readme says CUDA 11 is supported but also that the last version to support it is v4.4 (#2345)
- `@gcsafe_ccall` breaks inlining of ccall wrappers (#2347)
v5.3.1
CUDA v5.3.1
Merged pull requests:
- [CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison)
- Regenerate headers (#2324) (@maleadt)
- Add some installation tips to docs/README.md (#2326) (@jlchan)
- fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3)
- Diagnose kernel limits on launch failure. (#2329) (@maleadt)
- Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)
v5.3.0
CUDA v5.3.0
Merged pull requests:
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added a few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Towards supporting Julia 1.11 (#2291) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- sortperm with dims (#2308) (@xaellison)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
- Avoid capturing `AbstractArray`s in `BoundsError` (#2314) (@lcw)
- Clarify debug level hint. (#2316) (@maleadt)
Closed issues:
- Failed to compile PTX code when using NSight on Win11 (#1601)
- `sortperm` fails with `dims` keyword (#2061)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of `sort` on 2D CuArray (#2259)
- Multi-threaded code hanging forever with Julia 1.10 (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- `StackOverflowError` trying to throw `OutOfGPUMemoryError`, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying `CuSparseMatrixCSC` by `CuMatrix` results in `Out of GPU memory` (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)
- does not emit line info for debugging/profiling (#2312)
- Kernel using `StaticArray` compiles in Julia v1.9.4 but not in v1.10.2 (#2313)
- Using copyto! with SharedArray triggers scalar indexing disallowed error (#2317)
v4.4.2
CUDA v4.4.2
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for `CUDA.@elapsed` (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added a few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- Failed to compile PTX code when using NSight on Win11 (#1601)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- `StaticArrays.SHermitianCompact` not working in kernels in Julia 1.10.0-beta2 (#2069)
- Support for LinearAlgebra.pinv (#2070)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- `CUDA.rand(Int64, m, n)` can not be used when `m` or `n` is zero (#2093)
- Miss...
v5.2.0
CUDA v5.2.0
Merged pull requests:
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
Closed issues:
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
- First test for Julia/CUDA with 15 failures (#2158)
- Update to CUTENSOR 2.0 (#2174)
- Tests fail for CUDA#master (#2223)
- Test failures on Nvidia GH200 (#2227)
- mul! should support strided outputs (#2230)
- Please add support for older cuda versions (cuda 8 and older) (#2231)
- NSight Compute: prevent API calls during precompilation (#2233)
- Integrated profiler: detect lack of permissions (#2237)
v5.1.2
CUDA v5.1.2
Merged pull requests:
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
Closed issues:
- More informative errors when parameter size is too big (#2119)
- Modifying `struct` containing `CuArray` fails in threads in 5.0.0 and 5.1.0 (#2171)
- Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
- Support for combining duplicate elements in sparse matrices (#2185)
- Interactive sessions: periodically trim the memory pool (#2190)
- Broadcast does not preserve buffer type (#2191)
- CUDA doesn't precompile on Julia nightly/1.11 (#2195)
- Latest julia: UndefVarError: `make_seed` not defined in `Random` (#2198)
- CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
- Most recent package versions not supported on CUDA.jl (#2212)
- Testing of CUDA fails (#2222)
- `--debug-info=2` makes `NNlibCUDACUDNNExt` precompilation run forever (#2225)
v5.1.1
CUDA v5.1.1
Merged pull requests:
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
Closed issues:
- High CPU load during GPU synchronization (#2161)
v5.1.0
CUDA v5.1.0
CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which offer a more modular approach to kernel programming. For more details, see the blog post.
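To make those two features concrete, here is a minimal sketch in Julia. It assumes the `unified=true` keyword of `cu` and the block-level cooperative-groups helpers (`CG.this_thread_block`, `CG.thread_rank`, `CG.sync`) as described for this release; exact names and signatures should be checked against the 5.1 documentation and blog post.

```julia
using CUDA
using CUDA: CG

# Unified memory: request a buffer accessible from both host and device.
# (The `unified=true` keyword of `cu` is assumed from the 5.1 release notes.)
a = cu(zeros(Float32, 256); unified=true)

# Cooperative groups: a block-level sketch using the reworked CG API.
# CG.this_thread_block, CG.thread_rank and CG.sync are assumed names here.
function fill_kernel!(a, val)
    block = CG.this_thread_block()
    i = CG.thread_rank(block)      # 1-based rank of this thread in its block
    if i <= length(a)
        @inbounds a[i] = val
    end
    CG.sync(block)                 # barrier across the thread block
    return
end

@cuda threads=length(a) fill_kernel!(a, 1f0)
synchronize()

# Copy back and check; unified buffers could also be read in place from the host.
@assert all(Array(a) .== 1f0)
```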
Merged pull requests:
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for `CUDA.@elapsed` (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- `CUDA.rand(Int64, m, n)` can not be used when `m` or `n` is zero (#2093)
- Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
- Binaries for Jetson (#2105)
- Minimum/maximum of array of NaNs is infinity (#2111)
- Performance regression for multiple `@sync` copyto! on CUDA v5 (#2112)
- [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
- Unable to allocate unified memory buffers (#2120)
- CUDA 12.3 has been released (#2122)
- atomic min, max for Float32 and Float64 (#2129)
- Native profiler output is limited to around 100 columns when printing to a file (#2130)
- LLVM generates max.NaN which only works on sm_80 (#2148)
- Unified memory-related error on Tegra T194 (#2149)
- Errors on sm_61 (#2150)