Releases · JuliaGPU/CUDA.jl

15 May 19:28

github-actions

v5.3.4

c373258

v5.3.4

CUDA v5.3.4

Diff since v5.3.3

Merged pull requests:

Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
Handle cache improvements (#2352) (@maleadt)
Fix cuTensorNet compat (#2354) (@maleadt)
Optimize array allocation. (#2355) (@maleadt)
Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik)
Make generic_trimatmul more specific (#2359) (@tgymnich)
Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
Enzyme: Forward mode sync (#2369) (@wsmoses)
Enzyme: support fill (#2371) (@wsmoses)
unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
Remove external_gvars. (#2373) (@maleadt)
Tegra support with artifacts (#2374) (@maleadt)
Backport Enzyme extension (#2375) (@wsmoses)
Add note about --check-bounds=yes (#2378) (@Zinoex)
Test Enzyme in a separate CI job. (#2379) (@maleadt)
Fix tests for Tegra. (#2381) (@maleadt)
Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)

Closed issues:

Native Softmax (#175)
CUSOLVER: support eigendecomposition (#173)
backslash with gpu matrices crashes julia (#161)
at-benchmark captures GPU arrays (#156)
Support kernels returning Union{} (#62)
mul! falls back to generic implementation (#148)
\ on qr factorization objects gives a method error (#138)
Compiler failure if dependent module only contains a japi1 function (#49)
copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
Calling Flux.gpu on a view dumps core (#125)
Creating CuArray{Tracker.TrackedReal{Float64},1} a few times causes segfaults (#121)
Guard against exceeding maximum kernel parameter size (#32)
Detect common API misuse in error handlers (#31)
rand and friends default to Float64 (#108)
\ does not work for least squares (#104)
ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
CuIterator assumes batches to consist of multiple arrays (#86)
Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
Document (un)supported language features for kernel programming (#13)
Missing dispatch for indexing of reshaped arrays (#556)
Track array ownership to avoid illegal memory accesses (#763)
NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
Support for sm_80 cp.async: asynchronous on-device copies (#850)
Profiling Julia with Nsight Systems on Windows results in blank window (#862)
sort! and partialsort! are considerably slower than CPU versions (#937)
mul! does not dispatch on Adjoint (#1363)
Cross-device copy of wrapped arrays fails (#1377)
Memory allocation becomes very slow when reserved bytes is large (#1540)
Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
device_reset! does not seem to work anymore (#1579)
device-side rand() are not random between successive kernel launches (#1633)
Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
cusparseSetStream_v2 not defined (#1820)
Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
KernelAbstractions.jl-related issues (#1838)
lock failing in multithreaded plan_fft() (#1921)
CUSolver finalizer tries to take ReentrantLock (#1923)
Testsuite could be more careful about parallel testing (#2192)
Opportunistic GC collection (#2303)
Unable to use local CUDA runtime toolkit (#2367)
Enzyme prevents testing on 1.11 (#2376)

Contributors

maleadt, wsmoses, and 5 other contributors


27 Apr 10:11

github-actions

v5.3.3

50137ae

v5.3.3

CUDA v5.3.3

Diff since v5.3.2

Merged pull requests:

Rework context handling (#2346) (@maleadt)
fix kernel launch logic (#2353) (@xaellison)

Closed issues:

Excessive allocations when running on multiple threads (#1429)
Fix and test multigpu support (#2218)
Bitonic sort exceeds launch resources (#2331)

Contributors

maleadt and xaellison


26 Apr 13:59

github-actions

v5.3.2

e2e7b57

v5.3.2

CUDA v5.3.2

Diff since v5.3.1

Merged pull requests:

Add EnzymeCore extension for parent_job (#2281) (@vchuravy)
Consider running GC when allocating and synchronizing (#2304) (@maleadt)
Refactor memory wrappers (#2335) (@maleadt)
Auto-detect external profilers. (#2339) (@maleadt)
Fix performance of indexing unified memory. (#2340) (@maleadt)
Improve exception output (#2342) (@maleadt)
Test multigpu on CI (#2348) (@maleadt)
cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt)
cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)

Closed issues:

CuArrays don't seem to display correctly in VS code (#875)
Task scheduling can result in delays when synchronizing (#1525)
Docs: add example on task-based parallelism with explicit synchronization (#1566)
Exception output from many threads is not helpful (#1780)
Autodetect external profiler (#2176)
LazyInitialized is not GC-safe (#2216)
Track CuArray stream usage (#2236)
Improve cross-device usage (#2323)
CUBLASLt wrapper for cublasLtMatmulDescSetAttribute can have device buffers as input (#2337)
Improve error message when assigning real valued array with complex numbers (#2341)
@device_code_sass broken (#2343)
Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345)
@gcsafe_ccall breaks inlining of ccall wrappers (#2347)

Contributors

vchuravy and maleadt


19 Apr 07:16

github-actions

v5.3.1

9c9a05f

v5.3.1

CUDA v5.3.1

Diff since v5.3.0

Merged pull requests:

[CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison)
Regenerate headers (#2324) (@maleadt)
Add some installation tips to docs/README.md (#2326) (@jlchan)
fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3)
Diagnose kernel limits on launch failure. (#2329) (@maleadt)
Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)

Closed issues:

Missing CUBLASLt wrappers (#2322)
error when switching device (#2323)
v5.3.0: regression in Zygote performance (#2333)

Contributors

maleadt, jlchan, and 2 other contributors


12 Apr 14:27

github-actions

v5.3.0

5da4d1d

v5.3.0

CUDA v5.3.0

Diff since v5.2.0

Merged pull requests:

CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
Slightly rework error handling (#2245) (@maleadt)
cuTENSOR improvements (#2246) (@maleadt)
Make @device_code_sass work with non-Julia kernels. (#2247) (@maleadt)
Improve Tegra detection. (#2251) (@maleadt)
Added few SparseArrays functions (#2254) (@albertomercurio)
Reduce locking in the handle cache (#2256) (@maleadt)
Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
Re-generate headers. (#2265) (@maleadt)
Update to CUDNN 9. (#2267) (@maleadt)
[CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
Add extension package for StaticArrays (#2273) (@trahflow)
Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
Cached workspace prototype for custatevec (#2279) (@kshyatt)
Update the Julia wrappers for v12.4 (#2282) (@amontoison)
Add support for CUDA 12.4. (#2286) (@maleadt)
Test suite changes (#2288) (@maleadt)
Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
Towards supporting Julia 1.11 (#2291) (@maleadt)
Fix typo in performance tips (#2294) (@Zentrik)
Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
Set default buffer size in CUSPARSE mm! functions (#2298) (@lpawela)
Avoid OOMs during OOM handling. (#2299) (@maleadt)
[CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
[CUSOLVER] Interface larft! (#2301) (@amontoison)
Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
sortperm with dims (#2308) (@xaellison)
[CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
[CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
Avoid capturing AbstractArrays in BoundsError (#2314) (@lcw)
Clarify debug level hint. (#2316) (@maleadt)

Closed issues:

Failed to compile PTX code when using NSight on Win11 (#1601)
sortperm fails with dims keyword (#2061)
NVTX-related segfault on Windows under compute-sanitizer (#2204)
Inverse Complex-to-Real FFT allocates GPU memory (#2249)
cuDNN not available for your platform (#2252)
Cannot reset CuArray to zero (#2257)
Cannot take gradient of sort on 2D CuArray (#2259)
Multi-threaded code hanging forever with Julia 1.10 (#2261)
CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
Adjoint not supported on Diagonal arrays (#2275)
Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
Release v5.3? (#2283)
Wrap CUDSS? (#2287)
Bug concerning broadcast between device array and unified array (#2289)
StackOverflowError trying to throw OutOfGPUMemoryError, subsequent errors (#2292)
BUG: sortperm! seems to perform much slower than it should (#2293)
Multiplying CuSparseMatrixCSC by CuMatrix results in Out of GPU memory (#2296)
BFloat16 support broken on Julia 1.11 (#2306)
does not emit line info for debugging/profiling (#2312)
Kernel using StaticArray compiles in julia v1.9.4 but not in v1.10.2 (#2313)
Using copyto! with SharedArray trigger scalar indexing disallowed error (#2317)

Contributors

lcw, vchuravy, and 11 other contributors


04 Apr 09:27

github-actions

v4.4.2

03c5f72

v4.4.2

CUDA v4.4.2

Diff since v4.4.1

Merged pull requests:

Added support for more transform directions (#1903) (@RainerHeintzmann)
CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
Add some performance tips to the documentation (#1999) (@Zentrik)
Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
Adapt to GPUCompiler#master. (#2062) (@maleadt)
Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
Use released GPUCompiler. (#2064) (@maleadt)
Fixes for Windows. (#2065) (@maleadt)
Switch to GPUArrays buffer management. (#2068) (@maleadt)
Update CUDA 12 to Update 2. (#2071) (@maleadt)
[CUSOLVER] Add generic routines (#2074) (@amontoison)
Update manifest (#2076) (@github-actions[bot])
Test improvements (#2079) (@maleadt)
Rework and extend the cooperative groups API. (#2081) (@maleadt)
Update manifest (#2082) (@github-actions[bot])
[CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
Fix some typos in performance tips (#2086) (@Zentrik)
Improve PTX ISA selection (#2088) (@maleadt)
Update manifest (#2090) (@github-actions[bot])
support ChainRulesCore inplaceability (#2091) (@piever)
Add a method inv(CuMatrix) (#2095) (@amontoison)
Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
Add a with_workspaces function (#2099) (@amontoison)
[CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
[CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
Call exit when handling exceptions. (#2103) (@maleadt)
Bump packages. (#2104) (@maleadt)
Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
Update manifest (#2107) (@github-actions[bot])
Make Ref mutable on the GPU. (#2109) (@maleadt)
CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
Small profiler improvements (#2113) (@maleadt)
Update manifest (#2114) (@github-actions[bot])
[CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
[CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert)
[CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
Update manifest (#2123) (@github-actions[bot])
Profiler: Show used local memory. (#2124) (@maleadt)
Support for CUDA 12.3 (#2125) (@maleadt)
[CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
[CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
[CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
Better support for unified and host memory (#2138) (@maleadt)
Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
Avoid allocations during derived array construction. (#2142) (@maleadt)
More performance tweaks for memory copying (#2143) (@maleadt)
Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
Update documentation (#2146) (@maleadt)
Fixes for sm_61 (#2151) (@maleadt)
Update sparse factorizations (#2152) (@amontoison)
Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
Sanitizer improvements. (#2157) (@maleadt)
[CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
[CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
[CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
expand docs on launch parameters (#2167) (@simonbyrne)
Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
[CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
Added kronecker product support for dense matrices (#2177) (@albertomercurio)
Update to CUTENSOR 2.0 (#2178) (@maleadt)
Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
provide more information on kernel compilation error (#2180) (@simonbyrne)
[CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
[CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
[CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
[CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
Support more kwarg syntax with kernel launches (#2189) (@maleadt)
Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
NVML: Add support for clock queries. (#2194) (@maleadt)
Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
Improvements to context handling (#2200) (@maleadt)
Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
Rework unique context management. (#2202) (@maleadt)
Preserve the buffer type when broadcasting. (#2203) (@maleadt)
Fixes for Windows (#2206) (@maleadt)
Bump Aqua. (#2207) (@maleadt)
Updates for new CUQUANTUM (#2210) (@kshyatt)
CUSPARSE: Eagerly combine duplicate element on construction. (#2213) (@maleadt)
CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
Default to testing with only a single device. (#2221) (@maleadt)
Backports for v5.1 (#2224) (@maleadt)
Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
cuTensor fixes (#2228) (@maleadt)
Bump versions. (#2229) (@maleadt)
Add a note about threaded for-blocks. (#2232) (@kshyatt)
cuTENSOR plan handling changes. (#2234) (@maleadt)
Fix dynamic dispatch issues (#2235) (@MilesCranmer)
CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
Fixes for nightly (#2240) (@maleadt)
CUBLAS: Support more strided inputs (#2242) (@maleadt)
CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
Slightly rework error handling (#2245) (@maleadt)
cuTENSOR improvements (#2246) (@maleadt)
Make @device_code_sass work with non-Julia kernels. (#2247) (@maleadt)
Improve Tegra detection. (#2251) (@maleadt)
Added few SparseArrays functions (#2254) (@albertomercurio)
Reduce locking in the handle cache (#2256) (@maleadt)
Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
Re-generate headers. (#2265) (@maleadt)
Update to CUDNN 9. (#2267) (@maleadt)
[CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
Add extension package for StaticArrays (#2273) (@trahflow)
Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
Cached workspace prototype for custatevec (#2279) (@kshyatt)
Update the Julia wrappers for v12.4 (#2282) (@amontoison)
Add support for CUDA 12.4. (#2286) (@maleadt)
Test suite changes (#2288) (@maleadt)
Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
Fix typo in performance tips (#2294) (@Zentrik)
Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
Set default buffer size in CUSPARSE mm! functions (#2298) (@lpawela)
Avoid OOMs during OOM handling. (#2299) (@maleadt)
[CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
[CUSOLVER] Interface larft! (#2301) (@amontoison)
Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
[CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
[CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)

Closed issues:

Element-wise conversion to Duals (#127)
IDEA: CuHostArray (#28)
Make Ref pass by-reference (#267)
Failed to compile PTX code when using NSight on Win11 (#1601)
view(data, idx) boundschecking is disproportionately expensive (#1678)
[CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
Trouble using nsight systems for profiling CUDA in Julia (#1779)
dlopen("libcudart") results in duplicate libraries (#1814)
Support for JLD2 (#1833)
Windows Defender mis-labels artifacts as threat (#1836)
Support Cholesky factorization of CuSparseMatrixCSR (#1855)
Runtime not re-selected after driver upgrade (#1877)
Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
Cannot precompile GPU code with PrecompileTools (#2006)
Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069)
Support for LinearAlgebra.pinv (#2070)
PTX ISA 8.1 support (#2080)
Segmentation fault when importing CUDA (#2083)
"No system CUDA driver found" on NixOS (#2089)
CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
Miss...

Contributors

sync, vchuravy, and 19 other contributors


18 Jan 10:44

github-actions

v5.2.0

5876e9d

v5.2.0

CUDA v5.2.0

Diff since v5.1.2

Merged pull requests:

CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
Update to CUTENSOR 2.0 (#2178) (@maleadt)
Updates for new CUQUANTUM (#2210) (@kshyatt)
Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
cuTensor fixes (#2228) (@maleadt)
Bump versions. (#2229) (@maleadt)
Add a note about threaded for-blocks. (#2232) (@kshyatt)
cuTENSOR plan handling changes. (#2234) (@maleadt)
Fix dynamic dispatch issues (#2235) (@MilesCranmer)
CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
Fixes for nightly (#2240) (@maleadt)
CUBLAS: Support more strided inputs (#2242) (@maleadt)

Closed issues:

Trouble using nsight systems for profiling CUDA in Julia (#1779)
Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
First test for Julia/CUDA with 15 failures (#2158)
Update to CUTENSOR 2.0 (#2174)
Tests fail for CUDA#master (#2223)
Test failures on Nvidia GH200 (#2227)
mul! should support strided outputs (#2230)
Please add support for older cuda versions (cuda 8 and older) (#2231)
NSight Compute: prevent API calls during precompilation (#2233)
Integrated profiler: detect lack of permissions (#2237)

Contributors

maleadt, kshyatt, and 2 other contributors


07 Jan 10:34

github-actions

v5.1.2

fc99b1d

v5.1.2

CUDA v5.1.2

Diff since v5.1.1

Merged pull requests:

kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
[CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
Added kronecker product support for dense matrices (#2177) (@albertomercurio)
Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
provide more information on kernel compilation error (#2180) (@simonbyrne)
[CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
[CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
[CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
[CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
Support more kwarg syntax with kernel launches (#2189) (@maleadt)
Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
NVML: Add support for clock queries. (#2194) (@maleadt)
Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
Improvements to context handling (#2200) (@maleadt)
Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
Rework unique context management. (#2202) (@maleadt)
Preserve the buffer type when broadcasting. (#2203) (@maleadt)
Fixes for Windows (#2206) (@maleadt)
Bump Aqua. (#2207) (@maleadt)
CUSPARSE: Eagerly combine duplicate element on construction. (#2213) (@maleadt)
CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
Default to testing with only a single device. (#2221) (@maleadt)
Backports for v5.1 (#2224) (@maleadt)

Closed issues:

More informative errors when parameter size is too big (#2119)
Modifying struct containing CuArray fails in threads in 5.0.0 and 5.1.0 (#2171)
Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
Support for combining duplicate elements in sparse matrices (#2185)
Interactive sessions: periodically trim the memory pool (#2190)
Broadcast does not preserve buffer type (#2191)
CUDA doesn't precompile on Julia nightly/1.11 (#2195)
Latest julia: UndefVarError: make_seed not defined in Random (#2198)
CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
Most recent package versions not supported on CUDA.jl (#2212)
Testing of CUDA fails (#2222)
--debug-info=2 makes NNlibCUDACUDNNExt precompilation run forever (#2225)

Contributors

maleadt, jcsahnwaldt, and 5 other contributors


20 Nov 11:38

github-actions

v5.1.1

ffcd7e3

v5.1.1

CUDA v5.1.1

Diff since v5.1.0

Merged pull requests:

Sanitizer improvements. (#2157) (@maleadt)
[CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
[CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
[CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
expand docs on launch parameters (#2167) (@simonbyrne)
Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)

Closed issues:

High CPU load during GPU synchronization (#2161)

Contributors

maleadt, simonbyrne, and amontoison


07 Nov 15:10

github-actions

v5.1.0

6daddc2

v5.1.0

CUDA v5.1.0

CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which offer a more modular approach to kernel programming. For more details, see the blog post.

Diff since v5.0.0

Merged pull requests:

[CUSOLVER] Add generic routines (#2074) (@amontoison)
Rework and extend the cooperative groups API. (#2081) (@maleadt)
[CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
Fix some typos in performance tips (#2086) (@Zentrik)
Improve PTX ISA selection (#2088) (@maleadt)
Update manifest (#2090) (@github-actions[bot])
support ChainRulesCore inplaceability (#2091) (@piever)
Add a method inv(CuMatrix) (#2095) (@amontoison)
Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
Add a with_workspaces function (#2099) (@amontoison)
[CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
[CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
Call exit when handling exceptions. (#2103) (@maleadt)
Bump packages. (#2104) (@maleadt)
Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
Update manifest (#2107) (@github-actions[bot])
Make Ref mutable on the GPU. (#2109) (@maleadt)
CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
Small profiler improvements (#2113) (@maleadt)
Update manifest (#2114) (@github-actions[bot])
[CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
[CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
Fix incorrect timing results for CUDA.@elapsed (#2118) (@thomasfaingnaert)
[CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
Update manifest (#2123) (@github-actions[bot])
Profiler: Show used local memory. (#2124) (@maleadt)
Support for CUDA 12.3 (#2125) (@maleadt)
[CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
[CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
[CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
Better support for unified and host memory (#2138) (@maleadt)
Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
Avoid allocations during derived array construction. (#2142) (@maleadt)
More performance tweaks for memory copying (#2143) (@maleadt)
Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
Update documentation (#2146) (@maleadt)
Fixes for sm_61 (#2151) (@maleadt)
Update sparse factorizations (#2152) (@amontoison)
Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)

Closed issues:

Element-wise conversion to Duals (#127)
IDEA: CuHostArray (#28)
Make Ref pass by-reference (#267)
view(data, idx) boundschecking is disproportionately expensive (#1678)
[CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
dlopen("libcudart") results in duplicate libraries (#1814)
Support for JLD2 (#1833)
Windows Defender mis-labels artifacts as threat (#1836)
Support Cholesky factorization of CuSparseMatrixCSR (#1855)
Runtime not re-selected after driver upgrade (#1877)
Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
Cannot precompile GPU code with PrecompileTools (#2006)
CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
PTX ISA 8.1 support (#2080)
Segmentation fault when importing CUDA (#2083)
"No system CUDA driver found" on NixOS (#2089)
CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
Binaries for Jetson (#2105)
Minimum/maximum of array of NaNs is infinity (#2111)
Performance regression for multiple @sync copyto! on CUDA v5 (#2112)
[CUBLAS] Regenerate the wrappers with updated argument types (#2115)
Unable to allocate unified memory buffers (#2120)
CUDA 12.3 has been released (#2122)
atomic min, max for Float32 and Float64 (#2129)
Native profiler output is limited to around 100 columns when printing to a file (#2130)
LLVM generates max.NaN which only works on sm_80 (#2148)
Unified memory-related error on Tegra T194 (#2149)
Errors on sm_61 (#2150)

Contributors

maleadt, piever, and 4 other contributors
