Releases: JuliaGPU/CUDA.jl

v5.3.4

15 May 19:28
c373258

CUDA v5.3.4

Diff since v5.3.3

Closed issues:

  • Native Softmax (#175)
  • CUSOLVER: support eigendecomposition (#173)
  • backslash with gpu matrices crashes julia (#161)
  • at-benchmark captures GPU arrays (#156)
  • Support kernels returning Union{} (#62)
  • mul! falls back to generic implementation (#148)
  • \ on qr factorization objects gives a method error (#138)
  • Compiler failure if dependent module only contains a japi1 function (#49)
  • copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
  • Calling Flux.gpu on a view dumps core (#125)
  • Creating CuArray{Tracker.TrackedReal{Float64},1} a few times causes segfaults (#121)
  • Guard against exceeding maximum kernel parameter size (#32)
  • Detect common API misuse in error handlers (#31)
  • rand and friends default to Float64 (#108)
  • \ does not work for least squares (#104)
  • ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
  • CuIterator assumes batches to consist of multiple arrays (#86)
  • Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
  • Document (un)supported language features for kernel programming (#13)
  • Missing dispatch for indexing of reshaped arrays (#556)
  • Track array ownership to avoid illegal memory accesses (#763)
  • NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
  • Support for sm_80 cp.async: asynchronous on-device copies (#850)
  • Profiling Julia with Nsight Systems on Windows results in blank window (#862)
  • sort! and partialsort! are considerably slower than CPU versions (#937)
  • mul! does not dispatch on Adjoint (#1363)
  • Cross-device copy of wrapped arrays fails (#1377)
  • Memory allocation becomes very slow when reserved bytes is large (#1540)
  • Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
  • Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
  • device_reset! does not seem to work anymore (#1579)
  • device-side rand() are not random between successive kernel launches (#1633)
  • Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
  • cusparseSetStream_v2 not defined (#1820)
  • Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
  • KernelAbstractions.jl-related issues (#1838)
  • lock failing in multithreaded plan_fft() (#1921)
  • CUSolver finalizer tries to take ReentrantLock (#1923)
  • Testsuite could be more careful about parallel testing (#2192)
  • Opportunistic GC collection (#2303)
  • Unable to use local CUDA runtime toolkit (#2367)
  • Enzyme prevents testing on 1.11 (#2376)

v5.3.3

27 Apr 10:11

CUDA v5.3.3

Diff since v5.3.2

Closed issues:

  • Excessive allocations when running on multiple threads (#1429)
  • Fix and test multigpu support (#2218)
  • Bitonic sort exceeds launch resources (#2331)

v5.3.2

26 Apr 13:59

CUDA v5.3.2

Diff since v5.3.1

Closed issues:

  • CuArrays don't seem to display correctly in VS code (#875)
  • Task scheduling can result in delays when synchronizing (#1525)
  • Docs: add example on task-based parallelism with explicit synchronization (#1566)
  • Exception output from many threads is not helpful (#1780)
  • Autodetect external profiler (#2176)
  • LazyInitialized is not GC-safe (#2216)
  • Track CuArray stream usage (#2236)
  • Improve cross-device usage (#2323)
  • CUBLASLt wrapper for cublasLtMatmulDescSetAttribute can have device buffers as input (#2337)
  • Improve error message when assigning real valued array with complex numbers (#2341)
  • @device_code_sass broken (#2343)
  • Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345)
  • @gcsafe_ccall breaks inlining of ccall wrappers (#2347)

v5.3.1

19 Apr 07:16
9c9a05f

CUDA v5.3.1

Diff since v5.3.0

Closed issues:

  • Missing CUBLASLt wrappers (#2322)
  • error when switching device (#2323)
  • v5.3.0: regression in Zygote performance (#2333)

v5.3.0

12 Apr 14:27
5da4d1d

CUDA v5.3.0

Diff since v5.2.0

Closed issues:

  • Failed to compile PTX code when using NSight on Win11 (#1601)
  • sortperm fails with dims keyword (#2061)
  • NVTX-related segfault on Windows under compute-sanitizer (#2204)
  • Inverse Complex-to-Real FFT allocates GPU memory (#2249)
  • cuDNN not available for your platform (#2252)
  • Cannot reset CuArray to zero (#2257)
  • Cannot take gradient of sort on 2D CuArray (#2259)
  • Multi-threaded code hanging forever with Julia 1.10 (#2261)
  • CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
  • Adjoint not supported on Diagonal arrays (#2275)
  • Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
  • Release v5.3? (#2283)
  • Wrap CUDSS? (#2287)
  • Bug concerning broadcast between device array and unified array (#2289)
  • StackOverflowError trying to throw OutOfGPUMemoryError, subsequent errors (#2292)
  • BUG: sortperm! seems to perform much slower than it should (#2293)
  • Multiplying CuSparseMatrixCSC by CuMatrix results in Out of GPU memory (#2296)
  • BFloat16 support broken on Julia 1.11 (#2306)
  • does not emit line info for debugging/profiling (#2312)
  • Kernel using StaticArray compiles in julia v1.9.4 but not in v1.10.2 (#2313)
  • Using copyto! with SharedArray trigger scalar indexing disallowed error (#2317)

v4.4.2

04 Apr 09:27

CUDA v4.4.2

Diff since v4.4.1

Closed issues:

  • Element-wise conversion to Duals (#127)
  • IDEA: CuHostArray (#28)
  • Make Ref pass by-reference (#267)
  • Failed to compile PTX code when using NSight on Win11 (#1601)
  • view(data, idx) boundschecking is disproportionately expensive (#1678)
  • [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
  • Trouble using nsight systems for profiling CUDA in Julia (#1779)
  • dlopen("libcudart") results in duplicate libraries (#1814)
  • Support for JLD2 (#1833)
  • Windows Defender mis-labels artifacts as threat (#1836)
  • Support Cholesky factorization of CuSparseMatrixCSR (#1855)
  • Runtime not re-selected after driver upgrade (#1877)
  • Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
  • Cannot precompile GPU code with PrecompileTools (#2006)
  • Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
  • CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
  • StaticArrays.SHermitianCompact not working in kernels in Julia 1.10.0-beta2 (#2069)
  • Support for LinearAlgebra.pinv (#2070)
  • PTX ISA 8.1 support (#2080)
  • Segmentation fault when importing CUDA (#2083)
  • "No system CUDA driver found" on NixOS (#2089)
  • CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
  • Miss...

v5.2.0

18 Jan 10:44
5876e9d

CUDA v5.2.0

Diff since v5.1.2

Closed issues:

  • Trouble using nsight systems for profiling CUDA in Julia (#1779)
  • Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
  • Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
  • First test for Julia/CUDA with 15 failures (#2158)
  • Update to CUTENSOR 2.0 (#2174)
  • Tests fail for CUDA#master (#2223)
  • Test failures on Nvidia GH200 (#2227)
  • mul! should support strided outputs (#2230)
  • Please add support for older cuda versions (cuda 8 and older) (#2231)
  • NSight Compute: prevent API calls during precompilation (#2233)
  • Integrated profiler: detect lack of permissions (#2237)

v5.1.2

07 Jan 10:34
fc99b1d

CUDA v5.1.2

Diff since v5.1.1

Closed issues:

  • More informative errors when parameter size is too big (#2119)
  • Modifying struct containing CuArray fails in threads in 5.0.0 and 5.1.0 (#2171)
  • Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
  • Support for combining duplicate elements in sparse matrices (#2185)
  • Interactive sessions: periodically trim the memory pool (#2190)
  • Broadcast does not preserve buffer type (#2191)
  • CUDA doesn't precompile on Julia nightly/1.11 (#2195)
  • Latest julia: UndefVarError: make_seed not defined in Random (#2198)
  • CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
  • Most recent package versions not supported on CUDA.jl (#2212)
  • Testing of CUDA fails (#2222)
  • --debug-info=2 makes NNlibCUDACUDNNExt precompilation run forever (#2225)

v5.1.1

20 Nov 11:38
ffcd7e3

CUDA v5.1.1

Diff since v5.1.0

Closed issues:

  • High CPU load during GPU synchronization (#2161)

v5.1.0

07 Nov 15:10

CUDA v5.1.0

CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified memory, which lets the CPU access GPU memory and vice versa, and cooperative groups, which offer a more modular approach to kernel programming. For more details, see the blog post.

Diff since v5.0.0

Closed issues:

  • Element-wise conversion to Duals (#127)
  • IDEA: CuHostArray (#28)
  • Make Ref pass by-reference (#267)
  • view(data, idx) boundschecking is disproportionately expensive (#1678)
  • [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
  • dlopen("libcudart") results in duplicate libraries (#1814)
  • Support for JLD2 (#1833)
  • Windows Defender mis-labels artifacts as threat (#1836)
  • Support Cholesky factorization of CuSparseMatrixCSR (#1855)
  • Runtime not re-selected after driver upgrade (#1877)
  • Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
  • Cannot precompile GPU code with PrecompileTools (#2006)
  • CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
  • PTX ISA 8.1 support (#2080)
  • Segmentation fault when importing CUDA (#2083)
  • "No system CUDA driver found" on NixOS (#2089)
  • CUDA.rand(Int64, m, n) can not be used when m or n is zero (#2093)
  • Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
  • Binaries for Jetson (#2105)
  • Minimum/maximum of array of NaNs is infinity (#2111)
  • Performance regression for multiple @sync copyto! on CUDA v5 (#2112)
  • [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
  • Unable to allocate unified memory buffers (#2120)
  • CUDA 12.3 has been released (#2122)
  • atomic min, max for Float32 and Float64 (#2129)
  • Native profiler output is limited to around 100 columns when printing to a file (#2130)
  • LLVM generates max.NaN which only works on sm_80 (#2148)
  • Unified memory-related error on Tegra T194 (#2149)
  • Errors on sm_61 (#2150)