Releases · JuliaGPU/CUDA.jl
v5.0.0
CUDA v5.0.0
Blog post: https://info.juliahub.com/cuda-jl-5-0-changes
This is a breaking release, but the breaking changes are minimal (see the blog post for details):
- Julia 1.8 is now required, and only CUDA 11.4+ is supported
- selection of local toolkits has changed slightly (see the sketch below)
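As an illustration, here is a minimal sketch of the new local-toolkit selection, assuming the `local_toolkit` keyword to `CUDA.set_runtime_version!` described in the blog post:

```julia
# Sketch: opting into a locally installed CUDA toolkit with CUDA.jl 5.0.
# Hedged: check the `CUDA.set_runtime_version!` docstring on your version.
using CUDA

# Prefer the toolkit discovered on the local system over the
# artifact-provided one, pinned to a specific CUDA version:
CUDA.set_runtime_version!(v"12.2"; local_toolkit=true)
```

The choice is stored via Preferences (in `LocalPreferences.toml`), so it persists across sessions.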
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the `blocking` kwarg to `@sync` (#2060) (@maleadt) (sketched after this list)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
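A usage sketch of the re-introduced kwarg, assuming the `blocking` keyword syntax of `CUDA.@sync` (hedged; consult the docstring on your version):

```julia
using CUDA

a = CUDA.rand(Float32, 1 << 20)

# Wait for outstanding GPU work; with `blocking=true` the task spins on
# the stream instead of yielding to the Julia scheduler (lower latency,
# higher CPU usage).
b = CUDA.@sync blocking=true a .+ 1f0
```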
v4.4.1
CUDA v4.4.1
Closed issues:
- CUDA driver device support does not match toolkit (#70)
- Launching kernels should not allocate (#66)
- sync_threads() appears to not be sync'ing threads (#61)
- Exception when using CuArrays with Flux (#129)
- Kernel using MVector fails to compile or crashes at runtime due to heap allocation (#45)
- Performance regression on matrix multiplication between CUDA.jl 1.3.3 and 2.1.0/master (#538)
- Improve 'VS C++ redistributable' error message (#764)
- CUSPARSE does not support reductions (#1406)
- CUDA test failed (#1690)
- Type constructor in broadcast doesn't compile (#1761)
- accumulate(+) gives different results for CuArray compared to Array. (#1810)
- Compat driver: preload all libraries (#1859)
- Stream synchronization is slow when waiting on the event from CUDA (#1910)
- cuDNN: Store convolution algorithm choice to disk. (#1947)
- Disable 'No CUDA-capable device found' error log (#1955)
- CUDNN_STATUS_NOT_SUPPORTED using 1D CNN model (#1977)
- Memory allocations during in-place sparse matrix-vector multiplication (#1982)
- `CUSPARSE.sum_dim1` sums the absolute values of elements (#1983)
- Update to CUDA 12.2 (#1984)
- `unsafe_wrap` fails on zero element CuArrays (#1985)
- `rand` in kernel works in a deterministic way (#2008)
- Scalar indexing with `CuArray * ReshapedArray{SubArray{CuArray}}` (#2009)
- volumerhs performance regression (#2010)
- CuSparseMatrix constructors allocate too much memory? (#2015)
- Native profiler using CUPTI (#2017)
- libLLVM-15jl.so (#2018)
- "symbol multiply defined" error (#2021)
- Confusion on row major vs column major (#2023)
- Printing of CuArrays gives zeros or random numbers (#2033)
- `sortperm!` fails when output is `UInt` vector (#2046)
- Re-introduce spinning loop before nonblocking synchronization (#2057)
Merged pull requests:
- Check mathType only if not Float32 (#1943) (@RomeoV)
- 1.10 enablement (#1946) (@dkarrasch)
- Implement reverse lookup (Ptr->Tuple) for CUDNN descriptors. (#1948) (@RomeoV)
- Wrapper with tests for `gemmBatchedEx!` (#1975) (@lpawela)
- Add wrappers for `gemv_batched!` (#1981) (@lpawela)
- Update `CUSPARSE.sum_dim<n>` to allow for arbitrary function on elements (#1987) (@lpawela)
- Update manifest (#1988) (@github-actions[bot])
- Add vectorized cached loads (#1993) (@Zentrik)
- Update manifest (#1995) (@github-actions[bot])
- Fix typo in captured macro example (#1996) (@Zentrik)
- Adapt Type call broadcasting to a function (#2000) (@simonbyrne)
- [CUSPARSE] Added support for the generalized dot product dot(x, A, y) = dot(x, A * y) without allocating A * y (#2001) (@albertomercurio) (usage sketched after this list)
- Update manifest (#2002) (@github-actions[bot])
- Support for printing types. (#2003) (@maleadt)
- Fix accumulate bug (#2005) (@chrstphrbrns)
- Update manifest (#2013) (@github-actions[bot])
- Add a raw mode to code_sass. (#2019) (@maleadt)
- Update manifest (#2022) (@github-actions[bot])
- Add a native profiler. (#2024) (@maleadt) (example after this list)
- Perform synchronization on a worker thread (#2025) (@maleadt)
- Remove broken video link in docs (#2028) (@christiangnrd)
- When freeing memory, use the high-level device getter. (#2029) (@maleadt)
- Add support for @cuda fastmath (#2030) (@maleadt) (example after this list)
- Make "CUDA.jl" a link on the doc entry page (#2031) (@carstenbauer)
- Add support for CUDA 12.2. (#2034) (@maleadt)
- rand: seed kernels from the host. (#2035) (@maleadt)
- Update wrappers for CUDA 12.2. (#2039) (@maleadt)
- On CUDA 12.2, have the memory pool enforce hard memory limits. (#2040) (@maleadt)
- Delay all initialization errors until run time. (#2041) (@maleadt)
- JLL/CI/Julia changes. (#2042) (@maleadt)
- Add support for NVTX events to the integrated profiler. (#2043) (@maleadt)
- Update cuStateVec to cuQuantum 23.6. (#2044) (@maleadt)
- Add some more fastmath functions (#2047) (@Zentrik)
- Fixup wrong key lookup. (#2048) (@RomeoV)
- Update manifest (#2049) (@github-actions[bot])
- Make sortperm! resilient to type mismatches. (#2051) (@maleadt)
- Disable tests that cause GC corruption on 1.10. (#2053) (@maleadt)
- enable dependabot for GitHub actions (#2054) (@ranocha)
- Bump actions/checkout from 2 to 3 (#2055) (@dependabot[bot])
- Bump peter-evans/create-pull-request from 3 to 5 (#2056) (@dependabot[bot])
- Rework how local toolkits are selected. (#2058) (@maleadt)
- Busy-wait before doing nonblocking synchronization. (#2059) (@maleadt)
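Three short usage sketches for this release follow. First, the allocation-free generalized dot product from #2001, assuming the three-argument `LinearAlgebra.dot` method for CUSPARSE matrices:

```julia
using CUDA, CUDA.CUSPARSE, LinearAlgebra, SparseArrays

A = CuSparseMatrixCSR(sprand(Float32, 100, 100, 0.05))
x = CUDA.rand(Float32, 100)
y = CUDA.rand(Float32, 100)

# Computes dot(x, A * y) without materializing the intermediate A * y.
dot(x, A, y)
```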
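Next, the integrated profiler from #2017/#2024; a sketch assuming `CUDA.@profile` falls back to the built-in CUPTI-based tracer when no external profiler is attached:

```julia
using CUDA

a = CUDA.rand(Float32, 1024, 1024)

# Without an external tool (e.g. Nsight) attached, this collects and
# prints a host/device trace using the integrated profiler.
CUDA.@profile a * a
```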
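Finally, the `fastmath` launch option from #2030, assuming the keyword form accepted by `@cuda`:

```julia
using CUDA

function kernel!(y, x)
    i = threadIdx().x
    @inbounds y[i] = sin(x[i]) / sqrt(x[i])  # benefits from fast-math lowering
    return
end

x = CUDA.rand(Float32, 32) .+ 1f0
y = similar(x)

# fastmath=true relaxes IEEE semantics for faster device math.
@cuda threads=32 fastmath=true kernel!(y, x)
```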
v4.4.0
CUDA v4.4.0
Closed issues:
- Unreachable control flow leads to illegal divergent barriers (#1746)
- CUBLAS fails on new CUDA.jl v4 (#1852)
- Sort fails on Lovelace (sm8.9) GPUs (#1874)
- gesvd! crashes on Pascal and v12.0 (#1932)
- No effect for calling "nsys launch" (#1938)
- Basic math operations with nested adjoint and transpose (#1940)
- CPU and GPU implementations return results at dissimilar scales, even in double precision arithmetics (#1950)
- Failed CUDA.jl initialization breaks Flux? (#1952)
- Recent `mul!` changes break multiplication with matrices that have `StaticArray` elements (#1953)
- Test infrastructure: define test groups (#1961)
- Strange `rand` errors when sampling large matrices (#1963)
- Add aqua tests (#1964)
- Support of Orin GPU from Nvidia? (#1966)
- Crash in LLVM (#1971)
- Warning cuDNN Convolution (#1972)
- Strange behaviour when installed at system level (#1973)
Merged pull requests:
- Update benchmarks for 1.8 and 1.9 (#1933) (@maleadt)
- CUSOLVER: Explicitly pass NULL when not requesting svd outputs. (#1934) (@maleadt)
- Detect and complain about loading system libraries. (#1935) (@maleadt)
- Update manifest (#1936) (@github-actions[bot])
- Avoid stack overflow with early OOM reporting. (#1937) (@maleadt)
- [CUSPARSE] Improved support for UniformScaling and Diagonal (#1941) (@albertomercurio)
- Update manifest (#1949) (@github-actions[bot])
- Update GPUCompiler to fix unreachable control flow. (#1951) (@maleadt)
- Allow StaticArray eltype in matmat{vec,mul} (#1954) (@lcw) (sketched after this list)
- Bump CUDNN to v8.9. (#1959) (@maleadt)
- Bump CUTENSOR to v1.7. (#1960) (@maleadt)
- Add and fix some aqua tests (#1965) (@charleskawczynski)
- Fix compatibility of CUDA 11.4 to support Orin. (#1967) (@maleadt)
- Don't use Int32 indices in rand kernels. (#1969) (@maleadt)
- CI simplifications (#1970) (@maleadt)
- Use Base.pkgversion on 1.9. (#1974) (@maleadt)
- Update to LLVM.jl 6. (#1976) (@maleadt)
- fix launch config bug in bitonic sort (#1979) (@xaellison)
- Update manifest (#1980) (@github-actions[bot])
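A hedged sketch of the StaticArray-element support from #1953/#1954, assuming such eltypes route through GPUArrays' generic (non-BLAS) matmul:

```julia
using CUDA, StaticArrays, LinearAlgebra

# Matrices whose *elements* are static arrays take the generic
# matmul path rather than CUBLAS.
A = CuArray([@SMatrix(rand(Float32, 2, 2)) for _ in 1:4, _ in 1:4])
x = CuArray([@SVector(rand(Float32, 2)) for _ in 1:4])
y = A * x  # a 4-element CuArray with SVector{2,Float32} elements
```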
v4.3.2
v4.3.1
CUDA v4.3.1
Closed issues:
- Array testsuite compiles kernel with large types (#1902)
- CUDA.jl v4 installs CUDA runtime despite version=local (#1922)
- Occasional "CUSOLVERError: an internal operation failed (code 7, CUSOLVER_STATUS_INTERNAL_ERROR)" (#1924)
- Does [email protected] need [email protected]? (#1929)
v4.3.0
CUDA v4.3.0
Closed issues:
- Multidimensional `reverse` (#1126)
- Test errors on master (#1866)
- Integer overflow error with svd for large matrix (#1880)
- Erratic behaviour of `CUDA.jl` if used in the REPL of VSCode (#1892)
- QR decomposition requires scalar indexing (#1893)
- BSOD during package tests (#1898)
- Insufficient coverage of CuArrays in the documentation (#1901)
- Failed to compile with Julia v1.9 on PowerPC (#1911)
- CUDA test failed in wmma.jl (#1914)
- Fix deprecation warnings (#1920)
Merged pull requests:
- CUSOLVER: Fix workspace size passing. (#1890) (@maleadt)
- Lovelace fixes (#1894) (@maleadt)
- Update manifest (#1897) (@github-actions[bot])
- Reverse with multiple dimensions (#1899) (@RainerHeintzmann) (example after this list)
- Restrict number of test jobs based on available memory. (#1900) (@maleadt)
- Avoid unneeded macros to cut down on generated code (#1905) (@maleadt)
- Avoid unneeded macros to cut down on generated code (#1906) (@maleadt)
- Update manifest (#1907) (@github-actions[bot])
- Bump GPUCompiler. (#1908) (@maleadt)
- Don't use Float64 atomics on unsupported platforms. (#1912) (@maleadt)
- Report package versions as part of versioninfo(). (#1913) (@maleadt)
- Align variables in constant memory by 256 bit (#1915) (@Zentrik)
- Add norm functions for 3 floats (#1916) (@Zentrik)
- cuDNN: only choose conv algorithms if they match descriptor mathType (#1917) (@ToucheSir)
- Update manifest (#1918) (@github-actions[bot])
- Skip Integer WMMA tests on older devices. (#1919) (@maleadt)
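A quick illustration of multidimensional `reverse` (#1126, implemented in #1899), using the standard `Base.reverse` dims API:

```julia
using CUDA

A = CuArray(reshape(1f0:16f0, 4, 4))

reverse(A; dims=1)       # flip along one dimension, as before
reverse(A; dims=(1, 2))  # flip along several dimensions at once
```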
v4.2.0
CUDA v4.2.0
Closed issues:
- NVTX: consider using Start/End for ranges (#1485)
- Limitations of `CuIterator` (#1768)
- Testing fails on unsupported devices. (#1815)
- Local runtime discovery does not work for external libraries (CUDNN, CUTENSOR) (#1850)
- Passing tests using GitHub CI workflow errors with `libcuda not defined` (#1867)
- Cannot precompile GPU code with SnoopPrecompile (#1870)
- Incorrect kernel execution with bounds checking using Julia 1.9.0-rc2 (#1875)
- Fake CUDA library (#1879)
- Error thrown when launching Julia with Nsight systems or compute. (#1886)
- Cannot construct CuDeviceArray (#1887)
- Incorrect colVal array when using CuSparseMatrixCSR command on sparse matrix (#1888)
Merged pull requests:
- Use `adapt` symmetrically in `CuIterator` (#1769) (@mcabbott) (usage sketched after this list)
- Allow but warn when testing on not fully-supported devices. (#1818) (@maleadt)
- Support runtime discovery for non-toolkit libraries (CUTENSOR, CUDNN, CUQUANTUM) (#1858) (@mloubout)
- Add KernelAbstractions.jl unsafe_free! (#1863) (@pxl-th)
- Allow precompiling CUDA code. (#1865) (@maleadt)
- Assert CUDA.jl is functional when creating the TLS. (#1868) (@maleadt)
- Update manifest (#1871) (@github-actions[bot])
- Don't collect `AbstractQ` objects in tests (#1872) (@dkarrasch)
- Add compatibility entry for Lovelace (#1873) (@xaellison)
- remove some type-piracy from cusparse (#1876) (@vtjnash)
- Remove more unneeded ndims methods. (#1878) (@maleadt)
- Guard the initialization-time CUDA driver check in a try/catch. (#1881) (@maleadt)
- Update manifest (#1882) (@github-actions[bot])
- Update CUDA 12.1 to 12.1.1. (#1883) (@maleadt)
- Use atomics for allocation statistics. (#1884) (@maleadt)
- Fix atomic increment of alloc stats. (#1885) (@maleadt)
- Update manifest (#1889) (@github-actions[bot])
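A brief `CuIterator` usage sketch tied to #1768/#1769 (hedged; see the docstring for the authoritative contract):

```julia
using CUDA

batches = [rand(Float32, 128, 16) for _ in 1:4]

# CuIterator adapts each batch to the GPU on demand and eagerly frees
# it once the next batch is requested.
for x in CuIterator(batches)
    s = sum(x)  # x is a CuArray here
end
```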
v4.1.4
CUDA v4.1.4
Closed issues:
- Buggy precompilation of init-defined symbols can break CUDA_Driver_jll initialization (#1798)
- Calling CUDA.set_runtime_version!() with float parameter makes CUDA.jl unusable. (#1831)
- Unexpected memory allocation when using `randn!` (#1856)
- The memory copy speed seems to exceed the hardware limit (#1860)
- PCG produces different output on GPU (via Krylov.jl) (#1864)
v4.1.3
v4.1.2
CUDA v4.1.2
Closed issues:
- Flux's `gradient` differentiating `rfft` leads to non-bit error (#1835)
Merged pull requests:
- switch to using defined globals (#1832) (@simonbyrne)
- Update manifest (#1837) (@github-actions[bot])