v5.3.4
CUDA v5.3.4
Merged pull requests:
- Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
- Handle cache improvements (#2352) (@maleadt)
- Fix cuTensorNet compat (#2354) (@maleadt)
- Optimize array allocation. (#2355) (@maleadt)
- Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
- Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
- Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik)
- Make generic_trimatmul more specific (#2359) (@tgymnich)
- Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
- Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
- Enzyme: Forward mode sync (#2369) (@wsmoses)
- Enzyme: support fill (#2371) (@wsmoses)
- unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
- Remove external_gvars. (#2373) (@maleadt)
- Tegra support with artifacts (#2374) (@maleadt)
- Backport Enzyme extension (#2375) (@wsmoses)
- Add note about --check-bounds=yes (#2378) (@Zinoex)
- Test Enzyme in a separate CI job. (#2379) (@maleadt)
- Fix tests for Tegra. (#2381) (@maleadt)
- Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)
Closed issues:
- Native Softmax (#175)
- CUSOLVER: support eigendecomposition (#173)
- backslash with gpu matrices crashes julia (#161)
- at-benchmark captures GPU arrays (#156)
- Support kernels returning Union{} (#62)
- mul! falls back to generic implementation (#148)
- \ on qr factorization objects gives a method error (#138)
- Compiler failure if dependent module only contains a japi1 function (#49)
- copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
- Calling Flux.gpu on a view dumps core (#125)
- Creating CuArray{Tracker.TrackedReal{Float64},1} a few times causes segfaults (#121)
- Guard against exceeding maximum kernel parameter size (#32)
- Detect common API misuse in error handlers (#31)
- rand and friends default to Float64 (#108)
- \ does not work for least squares (#104)
- ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
- CuIterator assumes batches to consist of multiple arrays (#86)
- Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
- Document (un)supported language features for kernel programming (#13)
- Missing dispatch for indexing of reshaped arrays (#556)
- Track array ownership to avoid illegal memory accesses (#763)
- NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
- Support for sm_80 cp.async: asynchronous on-device copies (#850)
- Profiling Julia with Nsight Systems on Windows results in blank window (#862)
- sort! and partialsort! are considerably slower than CPU versions (#937)
- mul! does not dispatch on Adjoint (#1363)
- Cross-device copy of wrapped arrays fails (#1377)
- Memory allocation becomes very slow when reserved bytes is large (#1540)
- Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
- Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
- device_reset! does not seem to work anymore (#1579)
- device-side rand() are not random between successive kernel launches (#1633)
- Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
- cusparseSetStream_v2 not defined (#1820)
- Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
- KernelAbstractions.jl-related issues (#1838)
- lock failing in multithreaded plan_fft() (#1921)
- CUSolver finalizer tries to take ReentrantLock (#1923)
- Testsuite could be more careful about parallel testing (#2192)
- Opportunistic GC collection (#2303)
- Unable to use local CUDA runtime toolkit (#2367)
- Enzyme prevents testing on 1.11 (#2376)