-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finish removing the BigInts from * for FD{Int128}! #94
base: master
Are you sure you want to change the base?
Conversation
We do not here explicitly introduce support for FD{BitIntegers.Int256}, though that should work out of the box both before and after this PR. Rather, this PR _uses_ a (U)Int256 under the hood to prevent allocations from Int128 widening to BigInt in FD operations.
Finally implements the fast-multiplication optimization from #45, but this time for 128-bit FixedDecimals! :) This is a follow-up to #93, which introduces an Int256 type for widemul. However, the fldmod still required 2 BigInt allocations. Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int256, which briefly widens to Int512 (😅) to perform the fldmod by the constant 10^f coefficient. This brings 128-bit FD multiply to the same performance as 64-bit. :)
Huzzah! I finally feel like we can bring the ideas from #45 to this package, and it's still valuable. It turns out that LLVM has already applied this optimization automatically for Int128 div by const (thus * for See here, for my note on FixedDecimal{Int64} support: |
Now that there's julia's effects system, this doesn't need |
82eb32d
to
f2958ba
Compare
Aha, the native code difference comes from the whole benchmark function. Before, the multiply was outlined, and called through a function call. julia> function bench(fd)
for _ in 1:10000
fd = fd * fd
end
fd
end
bench (generic function with 1 method)
# Before:
julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
┌ Warning: /Users/nathandaly/.julia/dev/FixedPointDecimals/src/fldmod-by-const.jl no longer exists, deleted all methods
└ @ Revise ~/.julia/packages/Revise/bAgL0/src/packagedef.jl:666
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1163 ; -- Begin function julia_bench_1163
.p2align 2
_julia_bench_1163: ; @julia_bench_1163
; %bb.0: ; %guard_exit4
sub sp, sp, #48
stp x20, x19, [sp, #16] ; 16-byte Folded Spill
stp x29, x30, [sp, #32] ; 16-byte Folded Spill
ldr w0, [x0]
mov w19, #10000
Lloh0:
adrp x20, "_j_*_1165"@GOTPAGE
Lloh1:
ldr x20, [x20, "_j_*_1165"@GOTPAGEOFF]
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
str w0, [sp, #12]
add x0, sp, #12
add x1, sp, #12
blr x20
subs x19, x19, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit16
ldp x29, x30, [sp, #32] ; 16-byte Folded Reload
ldp x20, x19, [sp, #16] ; 16-byte Folded Reload
add sp, sp, #48
ret
.loh AdrpLdrGot Lloh0, Lloh1
; -- End function
.subsections_via_symbols After: julia> @code_native debuginfo=:none bench(FixedDecimal{UInt32,3}(1.234))
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_bench_1161 ; -- Begin function julia_bench_1161
.p2align 2
_julia_bench_1161: ; @julia_bench_1161
; %bb.0: ; %guard_exit7
ldr w0, [x0]
mov w8, #10000
mov x9, #57148
movk x9, #36175, lsl #16
movk x9, #28311, lsl #32
movk x9, #33554, lsl #48
mov x10, #-1000
LBB0_1: ; %L2
; =>This Inner Loop Header: Depth=1
umull x11, w0, w0
umulh x11, x11, x9
lsr x12, x11, #9
mul x13, x12, x10
umaddl x13, w0, w0, x13
ubfx x11, x11, #9, #1
cmp x13, #500
cset w13, hi
csel w11, w11, w13, eq
add w0, w11, w12
subs x8, x8, #1
b.ne LBB0_1
; %bb.2: ; %guard_exit19
ret
; -- End function
.subsections_via_symbols EDIT: And even with a runtime variable for the loop count, so it can't optimize away the loop, it's still just as much faster in the new code: julia> @noinline function bench(fd, N)
for _ in 1:N
fd = fd * fd
end
fd
end
bench (generic function with 2 methods)
julia> @btime bench($(FixedDecimal{UInt32,3}(1.234)), $10000)
48.416 μs (0 allocations: 0 bytes)
FixedDecimal{UInt32,3}(4191932.283) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did another surface level review. I can do a full review if no one else will. I will need to block off some time though as there's a bit going on in this PR.
Co-authored-by: Curtis Vogt <[email protected]>
I think Tomas should be able to review when he's back from vacation. He's out til thursday. :) |
…change Tests all FD{(U)Int16} values. Tests most corner cases for FD{(U)Int128} values.
44686f3
to
ac3302e
Compare
We must use the precise value of 2^nbits(T) in order to get the correct division in all cases. ....... UGH except now the Int64 tests aren't passing.
ac3302e
to
e4cb73b
Compare
LLVM _does_ do this automatically if you directly call `fldmod`. Improve the comments as well.
… not fld(x,y),mod(x,y), even if y is a constant! :) Improved that for julia versions 1.8 - 1.9
Okay, after reviewing and integrating all of Tomas' changes, and cleaning things up, i think this is a great improvement and is ready to go! :) Tomas and I are going to look through it one more time, but otherwise I think this is fully ready for review. Thanks! |
I also updated the PR comment with the final perf numbers, which look 👌 |
By the way, |
Oh wow, very nice! Thanks @EdsterG. 💪 This should still help quite a bit, but the baseline for comparison would be better in 1.11, then! 💪 |
Finally implements the fast-multiplication optimization from #45! :)
This PR optimizes multiplication for
FixedDecimal{Int64}
andFixedDecimal{Int128}
. In the process, we also undid an earlier optimization which is no longer needed after julia 1.8, and that makes multiplication about 2x fast for the smaller int types as well! 🎉This is a follow-up to
#93, which introduces an Int256 type for widemul. However after that PR, the fldmod still required 2 BigInt allocations.
Now, this PR uses a custom implementation of the LLVM div-by-const optimization for (U)Int128 and for (U)Int256, which briefly widens to Int512 (😅), to perform the fldmod by the constant 10^f coefficient.
After this PR, FD multiply performance scales linear with the number of bits. FD{Int128} has no allocations, and is only 2x slower than 64-bit. :) And it makes all other multiplications ~2x faster.
master:
After PR #93:
After this PR: