JIT: Unblock Vector###<long> intrinsics on x86 #112728

saucecontrol · 2025-02-20T05:25:43Z

This resolves a large number of TODOs around HWIntrinsic expansion involving scalar longs on x86.

The most significant change here is promoting CreateScalar and ToScalar to code-generating intrinsics instead of converting them to other intrinsics at lowering. This was necessary in order to emit movq for scalar long loads/stores, but it also unlocks several other optimizations: CreateScalar and ToScalar can now be contained, and codegen can be specialized depending on whether they end up loading/storing from/to memory. Some example improvements on x64:

Vector128.CreateScalar(ref float):

-       vinsertps xmm0, xmm0, dword ptr [rbp+0x10], 14
+       vmovss   xmm0, dword ptr [rbp+0x10]

Vector128.CreateScalar(ref double):

-       vxorps   xmm0, xmm0, xmm0
-       vmovsd   xmm1, qword ptr [rbp-0x08]
-       vmovsd   xmm0, xmm0, xmm1
+       vmovsd   xmm0, qword ptr [rbp-0x08]

Vector128.CreateScalarUnsafe(ref short):

-       movsx    rcx, word  ptr [rbp+0x10]
-       vmovd    xmm0, ecx
+       vpinsrw  xmm0, xmm0, word  ptr [rbp+0x10], 0

ref byte b = Vector128<byte>.ToScalar():

-       vmovd    r9d, xmm3
-       mov      byte  ptr [r10], r9b
+       vpextrb  byte  ptr [r10], xmm3, 0

And the less realistic, but still interesting
Sse.AddScalar(Vector128.CreateScalar(ref float), Vector128.CreateScalar(ref float)).ToScalar():

-       xorps    xmm0, xmm0
-       movss    xmm1, dword ptr [rcx]
-       movss    xmm0, xmm1
-       xorps    xmm1, xmm1
-       movss    xmm2, dword ptr [rdx]
-       movss    xmm1, xmm2
-       addss    xmm0, xmm1
+       movss    xmm0, dword ptr [rcx]
+       addss    xmm0, dword ptr [rdx]
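For readers less familiar with these APIs, the semantics behind the last diff can be sketched in C++ with SSE intrinsics. This is only an analogue written for illustration, not JIT code: the helper names `create_scalar`, `to_scalar`, `add_scalar_from_memory`, and `check_create_scalar` are hypothetical, and a C++ compiler is free to pick different instructions than the ones shown above. The point is that a scalar load from memory already zeroes the upper elements, so no separate zeroing and merge is needed. It assumes an x86 or x64 target with SSE enabled.

```cpp
#include <immintrin.h>
#include <cassert>

// Rough analogue of Vector128.CreateScalar(ref float): _mm_load_ss reads one
// float into element 0 and zeroes elements 1..3, which matches what a movss
// load from memory does.
static inline __m128 create_scalar(const float* p) {
    return _mm_load_ss(p);
}

// Rough analogue of Vector128<float>.ToScalar(): returns element 0.
static inline float to_scalar(__m128 v) {
    return _mm_cvtss_f32(v);
}

// Analogue of Sse.AddScalar(CreateScalar(ref a), CreateScalar(ref b)).ToScalar().
float add_scalar_from_memory(const float* a, const float* b) {
    return to_scalar(_mm_add_ss(create_scalar(a), create_scalar(b)));
}

// Hypothetical self-check: the upper elements really are zero after the
// scalar load, and the scalar add produces the expected sum.
void check_create_scalar() {
    float a = 1.5f, b = 2.25f;
    float lanes[4];
    _mm_storeu_ps(lanes, create_scalar(&a));
    assert(lanes[0] == 1.5f && lanes[1] == 0.0f && lanes[2] == 0.0f && lanes[3] == 0.0f);
    assert(add_scalar_from_memory(&a, &b) == 3.75f);
}
```

Calling `check_create_scalar()` from a test harness exercises both properties.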

x86 diffs are much more significant, because of the newly-enabled intrinsic expansion:

Collection	Base size (bytes)	Diff size (bytes)	PerfScore in Diffs
benchmarks.run.windows.x86.checked.mch	6,177,604	-1,835	-2.29%
benchmarks.run_pgo.windows.x86.checked.mch	18,273,858	-711	+0.05%
benchmarks.run_tiered.windows.x86.checked.mch	8,700,479	-922	+0.16%
coreclr_tests.run.windows.x86.checked.mch	296,857,667	-199,305	-5.90%
libraries.crossgen2.windows.x86.checked.mch	27,192,130	-15,832	-4.54%
libraries.pmi.windows.x86.checked.mch	31,504,629	-14,394	-2.23%
libraries_tests.run.windows.x86.Release.mch	178,679,365	-41,771	-1.96%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	99,804,605	-79,997	-4.08%
realworld.run.windows.x86.checked.mch	8,787,045	-404	-0.39%

dotnet-policy-service · 2025-02-20T05:26:23Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

saucecontrol

This is ready for review.
cc @tannergooding

saucecontrol · 2025-02-22T00:07:49Z

src/coreclr/jit/lower.h

-        // Keep casts with operands usable from memory.
-        if (castOp->isContained() || castOp->IsRegOptional())
-        {
-            return op;
-        }


This condition, added in #72719, made this method effectively useless. Removing it was a zero-diff change. I can look in future at containing the casts rather than removing them.

saucecontrol · 2025-02-22T00:11:13Z

src/coreclr/jit/lowerxarch.cpp

@@ -4677,19 +4539,16 @@ GenTree* Lowering::LowerHWIntrinsicCreate(GenTreeHWIntrinsic* node)
        return LowerNode(node);
    }

-    GenTree* op2 = node->Op(2);
-
-    // TODO-XArch-AVX512 : Merge the NI_Vector512_Create and NI_Vector256_Create paths below.


The churn in this section is just taking care of this TODO.

saucecontrol · 2025-02-22T00:13:48Z

src/coreclr/jit/lowerxarch.cpp


            assert(comp->compIsaSupportedDebugOnly(InstructionSet_SSE2));

            tmp2 = InsertNewSimdCreateScalarUnsafeNode(TYP_SIMD16, op2, simdBaseJitType, 16);
            LowerNode(tmp2);

-            node->ResetHWIntrinsicId(NI_SSE_MoveLowToHigh, tmp1, tmp2);


Changing this to UnpackLow shows up as a regression in a few places, because movlhps is one byte smaller, but it enables other optimizations, since unpcklpd can take a memory operand and supports masking and embedded broadcast.

Vector128.Create(double, 1.0):

-       vmovups  xmm0, xmmword ptr [reloc @RWD00]
-       vmovlhps xmm0, xmm1, xmm0
+       vunpcklpd xmm0, xmm1, qword ptr [reloc @RWD00] {1to2}
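As a sanity check on why the two instructions are interchangeable for this purpose, here is a small C++ sketch using SSE2 intrinsics. It is an illustration only, not the JIT's lowering: the helper names are hypothetical, and whether a C++ compiler folds the load in `create_from_memory` into a memory or broadcast operand is not guaranteed. It assumes an x86 or x64 target with SSE2.

```cpp
#include <immintrin.h>
#include <cassert>

// unpcklpd: result = { a[0], b[0] }.
static inline __m128d combine_unpacklo(__m128d a, __m128d b) {
    return _mm_unpacklo_pd(a, b);
}

// movlhps: copies the low 64 bits of b into the high 64 bits of a, so when
// the data is viewed as doubles the result is also { a[0], b[0] }.
static inline __m128d combine_movelh(__m128d a, __m128d b) {
    return _mm_castps_pd(_mm_movelh_ps(_mm_castpd_ps(a), _mm_castpd_ps(b)));
}

// Shape of Vector128.Create(x, *p): the second operand comes from memory.
// Broadcasting *p first mirrors the {1to2} form in the diff above.
static inline __m128d create_from_memory(double x, const double* p) {
    return _mm_unpacklo_pd(_mm_set_sd(x), _mm_set1_pd(*p));
}

// Element-wise equality of two 2 x double vectors.
bool same_lanes(__m128d x, __m128d y) {
    double a[2], b[2];
    _mm_storeu_pd(a, x);
    _mm_storeu_pd(b, y);
    return a[0] == b[0] && a[1] == b[1];
}

// Hypothetical self-check of the equivalence.
void check_unpacklo_equivalence() {
    __m128d a = _mm_set_pd(9.0, 3.0);  // a[0] = 3.0, a[1] = 9.0
    __m128d b = _mm_set_pd(8.0, 1.0);  // b[0] = 1.0, b[1] = 8.0
    assert(same_lanes(combine_unpacklo(a, b), combine_movelh(a, b)));
    double one = 1.0;
    assert(same_lanes(create_from_memory(3.0, &one), _mm_set_pd(1.0, 3.0)));
}
```

The practical difference is only in the encodings: movlhps has register-only operands, so the constant must be loaded first, while unpcklpd can read its second operand from memory.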

saucecontrol · 2025-02-22T00:17:06Z

src/coreclr/jit/decomposelongs.cpp

+    if (varDsc->lvIsParam)
+    {
+        // Promotion blocks combined read optimizations for SIMD loads of long params
+        return;
+    }


In isolation, this change produced a small number of diffs and was mostly an improvement. A few regressions show up in the SPMI reports, but the overall impact is good, especially considering the places where we can load a long into a vector with movq.
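To make the movq point concrete, the following C++ sketch contrasts a single 64-bit load with reassembling the value from two 32-bit halves, which is roughly what a long that has been split into lo/hi parts on 32-bit x86 would require. It is an analogue for illustration, not output from the JIT; the helper names are hypothetical and an x86 or x64 target with SSE2 is assumed.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// One 64-bit load into the low half of the vector, upper half zeroed.
// This is the operation that movq xmm, m64 performs.
static inline __m128i load_long_movq(const int64_t* p) {
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));
}

// The same value built from separate 32-bit halves: two scalar moves plus
// an unpack, giving { lo, hi, 0, 0 } as 32-bit elements.
static inline __m128i load_long_from_halves(uint32_t lo, uint32_t hi) {
    __m128i vlo = _mm_cvtsi32_si128(static_cast<int>(lo));
    __m128i vhi = _mm_cvtsi32_si128(static_cast<int>(hi));
    return _mm_unpacklo_epi32(vlo, vhi);
}

// Bitwise equality of two 128-bit vectors.
bool same_bits(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}

// Hypothetical self-check: both paths produce identical vectors.
void check_long_load() {
    int64_t v = 0x0123456789ABCDEFLL;
    assert(same_bits(load_long_movq(&v),
                     load_long_from_halves(0x89ABCDEFu, 0x01234567u)));
}
```

Both paths yield the same vector, but the first needs only one memory read, which is the combined read that the comment in the diff refers to.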

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 20, 2025

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 20, 2025

build-analysis bot mentioned this pull request Feb 20, 2025

LibraryImportGenerator.Unit.Tests crashing on linux-x64 mono interpreter #100800

Open

saucecontrol force-pushed the createscalar64 branch from 197fac5 to 628d4f8 Compare February 20, 2025 18:52

unblock long xplat intrinsics on x86

3a130c8

saucecontrol force-pushed the createscalar64 branch from 628d4f8 to 3a130c8 Compare February 21, 2025 06:31

This was referenced Feb 21, 2025

System.Numerics.Tensors.Tests.ConvertTests.ConvertChecked failing with System.OverflowException #112286

Open

System.Numerics.Tensors.Tests.ConvertTests.ConvertChecked test failure #112755

Open

saucecontrol added 2 commits February 21, 2025 11:51

tidying

7f220c2

tidying2

78dc31d

saucecontrol marked this pull request as ready for review February 22, 2025 00:18
