Baseline Avx2 + Scalar GetPointerToFirstInvalidByte /w tests and benchmarks #10

Merged
merged 88 commits into from
Feb 28, 2024

Nick-Nuon
Collaborator

@Nick-Nuon Nick-Nuon commented Nov 20, 2023

This is a draft pull request.

What I've done so far:
- Added the content of the scalar GetPointerToFirstInvalidByte PR.
- Added tests and benchmarks for that function.
- Rechecked the code for the UTF-8 random generator and made more use of C#'s native functions.
- For the AVX2 part, everything compiles. It still needs a lot of polish, and there are one or two functions that I know do the wrong thing, but roughly speaking the structure is there, so hopefully it won't take too long to finish.

This morning I am adding the final benchmarks for the scalar version and finalizing my review of last week's PR.

@Nick-Nuon
Collaborator Author

Just a quick update since last week:

A) I partially fixed the memory allocation. It turns out that this part

                // byte[] maxArray = new byte[32]
                // {
                //         255, 255, 255, 255, 255, 255, 255, 255,
                //         255, 255, 255, 255, 255, 255, 255, 255,
                //         255, 255, 255, 255, 255, 255, 255, 255,
                //         255, 255, 255, 255, 255, 0b11110000 - 1, 0b11100000 - 1, 0b11000000 - 1
                // };
                // Vector256<byte> max_value = Vector256.Create(maxArray);

inside is_incomplete was the culprit, so I made it static and moved it out of the function. There are some light improvements, but no cigar.

B) I also ported this optimization: simdjson/simdjson#2113
In this particular PR, it doesn't seem to improve performance by much, but if anything it makes the code a bit cleaner.
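
For point A, the hoisting can be sketched roughly as follows. This is an illustration, not the PR's code: the class name `IncompleteCheck`, the field name `MaxValue`, and the use of the portable `Vector256.GreaterThan` in place of a saturating subtraction are assumptions made to keep the example self-contained.

```csharp
using System.Runtime.Intrinsics;

public static class IncompleteCheck
{
    // Built once when the type is initialized. Creating a byte[] inside the
    // method (as in the commented-out code above) allocates on every call.
    private static readonly Vector256<byte> MaxValue = Vector256.Create(
        (byte)255, 255, 255, 255, 255, 255, 255, 255,
        255, 255, 255, 255, 255, 255, 255, 255,
        255, 255, 255, 255, 255, 255, 255, 255,
        255, 255, 255, 255, 255, 0b11110000 - 1, 0b11100000 - 1, 0b11000000 - 1);

    // Lanes with all bits set mark a multi-byte sequence that starts in the
    // last three bytes of the block and so must continue into the next block.
    public static Vector256<byte> IsIncomplete(Vector256<byte> input)
        => Vector256.GreaterThan(input, MaxValue);
}
```

Because `Vector256<byte>` is a value type, reading the static field does not allocate. Any allocation that still shows up in the benchmarks would have to come from somewhere else.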


|                                     Method |    N |         Mean |      Error |     StdDev |   Gen0 | Allocated |
|------------------------------------------- |----- |-------------:|-----------:|-----------:|-------:|----------:|
|          Error_GetIndexOfFirstNonAsciiByte |  100 |     12.60 us |   0.249 us |   0.402 us |      - |         - |
|  Error_Runtime_GetIndexOfFirstNonAsciiByte |  100 |     23.54 us |   0.323 us |   0.302 us |      - |         - |
|        allAsciiGetIndexOfFirstNonAsciiByte |  100 |     17.83 us |   0.354 us |   0.639 us |      - |         - |
| AllAsciiRuntimeGetIndexOfFirstNonAsciiByte |  100 |     22.78 us |   0.453 us |   0.678 us |      - |         - |
|              ScalarUtf8ValidationValidUtf8 |  100 |    247.74 us |   4.711 us |   4.838 us |      - |         - |
|         CompetitionUtf8ValidationValidUtf8 |  100 |    178.25 us |   3.198 us |   2.991 us |      - |         - |
|                SIMDUtf8ValidationValidUtf8 |  100 |    169.25 us |   3.100 us |   2.748 us | 0.4883 |   56000 B |
|              ScalarUtf8ValidationErrorUtf8 |  100 |    113.92 us |   1.586 us |   1.406 us |      - |         - |
|         CompetitionUtf8ValidationErrorUtf8 |  100 |     82.67 us |   1.642 us |   1.757 us |      - |         - |
|                SIMDUtf8ValidationErrorUtf8 |  100 |    160.79 us |   3.097 us |   2.896 us | 0.4883 |   56000 B |
|          Error_GetIndexOfFirstNonAsciiByte | 8000 |     19.37 us |   0.380 us |   0.373 us |      - |         - |
|  Error_Runtime_GetIndexOfFirstNonAsciiByte | 8000 |     34.58 us |   0.686 us |   1.238 us |      - |         - |
|        allAsciiGetIndexOfFirstNonAsciiByte | 8000 |  1,826.10 us |  30.585 us |  30.038 us |      - |         - |
| AllAsciiRuntimeGetIndexOfFirstNonAsciiByte | 8000 |  1,708.11 us |  21.413 us |  18.982 us |      - |         - |
|              ScalarUtf8ValidationValidUtf8 | 8000 | 20,261.69 us | 396.510 us | 370.896 us |      - |         - |
|         CompetitionUtf8ValidationValidUtf8 | 8000 | 14,485.74 us |  83.979 us |  70.126 us |      - |         - |
|                SIMDUtf8ValidationValidUtf8 | 8000 |  8,506.82 us |  49.775 us |  46.560 us |      - |   34004 B |
|              ScalarUtf8ValidationErrorUtf8 | 8000 | 10,011.89 us | 199.492 us | 186.605 us |      - |         - |
|         CompetitionUtf8ValidationErrorUtf8 | 8000 |  7,451.44 us | 106.267 us |  99.402 us |      - |         - |
|                SIMDUtf8ValidationErrorUtf8 | 8000 |  8,828.09 us |  96.876 us |  90.618 us |      - |   34004 B |
|                             RuntimeIsAscii |  100 |     27.13 us |   0.514 us |   0.505 us |      - |         - |
|                             RuntimeIsAscii | 8000 |  3,543.28 us |  70.348 us | 179.058 us |      - |         - |
|                         FastUnicodeIsAscii |  100 |     41.19 us |   0.806 us |   1.278 us |      - |         - |
|                         FastUnicodeIsAscii | 8000 |  4,225.74 us |  81.081 us |  71.876 us |      - |         - |

|                                         Method |               FileName |           Mean |         Error |        StdDev | Allocated |
|----------------------------------------------- |----------------------- |---------------:|--------------:|--------------:|----------:|
| SimDUnicodeGetIndexOfFirstNonAsciiByteRealData |   data/arabic.utf8.txt |       2.647 ns |     0.0761 ns |     0.0906 ns |         - |
|     RuntimeGetIndexOfFirstNonAsciiByteRealData |   data/arabic.utf8.txt |       3.038 ns |     0.0340 ns |     0.0301 ns |         - |
|              CompetitionUtf8ValidationRealData |   data/arabic.utf8.txt | 194,915.258 ns |   229.5872 ns |   191.7157 ns |         - |
|                   ScalarUtf8ValidationRealData |   data/arabic.utf8.txt | 452,663.139 ns | 1,981.5718 ns | 1,654.7022 ns |         - |
|                     SIMDUtf8ValidationRealData |   data/arabic.utf8.txt | 380,283.241 ns | 3,672.5271 ns | 3,435.2841 ns |      56 B |
|                  ScalarUtf8ValidationErrorData |   data/arabic.utf8.txt | 269,269.744 ns | 2,750.6511 ns | 2,572.9608 ns |         - |
|                    SIMDUtf8ValidationErrorData |   data/arabic.utf8.txt | 359,551.161 ns | 6,823.4852 ns | 6,701.5748 ns |      56 B |
|             CompetitionUtf8ValidationErrorData |   data/arabic.utf8.txt | 131,805.240 ns |   168.8621 ns |   149.6919 ns |         - |
| SimDUnicodeGetIndexOfFirstNonAsciiByteRealData |  data/chinese.utf8.txt |       2.566 ns |     0.0095 ns |     0.0074 ns |         - |
|     RuntimeGetIndexOfFirstNonAsciiByteRealData |  data/chinese.utf8.txt |       3.034 ns |     0.0444 ns |     0.0416 ns |         - |
|              CompetitionUtf8ValidationRealData |  data/chinese.utf8.txt |  29,881.585 ns |   440.7487 ns |   368.0452 ns |         - |
|                   ScalarUtf8ValidationRealData |  data/chinese.utf8.txt | 127,053.606 ns | 1,181.9201 ns |   986.9568 ns |         - |
|                     SIMDUtf8ValidationRealData |  data/chinese.utf8.txt | 130,175.978 ns |   930.1571 ns |   870.0695 ns |      56 B |
|                  ScalarUtf8ValidationErrorData |  data/chinese.utf8.txt |  15,953.696 ns |   266.8674 ns |   236.5710 ns |         - |
|                    SIMDUtf8ValidationErrorData |  data/chinese.utf8.txt | 130,477.342 ns |   668.3972 ns |   592.5167 ns |      56 B |
|             CompetitionUtf8ValidationErrorData |  data/chinese.utf8.txt |   5,310.082 ns |    72.9779 ns |    68.2635 ns |         - |
| SimDUnicodeGetIndexOfFirstNonAsciiByteRealData |  data/english.utf8.txt |      21.848 ns |     0.0805 ns |     0.0714 ns |         - |
|     RuntimeGetIndexOfFirstNonAsciiByteRealData |  data/english.utf8.txt |      21.091 ns |     0.1198 ns |     0.1121 ns |         - |
|              CompetitionUtf8ValidationRealData |  data/english.utf8.txt |  15,110.929 ns |    30.2074 ns |    28.2560 ns |         - |
|                   ScalarUtf8ValidationRealData |  data/english.utf8.txt |  10,897.922 ns |     7.7879 ns |     6.9038 ns |         - |
|                     SIMDUtf8ValidationRealData |  data/english.utf8.txt |  36,164.394 ns |    65.0368 ns |    57.6534 ns |      56 B |
|                  ScalarUtf8ValidationErrorData |  data/english.utf8.txt |  10,915.330 ns |    25.6587 ns |    22.7457 ns |         - |
|                    SIMDUtf8ValidationErrorData |  data/english.utf8.txt |  36,699.325 ns |   161.2595 ns |   142.9524 ns |      56 B |
|             CompetitionUtf8ValidationErrorData |  data/english.utf8.txt |  11,685.818 ns |   230.6991 ns |   236.9110 ns |         - |
| SimDUnicodeGetIndexOfFirstNonAsciiByteRealData |   data/french.utf8.txt |       2.968 ns |     0.0065 ns |     0.0051 ns |         - |
|     RuntimeGetIndexOfFirstNonAsciiByteRealData |   data/french.utf8.txt |       4.053 ns |     0.0217 ns |     0.0192 ns |         - |
|              CompetitionUtf8ValidationRealData |   data/french.utf8.txt |  72,476.021 ns |   290.5630 ns |   242.6332 ns |         - |
|                   ScalarUtf8ValidationRealData |   data/french.utf8.txt |  12,512.942 ns |    97.6904 ns |    91.3797 ns |         - |
|                     SIMDUtf8ValidationRealData |   data/french.utf8.txt | 228,035.958 ns | 1,057.1018 ns |   937.0932 ns |      56 B |
|                  ScalarUtf8ValidationErrorData |   data/french.utf8.txt |  12,515.675 ns |    77.3122 ns |    72.3179 ns |         - |
|                    SIMDUtf8ValidationErrorData |   data/french.utf8.txt | 236,959.070 ns |   402.9341 ns |   357.1905 ns |      56 B |
|             CompetitionUtf8ValidationErrorData |   data/french.utf8.txt |  22,235.813 ns |    57.1673 ns |    53.4743 ns |         - |
| SimDUnicodeGetIndexOfFirstNonAsciiByteRealData |   data/german.utf8.txt |       4.348 ns |     0.0090 ns |     0.0075 ns |         - |
|     RuntimeGetIndexOfFirstNonAsciiByteRealData |   data/german.utf8.txt |       5.328 ns |     0.0986 ns |     0.0823 ns |         - |
|              CompetitionUtf8ValidationRealData |   data/german.utf8.txt |  14,707.576 ns |   232.2501 ns |   228.1007 ns |         - |
|                   ScalarUtf8ValidationRealData |   data/german.utf8.txt |   5,919.259 ns |   111.8338 ns |   160.3887 ns |         - |
|                     SIMDUtf8ValidationRealData |   data/german.utf8.txt |  75,705.678 ns |   612.8447 ns |   543.2709 ns |      56 B |
|                  ScalarUtf8ValidationErrorData |   data/german.utf8.txt |   5,812.766 ns |    66.0000 ns |    55.1130 ns |         - |
|                    SIMDUtf8ValidationErrorData |   data/german.utf8.txt |  70,121.272 ns |   194.9313 ns |   182.3389 ns |      56 B |
|             CompetitionUtf8ValidationErrorData |   data/german.utf8.txt |   6,997.482 ns |    82.6602 ns |    77.3204 ns |         - |
| SimDUnicodeGetIndexOfFirstNonAsciiByteRealData | data/japanese.utf8.txt |       2.618 ns |     0.0074 ns |     0.0062 ns |         - |
|     RuntimeGetIndexOfFirstNonAsciiByteRealData | data/japanese.utf8.txt |       3.048 ns |     0.0513 ns |     0.0455 ns |         - |
|              CompetitionUtf8ValidationRealData | data/japanese.utf8.txt |  24,984.446 ns |    93.5948 ns |    82.9694 ns |         - |
|                   ScalarUtf8ValidationRealData | data/japanese.utf8.txt | 131,955.738 ns |   408.5018 ns |   318.9313 ns |         - |
|                     SIMDUtf8ValidationRealData | data/japanese.utf8.txt | 118,223.594 ns |   965.6598 ns |   856.0322 ns |      56 B |
|                  ScalarUtf8ValidationErrorData | data/japanese.utf8.txt |  71,848.170 ns |   701.1108 ns |   621.5165 ns |         - |
|                    SIMDUtf8ValidationErrorData | data/japanese.utf8.txt | 118,406.131 ns |   342.9830 ns |   267.7786 ns |      56 B |
|             CompetitionUtf8ValidationErrorData | data/japanese.utf8.txt |  17,512.550 ns |   173.7209 ns |   162.4986 ns |         - |
| SimDUnicodeGetIndexOfFirstNonAsciiByteRealData |  data/turkish.utf8.txt |       2.583 ns |     0.0091 ns |     0.0076 ns |         - |
|     RuntimeGetIndexOfFirstNonAsciiByteRealData |  data/turkish.utf8.txt |       3.048 ns |     0.0253 ns |     0.0225 ns |         - |
|              CompetitionUtf8ValidationRealData |  data/turkish.utf8.txt |  25,001.746 ns |   491.8286 ns |   566.3906 ns |         - |
|                   ScalarUtf8ValidationRealData |  data/turkish.utf8.txt | 125,196.423 ns |   843.0279 ns |   788.5689 ns |         - |
|                     SIMDUtf8ValidationRealData |  data/turkish.utf8.txt | 117,193.867 ns |   325.1360 ns |   271.5033 ns |      56 B |
|                  ScalarUtf8ValidationErrorData |  data/turkish.utf8.txt |  78,627.716 ns |   661.9583 ns |   552.7651 ns |         - |
|                    SIMDUtf8ValidationErrorData |  data/turkish.utf8.txt | 110,759.805 ns | 1,171.5155 ns | 1,095.8363 ns |      56 B |
|             CompetitionUtf8ValidationErrorData |  data/turkish.utf8.txt |  22,579.817 ns |   411.5248 ns |   590.1966 ns |         - |

@lemire
Member

lemire commented Jan 29, 2024

I'm looking...

@lemire
Member

lemire commented Jan 30, 2024

Using Disasmo (https://marketplace.visualstudio.com/items?itemName=EgorBogatov.Disasmo&ssr=false#overview), I am getting the following assembly:

; Method SimdUnicode.Utf8Utility:GetPointerToFirstInvalidByte(ulong,int):ulong (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       push     rbp
       push     r15
       push     r14
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 824
       vzeroupper 
       vmovaps  xmmword ptr [rsp+0x320], xmm6
       vmovaps  xmmword ptr [rsp+0x310], xmm7
       vmovaps  xmmword ptr [rsp+0x300], xmm8
       vmovaps  xmmword ptr [rsp+0x2F0], xmm9
       vmovaps  xmmword ptr [rsp+0x2E0], xmm10
       vmovaps  xmmword ptr [rsp+0x2D0], xmm11
       vmovaps  xmmword ptr [rsp+0x2C0], xmm12
       vmovaps  xmmword ptr [rsp+0x2B0], xmm13
       vmovaps  xmmword ptr [rsp+0x2A0], xmm14
       vmovaps  xmmword ptr [rsp+0x290], xmm15
       lea      rbp, [rsp+0x360]
       vxorps   xmm4, xmm4, xmm4
       mov      rax, -432
       vmovdqa  xmmword ptr [rbp+rax-0xE0], xmm4
       vmovdqa  xmmword ptr [rbp+rax-0xD0], xmm4
       vmovdqa  xmmword ptr [rbp+rax-0xC0], xmm4
       add      rax, 48
       jne      SHORT  -5 instr
       mov      qword ptr [rbp-0xE0], rax
       mov      rax, 0x24B5C9AFE044
       mov      qword ptr [rbp-0xD8], rax

G_M000_IG02:                ;; offset=0x00BB
       mov      ebx, edx
       mov      rsi, rcx
       test     rsi, rsi
       je       SHORT G_M000_IG04

G_M000_IG03:                ;; offset=0x00C5
       test     ebx, ebx
       jg       G_M000_IG07

G_M000_IG04:                ;; offset=0x00CD
       mov      rax, rsi
       mov      rcx, 0x24B5C9AFE044
       cmp      qword ptr [rbp-0xD8], rcx
       je       SHORT G_M000_IG05
       call     CORINFO_HELP_FAIL_FAST

G_M000_IG05:                ;; offset=0x00E8
       nop      

G_M000_IG06:                ;; offset=0x00E9
       vmovaps  xmm6, xmmword ptr [rsp+0x320]
       vmovaps  xmm7, xmmword ptr [rsp+0x310]
       vmovaps  xmm8, xmmword ptr [rsp+0x300]
       vmovaps  xmm9, xmmword ptr [rsp+0x2F0]
       vmovaps  xmm10, xmmword ptr [rsp+0x2E0]
       vmovaps  xmm11, xmmword ptr [rsp+0x2D0]
       vmovaps  xmm12, xmmword ptr [rsp+0x2C0]
       vmovaps  xmm13, xmmword ptr [rsp+0x2B0]
       vmovaps  xmm14, xmmword ptr [rsp+0x2A0]
       vmovaps  xmm15, xmmword ptr [rsp+0x290]
       vzeroupper 
       add      rsp, 824
       pop      rbx
       pop      rsi
       pop      rdi
       pop      r14
       pop      r15
       pop      rbp
       ret      

G_M000_IG07:                ;; offset=0x0156
       test     byte  ptr [(reloc 0x7ff89cc90961)], 1
       je       G_M000_IG23

G_M000_IG08:                ;; offset=0x0163
       mov      rdx, 0x19447C01D78
       mov      rdx, gword ptr [rdx]
       mov      r8d, dword ptr [rdx+0x08]
       cmp      r8d, 32
       jl       G_M000_IG24
       vmovups  ymm6, ymmword ptr [rdx+0x10]
       vxorps   ymm7, ymm7, ymm7
       vxorps   ymm8, ymm8, ymm8
       vxorps   ymm9, ymm9, ymm9
       xor      edi, edi
       cmp      ebx, 32
       jl       G_M000_IG14

G_M000_IG09:                ;; offset=0x019C
       movsxd   rdx, edi
       vmovups  ymm10, ymmword ptr [rsi+rdx]
       vpmovmskb edx, ymm10
       test     edx, edx
       je       G_M000_IG13

G_M000_IG10:                ;; offset=0x01B1
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rbp-0x130], ymm0
       vmovups  ymmword ptr [rbp-0x2B0], ymm10
       vperm2i128 ymm0, ymm8, ymm10, 33
       vmovups  ymmword ptr [rbp-0x2D0], ymm0
       lea      rdx, [rbp-0x2B0]
       lea      r8, [rbp-0x2D0]
       lea      rcx, [rbp-0x130]
       mov      r9d, 15
       vextractf128 xmm9, ymm10, 1
       vextractf128 xmm11, ymm8, 1
       vextractf128 xmm12, ymm7, 1
       vextractf128 xmm13, ymm6, 1
       call     [System.Runtime.Intrinsics.X86.Avx2:AlignRight(System.Runtime.Intrinsics.Vector256`1[ubyte],System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rbp-0x170], ymm0
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rbp-0x190], ymm0
       vmovups  ymm0, ymmword ptr [rbp-0x130]
       vpsrlw   ymm0, ymm0, 4
       vmovups  ymmword ptr [rbp-0x2B0], ymm0
       mov      dword ptr [rsp+0x20], 2
       mov      dword ptr [rsp+0x28], 2
       mov      dword ptr [rsp+0x30], 2
       mov      dword ptr [rsp+0x38], 2
       mov      dword ptr [rsp+0x40], 2
       mov      dword ptr [rsp+0x48], 2
       mov      dword ptr [rsp+0x50], 128
       mov      dword ptr [rsp+0x58], 128
       mov      dword ptr [rsp+0x60], 128
       mov      dword ptr [rsp+0x68], 128
       mov      dword ptr [rsp+0x70], 33
       mov      dword ptr [rsp+0x78], 1
       mov      dword ptr [rsp+0x80], 21
       mov      dword ptr [rsp+0x88], 73
       lea      rdx, [rbp-0x2B0]
       lea      rcx, [rbp-0x150]
       mov      r8d, 2
       mov      r9d, 2
       call     [Vector256Extensions:Lookup16(System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vmovups  ymm0, ymmword ptr [rbp-0x130]
       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD00]
       vmovups  ymmword ptr [rbp-0x2B0], ymm0
       mov      dword ptr [rsp+0x20], 131
       mov      dword ptr [rsp+0x28], 131
       mov      dword ptr [rsp+0x30], 139
       mov      dword ptr [rsp+0x38], 203
       mov      dword ptr [rsp+0x40], 203
       mov      dword ptr [rsp+0x48], 203
       mov      dword ptr [rsp+0x50], 203
       mov      dword ptr [rsp+0x58], 203
       mov      dword ptr [rsp+0x60], 203
       mov      dword ptr [rsp+0x68], 203
       mov      dword ptr [rsp+0x70], 203
       mov      dword ptr [rsp+0x78], 219
       mov      dword ptr [rsp+0x80], 203
       mov      dword ptr [rsp+0x88], 203
       lea      rdx, [rbp-0x2B0]
       lea      rcx, [rbp-0x170]
       mov      r8d, 231
       mov      r9d, 163
       call     [Vector256Extensions:Lookup16(System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]

G_M000_IG11:                ;; offset=0x037D
       vinsertf128 ymm10, ymm10, xmm9, 1
       vpsrlw   ymm0, ymm10, 4
       vmovups  ymmword ptr [rbp-0x2B0], ymm0
       mov      dword ptr [rsp+0x20], 1
       mov      dword ptr [rsp+0x28], 1
       mov      dword ptr [rsp+0x30], 1
       mov      dword ptr [rsp+0x38], 1
       mov      dword ptr [rsp+0x40], 1
       mov      dword ptr [rsp+0x48], 1
       mov      dword ptr [rsp+0x50], 230
       mov      dword ptr [rsp+0x58], 174
       mov      dword ptr [rsp+0x60], 186
       mov      dword ptr [rsp+0x68], 186
       mov      dword ptr [rsp+0x70], 1
       mov      dword ptr [rsp+0x78], 1
       mov      dword ptr [rsp+0x80], 1
       mov      dword ptr [rsp+0x88], 1
       lea      rdx, [rbp-0x2B0]
       lea      rcx, [rbp-0x190]
       mov      r8d, 1
       mov      r9d, 1
       vextractf128 xmm9, ymm10, 1
       call     [Vector256Extensions:Lookup16(System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vmovups  ymm0, ymmword ptr [rbp-0x150]
       vpand    ymm0, ymm0, ymmword ptr [rbp-0x170]
       vpand    ymm14, ymm0, ymmword ptr [rbp-0x190]
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rbp-0x1B0], ymm0
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rbp-0x1D0], ymm0
       vinsertf128 ymm10, ymm10, xmm9, 1
       vmovups  ymmword ptr [rbp-0x2B0], ymm10
       vinsertf128 ymm8, ymm8, xmm11, 1
       vperm2i128 ymm0, ymm8, ymm10, 33
       vmovups  ymmword ptr [rbp-0x2D0], ymm0
       lea      rdx, [rbp-0x2B0]
       lea      r8, [rbp-0x2D0]
       lea      rcx, [rbp-0x1B0]
       mov      r9d, 14
       vextractf128 xmm9, ymm10, 1
       vextractf128 xmm11, ymm8, 1
       vextractf128 xmm15, ymm14, 1
       call     [System.Runtime.Intrinsics.X86.Avx2:AlignRight(System.Runtime.Intrinsics.Vector256`1[ubyte],System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vinsertf128 ymm10, ymm10, xmm9, 1
       vmovups  ymmword ptr [rbp-0x2B0], ymm10
       vinsertf128 ymm8, ymm8, xmm11, 1
       vperm2i128 ymm0, ymm8, ymm10, 33
       vmovups  ymmword ptr [rbp-0x2D0], ymm0
       lea      rdx, [rbp-0x2B0]
       lea      r8, [rbp-0x2D0]
       lea      rcx, [rbp-0x1D0]
       mov      r9d, 13
       vextractf128 xmm9, ymm10, 1
       call     [System.Runtime.Intrinsics.X86.Avx2:AlignRight(System.Runtime.Intrinsics.Vector256`1[ubyte],System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vmovups  ymm0, ymmword ptr [rbp-0x1B0]
       vpsubusb ymm0, ymm0, ymmword ptr [reloc @RWD32]
       vmovups  ymm1, ymmword ptr [rbp-0x1D0]
       vpsubusb ymm1, ymm1, ymmword ptr [reloc @RWD64]
       vpor     ymm0, ymm0, ymm1
       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD96]
       vinsertf128 ymm14, ymm14, xmm15, 1
       vpxor    ymm0, ymm0, ymm14

G_M000_IG12:                ;; offset=0x0532
       vinsertf128 ymm7, ymm7, xmm12, 1
       vpor     ymm7, ymm7, ymm0
       vinsertf128 ymm10, ymm10, xmm9, 1
       vinsertf128 ymm6, ymm6, xmm13, 1
       vpsubusw ymm0, ymm10, ymm6
       vmovaps  ymm9, ymm0

G_M000_IG13:                ;; offset=0x0550
       vmovaps  ymm8, ymm10
       add      edi, 32
       lea      edx, [rdi+0x20]
       cmp      edx, ebx
       jle      G_M000_IG09

G_M000_IG14:                ;; offset=0x0563
       cmp      edi, ebx
       jge      G_M000_IG20
       lea      r14, [rbp-0xF8]
       xor      edx, edx
       mov      r15d, ebx
       sub      r15d, edi
       test     r15d, r15d
       jle      SHORT G_M000_IG16
       align    [1 bytes for IG15]

G_M000_IG15:                ;; offset=0x0580
       cmp      edx, 32
       jae      G_M000_IG25
       mov      ecx, edx
       lea      eax, [rdi+rdx]
       cdqe     
       movzx    rax, byte  ptr [rax+rsi]
       mov      byte  ptr [r14+rcx], al
       inc      edx
       cmp      r15d, edx
       jg       SHORT G_M000_IG15

G_M000_IG16:                ;; offset=0x059F
       mov      edx, 32
       mov      rcx, 0x7FF89CC99F60
       vextractf128 xmm11, ymm8, 1
       vextractf128 xmm12, ymm7, 1
       vextractf128 xmm10, ymm9, 1
       vextractf128 xmm13, ymm6, 1
       call     CORINFO_HELP_NEWARR_1_VC
       lea      rdx, bword ptr [rax+0x10]
       vmovdqu  ymm0, ymmword ptr [r14]
       vmovdqu  ymmword ptr [rdx], ymm0
       mov      edx, dword ptr [rax+0x08]
       cmp      edx, 32
       vinsertf128 ymm8, ymm8, xmm11, 1
       vinsertf128 ymm7, ymm7, xmm12, 1
       vinsertf128 ymm9, ymm9, xmm10, 1
       vinsertf128 ymm6, ymm6, xmm13, 1
       jl       G_M000_IG24
       vmovups  ymm10, ymmword ptr [rax+0x10]
       vpmovmskb edx, ymm10
       test     edx, edx
       je       G_M000_IG19
       vmovups  ymmword ptr [rbp-0x2B0], ymm10
       vperm2i128 ymm0, ymm8, ymm10, 33
       vmovups  ymmword ptr [rbp-0x2D0], ymm0
       lea      rdx, [rbp-0x2B0]
       lea      r8, [rbp-0x2D0]
       lea      rcx, [rbp-0x1F0]
       mov      r9d, 15
       vextractf128 xmm11, ymm8, 1
       vextractf128 xmm12, ymm7, 1
       vextractf128 xmm9, ymm10, 1
       vextractf128 xmm13, ymm6, 1
       call     [System.Runtime.Intrinsics.X86.Avx2:AlignRight(System.Runtime.Intrinsics.Vector256`1[ubyte],System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vmovups  ymm0, ymmword ptr [rbp-0x1F0]
       vpsrlw   ymm0, ymm0, 4
       vmovups  ymmword ptr [rbp-0x2B0], ymm0
       mov      dword ptr [rsp+0x20], 2
       mov      dword ptr [rsp+0x28], 2
       mov      dword ptr [rsp+0x30], 2
       mov      dword ptr [rsp+0x38], 2
       mov      dword ptr [rsp+0x40], 2
       mov      dword ptr [rsp+0x48], 2
       mov      dword ptr [rsp+0x50], 128
       mov      dword ptr [rsp+0x58], 128
       mov      dword ptr [rsp+0x60], 128
       mov      dword ptr [rsp+0x68], 128
       mov      dword ptr [rsp+0x70], 33
       mov      dword ptr [rsp+0x78], 1
       mov      dword ptr [rsp+0x80], 21
       mov      dword ptr [rsp+0x88], 73
       lea      rdx, [rbp-0x2B0]
       lea      rcx, [rbp-0x210]
       mov      r8d, 2
       mov      r9d, 2
       call     [Vector256Extensions:Lookup16(System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vmovups  ymm0, ymmword ptr [rbp-0x1F0]
       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD00]
       vmovups  ymmword ptr [rbp-0x2B0], ymm0
       mov      dword ptr [rsp+0x20], 131

G_M000_IG17:                ;; offset=0x0728
       mov      dword ptr [rsp+0x28], 131
       mov      dword ptr [rsp+0x30], 139
       mov      dword ptr [rsp+0x38], 203
       mov      dword ptr [rsp+0x40], 203
       mov      dword ptr [rsp+0x48], 203
       mov      dword ptr [rsp+0x50], 203
       mov      dword ptr [rsp+0x58], 203
       mov      dword ptr [rsp+0x60], 203
       mov      dword ptr [rsp+0x68], 203
       mov      dword ptr [rsp+0x70], 203
       mov      dword ptr [rsp+0x78], 219
       mov      dword ptr [rsp+0x80], 203
       mov      dword ptr [rsp+0x88], 203
       lea      rdx, [rbp-0x2B0]
       lea      rcx, [rbp-0x230]
       mov      r8d, 231
       mov      r9d, 163
       call     [Vector256Extensions:Lookup16(System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vinsertf128 ymm10, ymm10, xmm9, 1
       vpsrlw   ymm0, ymm10, 4
       vmovups  ymmword ptr [rbp-0x2B0], ymm0
       mov      dword ptr [rsp+0x20], 1
       mov      dword ptr [rsp+0x28], 1
       mov      dword ptr [rsp+0x30], 1
       mov      dword ptr [rsp+0x38], 1
       mov      dword ptr [rsp+0x40], 1
       mov      dword ptr [rsp+0x48], 1
       mov      dword ptr [rsp+0x50], 230
       mov      dword ptr [rsp+0x58], 174
       mov      dword ptr [rsp+0x60], 186
       mov      dword ptr [rsp+0x68], 186
       mov      dword ptr [rsp+0x70], 1
       mov      dword ptr [rsp+0x78], 1
       mov      dword ptr [rsp+0x80], 1
       mov      dword ptr [rsp+0x88], 1
       lea      rdx, [rbp-0x2B0]
       lea      rcx, [rbp-0x250]
       mov      r8d, 1
       mov      r9d, 1
       vextractf128 xmm9, ymm10, 1
       call     [Vector256Extensions:Lookup16(System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vmovups  ymm0, ymmword ptr [rbp-0x210]
       vpand    ymm0, ymm0, ymmword ptr [rbp-0x230]
       vpand    ymm14, ymm0, ymmword ptr [rbp-0x250]
       vinsertf128 ymm10, ymm10, xmm9, 1
       vmovups  ymmword ptr [rbp-0x2B0], ymm10
       vinsertf128 ymm8, ymm8, xmm11, 1
       vperm2i128 ymm0, ymm8, ymm10, 33
       vmovups  ymmword ptr [rbp-0x2D0], ymm0
       lea      rdx, [rbp-0x2B0]
       lea      r8, [rbp-0x2D0]
       lea      rcx, [rbp-0x270]
       mov      r9d, 14
       vextractf128 xmm11, ymm8, 1
       vextractf128 xmm9, ymm10, 1
       vextractf128 xmm15, ymm14, 1
       call     [System.Runtime.Intrinsics.X86.Avx2:AlignRight(System.Runtime.Intrinsics.Vector256`1[ubyte],System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vinsertf128 ymm10, ymm10, xmm9, 1
       vmovups  ymmword ptr [rbp-0x2B0], ymm10
       vinsertf128 ymm8, ymm8, xmm11, 1
       vperm2i128 ymm0, ymm8, ymm10, 33
       vmovups  ymmword ptr [rbp-0x2D0], ymm0

G_M000_IG18:                ;; offset=0x08F5
       lea      rdx, [rbp-0x2B0]
       lea      r8, [rbp-0x2D0]
       lea      rcx, [rbp-0x290]
       mov      r9d, 13
       vextractf128 xmm9, ymm10, 1
       call     [System.Runtime.Intrinsics.X86.Avx2:AlignRight(System.Runtime.Intrinsics.Vector256`1[ubyte],System.Runtime.Intrinsics.Vector256`1[ubyte],ubyte):System.Runtime.Intrinsics.Vector256`1[ubyte]]
       vmovups  ymm0, ymmword ptr [rbp-0x270]
       vpsubusb ymm0, ymm0, ymmword ptr [reloc @RWD32]
       vmovups  ymm1, ymmword ptr [rbp-0x290]
       vpsubusb ymm1, ymm1, ymmword ptr [reloc @RWD64]
       vpor     ymm0, ymm0, ymm1
       vpand    ymm0, ymm0, ymmword ptr [reloc @RWD96]
       vinsertf128 ymm14, ymm14, xmm15, 1
       vpxor    ymm0, ymm0, ymm14
       vinsertf128 ymm7, ymm7, xmm12, 1
       vpor     ymm7, ymm7, ymm0
       vinsertf128 ymm10, ymm10, xmm9, 1
       vinsertf128 ymm6, ymm6, xmm13, 1
       vpsubusw ymm0, ymm10, ymm6
       vmovaps  ymm9, ymm0

G_M000_IG19:                ;; offset=0x0971
       add      r15d, edi
       mov      edi, r15d

G_M000_IG20:                ;; offset=0x0977
       vpor     ymm7, ymm7, ymm9
       movsxd   rax, ebx
       add      rax, rsi
       movsxd   rcx, edi
       add      rcx, rsi
       vptest   ymm7, ymm7
       cmovne   rax, rcx
       mov      rcx, 0x24B5C9AFE044
       cmp      qword ptr [rbp-0xD8], rcx
       je       SHORT G_M000_IG21
       call     CORINFO_HELP_FAIL_FAST

G_M000_IG21:                ;; offset=0x09A9
       nop      

G_M000_IG22:                ;; offset=0x09AA
       vmovaps  xmm6, xmmword ptr [rsp+0x320]
       vmovaps  xmm7, xmmword ptr [rsp+0x310]
       vmovaps  xmm8, xmmword ptr [rsp+0x300]
       vmovaps  xmm9, xmmword ptr [rsp+0x2F0]
       vmovaps  xmm10, xmmword ptr [rsp+0x2E0]
       vmovaps  xmm11, xmmword ptr [rsp+0x2D0]
       vmovaps  xmm12, xmmword ptr [rsp+0x2C0]
       vmovaps  xmm13, xmmword ptr [rsp+0x2B0]
       vmovaps  xmm14, xmmword ptr [rsp+0x2A0]
       vmovaps  xmm15, xmmword ptr [rsp+0x290]
       vzeroupper 
       add      rsp, 824
       pop      rbx
       pop      rsi
       pop      rdi
       pop      r14
       pop      r15
       pop      rbp
       ret      

G_M000_IG23:                ;; offset=0x0A17
       mov      rcx, 0x7FF89CC90928
       mov      edx, 9
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       jmp      G_M000_IG08

G_M000_IG24:                ;; offset=0x0A30
       call     [System.ThrowHelper:ThrowArgumentOutOfRange_IndexMustBeLessOrEqualException()]
       int3     

G_M000_IG25:                ;; offset=0x0A37
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
RWD00  	dq	0F0F0F0F0F0F0F0Fh, 0F0F0F0F0F0F0F0Fh, 0F0F0F0F0F0F0F0Fh, 0F0F0F0F0F0F0F0Fh
RWD32  	dq	6060606060606060h, 6060606060606060h, 6060606060606060h, 6060606060606060h
RWD64  	dq	7070707070707070h, 7070707070707070h, 7070707070707070h, 7070707070707070h
RWD96  	dq	8080808080808080h, 8080808080808080h, 8080808080808080h, 8080808080808080h
; Total bytes of code: 2621

checker.check_eof();
if (checker.errors())
{
return pInputBuffer + processedLength;
Member

This function only checks for errors at the end. So I expect that, whether checker.errors() is true or false, it will still scan the entire input.

Now, this is fine per se, but what you have implemented is a function that checks whether the input is valid or invalid. The equivalent of GetPointerToFirstInvalidByte in simdutf is validate_utf8_with_errors, and it seems that you have implemented validate_utf8. No big deal, but it makes any benchmarking premature.
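
The distinction can be sketched with a scalar stand-in. This is a hedged illustration only: `Utf8Sketch`, `SequenceLength`, `IsValid`, and `FirstInvalidIndex` are hypothetical names, the code returns an index rather than a pointer, and it is not the SIMD implementation from this PR or from simdutf.

```csharp
using System;

public static class Utf8Sketch
{
    // Length (1-4) of the well-formed sequence starting at pos, or 0 if the
    // bytes at pos do not begin a well-formed, complete UTF-8 sequence.
    private static int SequenceLength(ReadOnlySpan<byte> s, int pos)
    {
        byte b0 = s[pos];
        if (b0 < 0x80) return 1;                    // ASCII
        int len;
        byte lo = 0x80, hi = 0xBF;                  // allowed range of the second byte
        if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
        else if (b0 >= 0xE0 && b0 <= 0xEF)
        {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;              // reject overlong encodings
            if (b0 == 0xED) hi = 0x9F;              // reject UTF-16 surrogates
        }
        else if (b0 >= 0xF0 && b0 <= 0xF4)
        {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;              // reject overlong encodings
            if (b0 == 0xF4) hi = 0x8F;              // reject values above U+10FFFF
        }
        else return 0;                              // stray continuation or invalid lead byte
        if (pos + len > s.Length) return 0;         // truncated sequence
        if (s[pos + 1] < lo || s[pos + 1] > hi) return 0;
        for (int i = 2; i < len; i++)
            if ((s[pos + i] & 0xC0) != 0x80) return 0;
        return len;
    }

    // validate_utf8 style: accumulate an error flag and test it once at the
    // end, so the whole input is scanned even when an error occurs early.
    public static bool IsValid(ReadOnlySpan<byte> s)
    {
        bool error = false;
        int pos = 0;
        while (pos < s.Length)
        {
            int len = SequenceLength(s, pos);
            error |= len == 0;
            pos += Math.Max(len, 1);
        }
        return !error;
    }

    // validate_utf8_with_errors style: stop at the first bad sequence and
    // report where it starts (s.Length when the input is valid).
    public static int FirstInvalidIndex(ReadOnlySpan<byte> s)
    {
        int pos = 0;
        while (pos < s.Length)
        {
            int len = SequenceLength(s, pos);
            if (len == 0) return pos;
            pos += len;
        }
        return s.Length;
    }
}
```

In a SIMD version, the same idea would presumably mean testing the error accumulator after each block rather than only after `check_eof()`, then locating the exact byte within the failing region. With the early exit in place, the error-case benchmarks would measure the work up to the first error instead of a full scan.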

Collaborator Author

Gotcha! I'm on it first thing in the morning.

@Nick-Nuon Nick-Nuon merged commit a5f4c20 into main Feb 28, 2024
1 check passed