Merge pull request #13 from simdutf/AVX2_UTF8_validation_disposable
Simplified version of AVX2 UTF-8 validation
Nick-Nuon committed Feb 28, 2024
2 parents 719b85f + 1a5e552 commit ee9ee16
Showing 42 changed files with 56,810 additions and 249 deletions.
48 changes: 34 additions & 14 deletions README.md
@@ -3,29 +3,21 @@

This is a fast C# library to process Unicode strings.

*It is currently not meant to be usable.*

## Motivation

The most important immediate goal would be to speed up the
`Utf8Utility.GetPointerToFirstInvalidByte` function.
We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function using the algorithm adopted by Node.js, Oracle GraalVM, and other important systems.

https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs


(We may need to speed up `Ascii.GetIndexOfFirstNonAsciiByte` first, see issue https://github.com/simdutf/SimdUnicode/issues/1.)
- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021

The question is whether we could do it using this routine:
The function is private in the Runtime, but we can expose it manually.

* John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021

Our generic implementation is available here: https://github.com/simdutf/simdutf/blob/master/src/generic/utf8_validation/utf8_lookup4_algorithm.h
https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs
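
To make the goal concrete, here is a minimal scalar sketch of what "finding the first invalid byte" means. It is not the Runtime's implementation nor the SIMD algorithm from the paper: it relies only on the public `Rune.DecodeFromUtf8` API, and the names `Utf8Reference` and `IndexOfFirstInvalidByte` are hypothetical, chosen for illustration. A routine like this could serve as a slow reference when checking a faster validator.

```C#
using System;
using System.Buffers;
using System.Text;

public static class Utf8Reference
{
    // Hypothetical helper (not part of the Runtime or of this library).
    // Returns the offset of the first byte that starts an invalid or
    // truncated UTF-8 sequence, or -1 if the whole input is valid.
    public static int IndexOfFirstInvalidByte(ReadOnlySpan<byte> utf8)
    {
        int offset = 0;
        while (offset < utf8.Length)
        {
            // Decode one code point starting at the current offset.
            OperationStatus status = Rune.DecodeFromUtf8(utf8.Slice(offset), out _, out int consumed);
            if (status != OperationStatus.Done)
            {
                return offset; // InvalidData or NeedMoreData (truncated sequence)
            }
            offset += consumed;
        }
        return -1;
    }

    public static void Main()
    {
        // Valid input: "héllo" contains a two-byte sequence.
        Console.WriteLine(IndexOfFirstInvalidByte(Encoding.UTF8.GetBytes("héllo"))); // prints -1
        // 0xFF can never appear in UTF-8, so the first invalid byte is at offset 1.
        Console.WriteLine(IndexOfFirstInvalidByte(new byte[] { 0x61, 0xFF, 0x62 })); // prints 1
    }
}
```

The sketch processes one code point per iteration, which is exactly the kind of byte-at-a-time work a vectorized validator tries to avoid.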

Porting it to C# is no joke, but doable.

## Requirements

We recommend you install .NET 7: https://dotnet.microsoft.com/en-us/download/dotnet/7.0
We recommend you install .NET 8: https://dotnet.microsoft.com/en-us/download/dotnet/8.0


## Running tests
@@ -35,8 +27,21 @@
```
cd test
dotnet test
```

To get a list of available tests, enter the command:

```
dotnet test --list-tests
```

To run specific tests, it is helpful to use the filter parameter:

```
dotnet test -c Release --filter Ascii
```

## Running Benchmarks

To run the benchmarks, run the following command:
```
cd benchmark
dotnet run -c Release
@@ -49,7 +54,6 @@ cd benchmark
sudo dotnet run -c Release
```


## Building the library

```
@@ -66,10 +70,26 @@ cd test
dotnet format
```

## Programming tips

You can print the content of a vector register like so:

```C#
public static void ToString(Vector256<byte> v)
{
    // Copy the 32 bytes of the register to the stack, then print them as hexadecimal.
    Span<byte> b = stackalloc byte[32];
    v.CopyTo(b);
    Console.WriteLine(Convert.ToHexString(b));
}
```
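
As a usage sketch, the same idea can be packaged as a small standalone program. The class name `VectorDebug` and the method name `ToHex` are hypothetical, chosen for illustration. This variant returns the string instead of printing it, which can be more convenient inside test assertions. It assumes .NET 7 or later, where `CopyTo` is available for `Vector256<T>`.

```C#
using System;
using System.Runtime.Intrinsics;

public static class VectorDebug
{
    // Hypothetical helper: returns the bytes of the vector as a hexadecimal
    // string, in memory order (element 0 first).
    public static string ToHex(Vector256<byte> v)
    {
        Span<byte> b = stackalloc byte[Vector256<byte>.Count]; // 32 bytes
        v.CopyTo(b);
        return Convert.ToHexString(b);
    }

    public static void Main()
    {
        // Broadcast 0xAB to all 32 lanes; the result is "AB" repeated 32 times.
        Vector256<byte> v = Vector256.Create((byte)0xAB);
        Console.WriteLine(ToHex(v));
    }
}
```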

## More reading


https://github.com/dotnet/coreclr/pull/21948/files#diff-2a22774bd6bff8e217ecbb3a41afad033ce0ca0f33645e9d8f5bdf7c9e3ac248

https://github.com/dotnet/runtime/issues/41699

https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/

https://learn.microsoft.com/en-us/dotnet/csharp/fundamentals/coding-style/coding-conventions
6 changes: 5 additions & 1 deletion benchmark/CS_runtime.cs → benchmark/ASCII_runtime.cs
@@ -8,15 +8,18 @@
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// This is from the Runtime. Copy/pasted as I found no other way to benchmark it.

// Changes from the original:
// - Copied CompExactlyDependsOnAttribute : Attribute into the System.Text namespace.
// - Copied StoreLowerUnsafe into the Ascii class.
// - The various Vector.Size values likely refer to the size in bytes, so we
//   replaced all instances of Vector512.Size by 64 (see: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector512.cs,77df495766d5de9c),
//   Vector256.Size by 32 (see: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256.cs,877aa6254c4e4d00),
//   and Vector128.Size by 16 (see: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs,eb1e72a6f843c5a5).
// - GetIndexOfFirstNonAsciiByte is no longer internal.

namespace Competition
namespace DotnetRuntime
{
// I copy pasted this from: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/CompExactlyDependsOnAttribute.cs
// Use this attribute to indicate that a function should only be compiled into a Ready2Run
@@ -213,6 +216,7 @@ private static bool FirstCharInUInt32IsAscii(uint value)
/// <returns>An ASCII byte is defined as 0x00 - 0x7F, inclusive.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static unsafe nuint GetIndexOfFirstNonAsciiByte(byte* pBuffer, nuint bufferLength)
// internal static unsafe nuint GetIndexOfFirstNonAsciiByte(byte* pBuffer, nuint bufferLength)
{
// If 256/512-bit aren't supported but SSE2 is supported, use those specific intrinsics instead of
// the generic vectorized code. This has two benefits: (a) we can take advantage of specific instructions