some cleaning #37

Merged
merged 13 commits into from
Jun 6, 2024
35 changes: 28 additions & 7 deletions README.md
@@ -6,15 +6,31 @@ This is a fast C# library to validate UTF-8 strings.

## Motivation

We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function. Using the algorithm used by Node.js, Oracle GraalVM and other important systems.

- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function from the C# runtime library.
[The function is private in the Microsoft Runtime](https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs), but we can expose it manually.

The algorithm in question is part of popular JavaScript runtimes such as Node.js and Bun, [by PHP](https://github.com/php/php-src/blob/90e0ce7f0db99767c58dc21e4213c0f8763f657a/ext/mbstring/mbstring.c#L5270), by Oracle GraalVM and many important systems.
Specifically, we provide the function `SimdUnicode.UTF8.GetPointerToFirstInvalidByte` which is a faster
drop-in replacement:
```cs
// Returns &inputBuffer[inputLength] if the input buffer is valid.
/// <summary>
/// Given an input buffer <paramref name="pInputBuffer"/> of byte length <paramref name="inputLength"/>,
/// returns a pointer to where the first invalid data appears in <paramref name="pInputBuffer"/>.
/// The parameter <paramref name="Utf16CodeUnitCountAdjustment"/> is set according to the content of the valid UTF-8 characters encountered, counting -1 for each 2-byte character, -2 for each 3-byte and 4-byte characters.
/// The parameter <paramref name="ScalarCodeUnitCountAdjustment"/> is set according to the content of the valid UTF-8 characters encountered, counting -1 for each 4-byte character.
/// </summary>
/// <remarks>
/// Returns a pointer to the end of <paramref name="pInputBuffer"/> if the buffer is well-formed.
/// </remarks>
public unsafe static byte* GetPointerToFirstInvalidByte(byte* pInputBuffer, int inputLength, out int Utf16CodeUnitCountAdjustment, out int ScalarCodeUnitCountAdjustment);
```
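
To make the meaning of the two `out` parameters concrete, here is a minimal sketch that derives them from the lead bytes of an input that is assumed to be valid UTF-8. The helper `ComputeAdjustments` is hypothetical and is not part of the library; it performs no validation and only illustrates the counting rule stated in the comment above.

```cs
using System;
using System.Text;

// Hypothetical helper (not part of SimdUnicode): derives the two adjustments
// from the lead bytes of an input that is ASSUMED to be valid UTF-8.
static (int Utf16Adjustment, int ScalarAdjustment) ComputeAdjustments(ReadOnlySpan<byte> utf8)
{
    int utf16Adjustment = 0;
    int scalarAdjustment = 0;
    foreach (byte b in utf8)
    {
        if (b >= 0xF0) { utf16Adjustment -= 2; scalarAdjustment -= 1; } // lead byte of a 4-byte character
        else if (b >= 0xE0) { utf16Adjustment -= 2; }                     // lead byte of a 3-byte character
        else if (b >= 0xC0) { utf16Adjustment -= 1; }                     // lead byte of a 2-byte character
        // ASCII bytes (< 0x80) and continuation bytes (0x80..0xBF) contribute nothing.
    }
    return (utf16Adjustment, scalarAdjustment);
}

// One character of each length: 'a' (1 byte), U+00E9 (2), U+20AC (3), U+1F600 (4).
string text = "a\u00E9\u20AC\U0001F600";
byte[] data = Encoding.UTF8.GetBytes(text);          // 10 bytes
var (utf16Adj, scalarAdj) = ComputeAdjustments(data);
int utf16Length = data.Length + utf16Adj;            // 10 - 5 = 5 UTF-16 code units
int scalarCount = utf16Length + scalarAdj;           // 5 - 1 = 4 characters
Console.WriteLine($"{utf16Adj} {scalarAdj} {utf16Length} {scalarCount}"); // prints "-5 -1 5 4"
```

In other words, for valid input the byte length plus the first adjustment gives the UTF-16 length (`text.Length`), and adding the second adjustment gives the number of Unicode scalar values.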

[The function is private in the Microsoft Runtime](https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs), but we can expose it manually.
The function uses advanced instructions (SIMD) on 64-bit ARM and x64 processors, but falls back on a
conventional implementation on other systems. We provide extensive tests and benchmarks.

We apply the algorithm used by Node.js, Bun, Oracle GraalVM, the PHP interpreter, and other important systems. The algorithm is described in the following article:

- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021


## Requirements
@@ -30,6 +46,11 @@ dotnet test

To see which tests are running, we recommend setting the verbosity level:

```
dotnet test -v=normal
```

More details could be useful:
```
dotnet test -v d
```
@@ -44,7 +65,7 @@ To run specific tests, it is helpful to use the filter parameter:


```
dotnet test --filter TooShortErrorAVX
dotnet test --filter TooShortErrorAvx2
```

Or to target specific categories:
@@ -89,7 +110,6 @@ dotnet build
We recommend you use `dotnet format`. E.g.,

```
cd test
dotnet format
```

@@ -115,6 +135,7 @@ You can print the content of a vector register like so:
## Performance tips

- Be careful: `Vector128.Shuffle` is not the same as `Ssse3.Shuffle` nor is `Vector128.Shuffle` the same as `Avx2.Shuffle`. Prefer the latter.
- Similarly, `Vector128.Shuffle` is not the same as `AdvSimd.Arm64.VectorTableLookup`; use the latter.
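
One reason these shuffles are not interchangeable is that they treat out-of-range indices differently, which likely forces the portable `Vector128.Shuffle` to do extra work on x64. The following sketch uses illustrative table and index values (not taken from the library) to show the difference for `Ssse3.Shuffle`; on systems without SSSE3 it only prints the portable result.

```cs
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// A 16-entry lookup table holding the values 10..25 (illustrative data only).
Vector128<byte> table = Vector128.Create(
    (byte)10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25);

// Lanes 0 and 1 use in-range indices; lane 2 uses 16 (out of range);
// lane 3 uses 0x80 (high bit set). The remaining lanes select entry 0.
Vector128<byte> indices = Vector128.Create(
    (byte)0, 1, 16, 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);

// Portable shuffle: any index outside 0..15 produces 0, so lane 2 is 0.
Vector128<byte> portable = Vector128.Shuffle(table, indices);
Console.WriteLine($"Vector128.Shuffle: {portable}");

if (Ssse3.IsSupported)
{
    // pshufb: only the low 4 bits of the index are used unless the high bit
    // is set, so index 16 wraps around to entry 0 and lane 2 is 10.
    Vector128<byte> native = Ssse3.Shuffle(table, indices);
    Console.WriteLine($"Ssse3.Shuffle:     {native}");
}
```

When the indices are known to be in range, as in a table lookup over nibbles, both functions return the same values, which is why the hardware-specific call can be used directly.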

## More reading

90 changes: 50 additions & 40 deletions benchmark/Benchmark.cs
@@ -27,13 +27,17 @@ public class Speed : IColumn
{
public string GetValue(Summary summary, BenchmarkCase benchmarkCase)
{
if (summary is null || benchmarkCase is null || benchmarkCase.Parameters is null)
{
return "N/A";
}
var ourReport = summary.Reports.First(x => x.BenchmarkCase.Equals(benchmarkCase));
var fileName = (string)benchmarkCase.Parameters["FileName"];
long length = new System.IO.FileInfo(fileName).Length;
if (ourReport.ResultStatistics is null)
if (ourReport is null || ourReport.ResultStatistics is null)
{
return "N/A";
}
long length = new System.IO.FileInfo(fileName).Length;
var mean = ourReport.ResultStatistics.Mean;
return $"{(length / ourReport.ResultStatistics.Mean):#####.00}";
}
@@ -46,8 +50,8 @@ public string GetValue(Summary summary, BenchmarkCase benchmarkCase)
public string ColumnName { get; } = "Speed (GB/s)";
public bool AlwaysShow { get; } = true;
public ColumnCategory Category { get; } = ColumnCategory.Custom;
public int PriorityInCategory { get; } = 0;
public bool IsNumeric { get; } = false;
public int PriorityInCategory { get; }
public bool IsNumeric { get; }
public UnitType UnitType { get; } = UnitType.Dimensionless;
public string Legend { get; } = "The speed in gigabytes per second";
}
@@ -57,8 +61,8 @@ public string GetValue(Summary summary, BenchmarkCase benchmarkCase)
[Config(typeof(Config))]
public class RealDataBenchmark
{

private class Config : ManualConfig
#pragma warning disable CA1812
private sealed class Config : ManualConfig
{
public Config()
{
@@ -67,6 +71,7 @@ public Config()

if (RuntimeInformation.ProcessArchitecture == Architecture.Arm64)
{
#pragma warning disable CA1303
Console.WriteLine("ARM64 system detected.");
AddFilter(new AnyCategoriesFilter(["arm64", "scalar", "runtime"]));

@@ -75,21 +80,25 @@ public Config()
{
if (Vector512.IsHardwareAccelerated && System.Runtime.Intrinsics.X86.Avx512Vbmi.IsSupported)
{
#pragma warning disable CA1303
Console.WriteLine("X64 system detected (Intel, AMD,...) with AVX-512 support.");
AddFilter(new AnyCategoriesFilter(["avx512", "avx", "sse", "scalar", "runtime"]));
}
else if (Avx2.IsSupported)
{
#pragma warning disable CA1303
Console.WriteLine("X64 system detected (Intel, AMD,...) with AVX2 support.");
AddFilter(new AnyCategoriesFilter(["avx", "sse", "scalar", "runtime"]));
}
else if (Ssse3.IsSupported)
{
#pragma warning disable CA1303
Console.WriteLine("X64 system detected (Intel, AMD,...) with SSSE3 support.");
AddFilter(new AnyCategoriesFilter(["sse", "scalar", "runtime"]));
}
else
{
#pragma warning disable CA1303
Console.WriteLine("X64 system detected (Intel, AMD,...) without relevant SIMD support.");
AddFilter(new AnyCategoriesFilter(["scalar", "runtime"]));
}
@@ -130,14 +139,15 @@ public Config()
@"data/thai.utf8.txt",
@"data/turkish.utf8.txt",
@"data/vietnamese.utf8.txt")]
#pragma warning disable CA1051
public string? FileName;
public byte[] allLinesUtf8 = new byte[0];
private byte[] allLinesUtf8 = Array.Empty<byte>();


public unsafe delegate byte* Utf8ValidationFunction(byte* pUtf8, int length);
public unsafe delegate byte* DotnetRuntimeUtf8ValidationFunction(byte* pUtf8, int length, out int utf16CodeUnitCountAdjustment, out int scalarCountAdjustment);

public void RunUtf8ValidationBenchmark(byte[] data, Utf8ValidationFunction validationFunction)
private void RunUtf8ValidationBenchmark(byte[] data, Utf8ValidationFunction validationFunction)
{
unsafe
{
@@ -146,13 +156,13 @@ public void RunUtf8ValidationBenchmark(byte[] data, Utf8ValidationFunction valid
var res = validationFunction(pUtf8, data.Length);
if (res != pUtf8 + data.Length)
{
throw new Exception("Invalid UTF-8: I expected the pointer to be at the end of the buffer.");
throw new ArgumentException("Invalid UTF-8: I expected the pointer to be at the end of the buffer.");
}
}
}
}

public void RunDotnetRuntimeUtf8ValidationBenchmark(byte[] data, DotnetRuntimeUtf8ValidationFunction validationFunction)
private void RunDotnetRuntimeUtf8ValidationBenchmark(byte[] data, DotnetRuntimeUtf8ValidationFunction validationFunction)
{
unsafe
{
@@ -183,20 +193,17 @@ public unsafe void SIMDUtf8ValidationRealData()
{
if (allLinesUtf8 != null)
{
// RunUtf8ValidationBenchmark(allLinesUtf8, SimdUnicode.UTF8.GetPointerToFirstInvalidByte);
RunUtf8ValidationBenchmark(allLinesUtf8, (byte* pInputBuffer, int inputLength) =>
{
int dummyUtf16CodeUnitCountAdjustment, dummyScalarCountAdjustment;
// Call the method with additional out parameters within the lambda.
// You must handle these additional out parameters inside the lambda, as they cannot be passed back through the delegate.
return SimdUnicode.UTF8.GetPointerToFirstInvalidByte(pInputBuffer, inputLength, out dummyUtf16CodeUnitCountAdjustment, out dummyScalarCountAdjustment);
});
}
}

[Benchmark]
// [BenchmarkCategory("scalar")]
// public unsafe void Utf8ValidationRealDataScalar()
// {
// if (allLinesUtf8 != null)
// {
// RunUtf8ValidationBenchmark(allLinesUtf8, SimdUnicode.UTF8.GetPointerToFirstInvalidByteScalar);
// }
// }

[BenchmarkCategory("scalar")]
public unsafe void Utf8ValidationRealDataScalar()
{
@@ -213,45 +220,48 @@ public unsafe void Utf8ValidationRealDataScalar()
}
}


[Benchmark]
[BenchmarkCategory("arm64")]
public unsafe void SIMDUtf8ValidationRealDataArm64()
{
if (allLinesUtf8 != null)
{
RunUtf8ValidationBenchmark(allLinesUtf8, SimdUnicode.UTF8.GetPointerToFirstInvalidByteArm64);
RunUtf8ValidationBenchmark(allLinesUtf8, (byte* pInputBuffer, int inputLength) =>
{
int dummyUtf16CodeUnitCountAdjustment, dummyScalarCountAdjustment;
// Call the method with additional out parameters within the lambda.
// You must handle these additional out parameters inside the lambda, as they cannot be passed back through the delegate.
return SimdUnicode.UTF8.GetPointerToFirstInvalidByteArm64(pInputBuffer, inputLength, out dummyUtf16CodeUnitCountAdjustment, out dummyScalarCountAdjustment);
});
}

}
// [Benchmark]
// [BenchmarkCategory("avx")]
// public unsafe void SIMDUtf8ValidationRealDataAvx2()
// {
// if (allLinesUtf8 != null)
// {
// RunUtf8ValidationBenchmark(allLinesUtf8, SimdUnicode.UTF8.GetPointerToFirstInvalidByteAvx2);
// }
// }

[Benchmark]
[BenchmarkCategory("sse")]
public unsafe void SIMDUtf8ValidationRealDataSse()
[BenchmarkCategory("avx")]
public unsafe void SIMDUtf8ValidationRealDataAvx2()
{
if (allLinesUtf8 != null)
{
RunUtf8ValidationBenchmark(allLinesUtf8, SimdUnicode.UTF8.GetPointerToFirstInvalidByteSse);
RunUtf8ValidationBenchmark(allLinesUtf8, (byte* pInputBuffer, int inputLength) =>
{
int dummyUtf16CodeUnitCountAdjustment, dummyScalarCountAdjustment;
// Call the method with additional out parameters within the lambda.
// You must handle these additional out parameters inside the lambda, as they cannot be passed back through the delegate.
return SimdUnicode.UTF8.GetPointerToFirstInvalidByteAvx2(pInputBuffer, inputLength, out dummyUtf16CodeUnitCountAdjustment, out dummyScalarCountAdjustment);
});
}
}
/*
// TODO: enable this benchmark when the AVX-512 implementation is ready

[Benchmark]
[BenchmarkCategory("avx512")]
public unsafe void SIMDUtf8ValidationRealDataAvx512()
[BenchmarkCategory("sse")]
public unsafe void SIMDUtf8ValidationRealDataSse()
{
if (allLinesUtf8 != null)
{
RunUtf8ValidationBenchmark(allLinesUtf8, SimdUnicode.UTF8.GetPointerToFirstInvalidByteAvx512);
RunUtf8ValidationBenchmark(allLinesUtf8, SimdUnicode.UTF8.GetPointerToFirstInvalidByteSse);
}
}*/
}

}
public class Program
1 change: 1 addition & 0 deletions src/Ascii.cs
@@ -25,6 +25,7 @@ public unsafe static class Ascii

public static bool IsAscii(this string s)
{
if (s == null) return true;
foreach (var c in s)
{
if (!c.IsAscii()) return false;