Not bad.
Daniel Lemire committed Feb 23, 2024
1 parent cb9bee8 commit 17c224d
Showing 10 changed files with 353 additions and 1,125 deletions.
57 changes: 17 additions & 40 deletions README.md
@@ -3,29 +3,21 @@

This is a fast C# library to process Unicode strings.

*It is currently not meant to be usable.*

## Motivation

The most important immediate goal would be to speed up the
`Utf8Utility.GetPointerToFirstInvalidByte` function.
We seek to speed up the `Utf8Utility.GetPointerToFirstInvalidByte` function, using the algorithm adopted by Node.js, Oracle GraalVM and other important systems.

https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs


(We may need to speed up `Ascii.GetIndexOfFirstNonAsciiByte` first, see issue https://github.com/simdutf/SimdUnicode/issues/1.)

The question is whether we could do it using this routine:
- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021

* John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
The function is private in the Runtime, but we can expose it manually.

Our generic implementation is available there: https://github.com/simdutf/simdutf/blob/master/src/generic/utf8_validation/utf8_lookup4_algorithm.h
https://github.com/dotnet/runtime/blob/4d709cd12269fcbb3d0fccfb2515541944475954/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs
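
To make the goal concrete, here is a minimal scalar sketch of what a "first invalid byte" routine computes, written with the public `Rune.DecodeFromUtf8` API. The name `IndexOfFirstInvalidByte` and the convention of returning the input length when everything is valid are assumptions made for illustration. This is neither the runtime's implementation nor the SIMD algorithm from the paper; it only serves as a slow reference to compare against.

```C#
using System;
using System.Buffers;
using System.Text;

// Hypothetical reference routine (not part of the runtime or of this library):
// returns the index of the first byte that does not start a well-formed UTF-8
// sequence, or utf8.Length when the whole input is valid.
static int IndexOfFirstInvalidByte(ReadOnlySpan<byte> utf8)
{
    int pos = 0;
    while (pos < utf8.Length)
    {
        // Decode one scalar value; anything other than Done means the sequence
        // starting at pos is malformed or truncated.
        OperationStatus status = Rune.DecodeFromUtf8(utf8.Slice(pos), out _, out int consumed);
        if (status != OperationStatus.Done)
        {
            return pos;
        }
        pos += consumed;
    }
    return utf8.Length;
}

ReadOnlySpan<byte> valid = new byte[] { 0x68, 0xC3, 0xA9, 0x6C };   // "hél"
ReadOnlySpan<byte> invalid = new byte[] { 0x61, 0xFF, 0x62 };       // 0xFF never appears in UTF-8
Console.WriteLine(IndexOfFirstInvalidByte(valid));
Console.WriteLine(IndexOfFirstInvalidByte(invalid));
```

A routine like this is convenient in tests: a faster implementation can be checked against it on random and hand-crafted inputs.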

Porting it to C# is no joke, but doable.

## Requirements

We recommend you install .NET 7: https://dotnet.microsoft.com/en-us/download/dotnet/7.0
We recommend you install .NET 8: https://dotnet.microsoft.com/en-us/download/dotnet/8.0


## Running tests
@@ -62,33 +54,6 @@ cd benchmark
sudo dotnet run -c Release
```

Still under macOS or Linux, you can change the filter parameter to narrow down the benchmarks you'd like to run:

```
cd benchmark
sudo dotnet run -c Release --filter *RealData*
```

To get a list of all available tests you may enter:

```
cd benchmark
sudo dotnet run -c Release --list flat
```

To get a prettier list in tree format, you may enter:

```
cd benchmark
sudo dotnet run -c Release --list tree
```

To run all benchmarks, you may enter:

```
sudo dotnet run -c Release runall
```

## Building the library

```
@@ -105,6 +70,18 @@ cd test
dotnet format
```

## Programming tips

You can print the content of a vector register like so:

```C#
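// Requires: using System; using System.Runtime.Intrinsics;
public static void ToString(Vector256<byte> v)
{
    // Copy the 32 bytes of the register to the stack, then print them as hexadecimal.
    Span<byte> b = stackalloc byte[32];
    v.CopyTo(b);
    Console.WriteLine(Convert.ToHexString(b));
}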
```
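
As a usage sketch, the same idea can be tried in a small standalone program. Here `ToHex` is a hypothetical variant of the helper above that returns the string instead of printing it, which makes it easier to use in assertions; the name and the test value `0xAB` are assumptions for illustration.

```C#
using System;
using System.Runtime.Intrinsics;

// Hypothetical variant of the helper above: returns the hex string instead of printing it.
static string ToHex(Vector256<byte> v)
{
    Span<byte> b = stackalloc byte[32];
    v.CopyTo(b);
    return Convert.ToHexString(b);
}

// Broadcast 0xAB to all 32 byte lanes and dump the register.
Vector256<byte> ab = Vector256.Create((byte)0xAB);
Console.WriteLine(ToHex(ab)); // "AB" repeated 32 times (64 hex digits)
```

The bytes are printed in memory order, so the first two hex digits correspond to lane 0 of the vector.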

## More reading

2 changes: 1 addition & 1 deletion benchmark/ASCII_runtime.cs
@@ -19,7 +19,7 @@
//Vector128's by 16 (see:https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs,eb1e72a6f843c5a5)
// GetIndexOfFirstNonAsciiByte is no longer internal

namespace Competition
namespace DotnetRuntime
{
// I copy pasted this from: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/CompExactlyDependsOnAttribute.cs
// Use this attribute to indicate that a function should only be compiled into a Ready2Run